Getting started with OpenNLP 1.5.0 – Sentence Detection and Tokenizing

OpenNLP is a poorly documented pain in the ass to figure out.  There are various resources scattered around the internet, but none of them is particularly thorough, accurate, or up to date.

The most useful one I’ve found is a blog post called Getting started with OpenNLP (Natural Language Processing), but it is over 4 years old and refers to version 1.4.3 (1.5.x is what I’ll discuss here).  That post is quite helpful, but I still had to dig into the source code to figure out the beast that is coreference resolution.

Here’s hoping that I can add a few posts to the conversation and help both myself and, perhaps, others…

  • Most (if not all) of the more advanced OpenNLP components rely on text that is broken into sentences and/or tokens, so I’m starting with those…