Getting started with OpenNLP 1.5.0 – Sentence Detection and Tokenizing

OpenNLP is a poorly-documented pain in the ass to figure out.  There are various scattered resources you can find on the internet, none of which are particularly thorough, accurate, or up to date.

The most useful one I’ve found is a blog post called Getting started with OpenNLP (Natural Language Processing), but it is over four years old and covers version 1.4.3 (1.5.x is what I’ll discuss here).  That post is quite helpful, but I still had to dig into the source code to figure out the beast that is coreference resolution.

Here’s to hoping that I can add a few posts to the conversation and help both myself and, perhaps, others…

Most (if not all) of the more advanced OpenNLP components rely on text that is broken into sentences and/or tokens, so I’m starting with those…

Getting Started

jar dependencies

You’ll need three jar files to get started, all of which can be found in the binary downloads at http://sourceforge.net/projects/opennlp/files/OpenNLP Tools/1.5.0/. After expanding the files, you’ll find opennlp-tools-1.5.0.jar and, under the lib folder, maxent-3.0.0.jar and jwnl-1.3.3.jar (the Java WordNet Library).

Update: OpenNLP has since moved to Apache, published Maven dependencies, and released version 1.6.0. You can look at the Maven Dependency page for up-to-date information. I am still using 1.5 with this Maven dependency:
    <dependency>
      <groupId>org.apache.opennlp</groupId>
      <artifactId>opennlp-tools</artifactId>
      <version>1.5.3</version>
    </dependency>

model files

You’ll probably want the pre-trained model files as a starting point (rather than creating/training your own). They can be found at http://opennlp.sourceforge.net/models-1.5/ and are identified by language and component. For this tutorial you’ll need:

  • en-sent.bin
  • en-token.bin

I use Maven, so I just drop these files into src/main/resources and load them with getResourceAsStream, as you’ll see below.

Sentence Detection

The Sentence Detector is actually described well on the OpenNLP SourceForge wiki, so I’ll just quote what’s there (errors theirs, emphasis mine):

The OpenNLP Sentence Detector can detect that a punctuation character marks the end of a sentence or not. In this sense a sentence is defined as the longest white space trimmed character sequence between two punctuation marks. The first and last sentence make an exception to this rule. The first non whitespace character is assumed to be the begin of a sentence, and the last non whitespace character is assumed to be a sentence end.

The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence in an article where the title is mistakenly identified to be the first part of the first sentence.

To the code!

import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

SentenceDetector _sentenceDetector = null;

InputStream modelIn = null;
try {
   // Load the sentence detection model from the classpath
   modelIn = getClass().getResourceAsStream("/en-sent.bin");
   final SentenceModel sentenceModel = new SentenceModel(modelIn);
   modelIn.close();

   _sentenceDetector = new SentenceDetectorME(sentenceModel);

} catch (final IOException ioe) {
   ioe.printStackTrace();
} finally {
   if (modelIn != null) {
      try {
         modelIn.close();
      } catch (final IOException e) {} // oh well!
   }
}

And then actually using the _sentenceDetector is simple:

_sentenceDetector.sentDetect(content);
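For example, here is a minimal sketch (the input text is my own, and `_sentenceDetector` is assumed to have been initialized as above):

```java
// Split a block of text into sentences; sentDetect returns one String per sentence
final String content = "Pierre Vinken is 61 years old. Mr. Vinken is chairman of Elsevier N.V.";
final String[] sentences = _sentenceDetector.sentDetect(content);
for (final String sentence : sentences) {
   System.out.println(sentence);
}
```

With the pre-trained en-sent.bin model, the periods in “Mr.” and “N.V.” should not be treated as sentence boundaries, so this should print two lines.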

Tokenizing

Once again, for these simple components the SourceForge documentation has a good description:

The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc.

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

Tokenizer _tokenizer = null;

InputStream modelIn = null;
try {
   // Load the tokenizer model from the classpath
   modelIn = getClass().getResourceAsStream("/en-token.bin");
   final TokenizerModel tokenModel = new TokenizerModel(modelIn);
   modelIn.close();

   _tokenizer = new TokenizerME(tokenModel);

} catch (final IOException ioe) {
   ioe.printStackTrace();
} finally {
   if (modelIn != null) {
      try {
         modelIn.close();
      } catch (final IOException e) {} // oh well!
   }
}

And then, once again, actually using the _tokenizer is simple:

_tokenizer.tokenize(sentence);
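Chaining the two components gives a minimal pipeline: detect sentences first, then tokenize each sentence individually. This sketch assumes `_sentenceDetector` and `_tokenizer` were initialized as shown above, and that `content` holds the raw input text:

```java
// Sentence detection first, then tokenization of each individual sentence
final String[] sentences = _sentenceDetector.sentDetect(content);
for (final String sentence : sentences) {
   final String[] tokens = _tokenizer.tokenize(sentence);
   System.out.println(java.util.Arrays.toString(tokens));
}
```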

Example

Here are the expected results for an example taken from those wiki pages, with an added fourth sentence to show off some edge cases.

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate. Those contraction-less sentences don't have boundary/odd cases...this one does.

Sentences:

  1. Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
  2. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
  3. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
  4. Those contraction-less sentences don't have boundary/odd cases...this one does.

Tokens:

  1. [Pierre] [Vinken] [,] [61] [years] [old] [,] [will] [join] [the] [board] [as] [a] [nonexecutive] [director] [Nov.] [29] [.]
  2. [Mr.] [Vinken] [is] [chairman] [of] [Elsevier] [N.V.] [,] [the] [Dutch] [publishing] [group] [.]
  3. [Rudolph] [Agnew] [,] [55] [years] [old] [and] [former] [chairman] [of] [Consolidated] [Gold] [Fields] [PLC] [,] [was] [named] [a] [director] [of] [this] [British] [industrial] [conglomerate] [.]
  4. [Those] [contraction-less] [sentences] [do] [n't] [have] [boundary/odd] [cases] [...this] [one] [does] [.]

Next Step: Part-of-Speech (POS) Tagging with OpenNLP 1.5.0

My source code and test cases can be found at https://github.com/dpdearing/nlp


Comments

  1. Hi,
    for this example:
    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate. Those contraction-less sentences don’t have boundary/odd cases…this one does.

    i expect this result:
    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov.
    29.
    Mr.
    Vinken is chairman of Elsevier N.V., the Dutch publishing group.
    Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
    Those contraction-less sentences don’t have boundary/odd cases…this one does.

    can you explain these results to me?

    many thanks

    • Sorry, but I don’t understand the difference. Did some formatting get lost in your comment/question?

      • thanks Dave,
        why is the expression “Nov. 29.” (an expression with a dot) not split? That is my question.
        best regards

        • OK, I think I understand. Are you asking why, when splitting into sentences, didn’t it split on the ‘.’ punctuation? “Nov.” is an abbreviation for November (and “Mr.” is an abbreviation) so those don’t actually mark the end of a sentence.

          I don’t know the specifics, but I’m sure it’s related to how the model files were trained. It’s smarter than just splitting on all punctuation and it seems to be working correctly to me.

  2. Hi,

    for the model files (en-token.bin & en-sent.bin), if I use Java in NetBeans, where should I put them? I’m having trouble because these files are not found by my IDE. Your help will be much appreciated, thank you.

    • Sorry akmal, but I don’t use NetBeans. You might try just putting them in the base directory of your project. Loading resource files is a pretty standard thing that you should be able to find help with on Google if you’re having trouble.

  3. the model files (en-token.bin & en-sent.bin), if I use Java in NetBeans, where should I put them? I’m having trouble because these files are not found by my IDE. Your help will be much appreciated, thank you.

    • I use maven, so putting the model files in src/main/resources adds them to my classpath and can be accessed with getResourceAsStream. Whatever your configuration is, you just need to be able to load those resource files into an InputStream. You could do that with hard-coded paths and normal file I/O if you really want (although, I wouldn’t recommend it).

      Google around and you’ll find an answer. This article about smartly loading your property files might be a good place to start.

  4. Excellent examples. Everything works as expected.
    Good work. Made my day. Thanks.

  5. Hi… I want to ask you regarding the output for the tokenizing part…
    —————————————————————————————-
    TokenizerModel tokenModel = new TokenizerModel(tokmodelIn);
    TokenizerME tokenizer = new TokenizerME(tokenModel);
    String token[] = tokenizer.tokenize(strLine);

    FileOutputStream fout = new FileOutputStream("D:/NetBeansProjects/my-app/tokenfile.txt");
    String newline = System.getProperty("line.separator");
    for (int i = 0; i < token.length; i++) {
       fout.write(((i + 1) + ") " + token[i] + newline).getBytes());
    }
    ———————————————————————————

    I only get the output like this..
    1) [Nowadays]
    2) [in]
    3) [a]
    4) [borderless]
    5) [world]

    not like yours…so, may I know how to get it?

    • Hi Xera,

      I don’t understand your question. It looks like you’ve successfully tokenized the sentence “Nowadays in a borderless world”

      • yes..I’ve successfully tokenized it..but I don’t want the output to be like this:
        1) [Nowadays]
        2) [in]
        3) [a]
        4) [borderless]
        5) [world]

        I prefer the output will be like :
        1) [Nowadays] [in] [a] [borderless] [world]

        • That’s part of your own fout.write code in your for-loop, not mine. You’re specifically adding i+1 and newline to each token.

  6. hey,
    I tried executing this code and got this exception, and I’m not able to figure it out. Can you help?
    Exception in thread “main” java.lang.ArrayIndexOutOfBoundsException: 1
    at opennlp.maxent.io.BinToAscii.main(BinToAscii.java:44)

    • I can’t offer much help from that information alone. Because of the BinToAscii class name, my best guess is that it’s a problem with the resource files.

      I suggest going through the ‘Getting Started’ section carefully and making sure that all of your resource files are accessible and loaded correctly.

  7. for the model files (en-token.bin & en-sent.bin), I downloaded these files in zip format from http://opennlp.sourceforge.net/models-1.5/
    I put them in the base directory of my project, but I’m having trouble because these files are not found by my IDE. Your help will be much appreciated, thank you.

  8. I was searching for information on the tokenizer’s behaviour on the edge cases you added in your 4th example.
    Thanks a lot !

  9. Eclipse is not able to find the file “en-sent.bin”.
    It gives FileNotFoundException.
    What might be the error?
    Please help!

  10. I have the same problem with Eclipse not being able to find the file “en-pos-maxent.bin”. I tried putting it in the Extension file, changing the classpath, putting it inside the project folder and even into the build path, and giving the full path. Nothing worked, and I always get the error: “POS tagger model file does not exist! Path: (the path I typed in)”
    And I looked through the comments above; still nothing seems to work for me…
    It’s been two days and I can’t solve it. Please help!

    • You only need the en-pos-maxent.bin if you’re doing part-of-speech tagging. If you don’t want to load the resources as an input stream like I do with getResourceAsStream, you can instead use the POSModel constructor that takes a file directly.

  11. Hi,

    I just want to ask where to place these open-nlp jar files in my java project while using eclipse. Thanks in advance.

  12. Very informative example. Thank you!

    Typically, which comes first, sentence boundary detection or tokenization and why?

    I am trying to build a pipeline and am trying to understand the effect of one versus the other in the pipeline.

    Thanks in Advance!

    • I do sentence detection first and then send the individual sentences to the tokenizer. You don’t pass tokens to the sentence detector, so it wouldn’t come second in a pipeline.

