How to use the OpenNLP 1.5.0 Parser

After a brief (*cough*cough*) delay, I’m back to figure out how in the world to use this OpenNLP Parser. First, a quick refresher:

Getting Started

I’m only going to warn you once: this is a long post. Go grab a beer or a glass of wine or some coffee before starting. It’s long. Now I’ve warned you twice.

model files

Only one additional model file is needed for parsing (which also seems to include noun phrase chunking). That said, you don’t need to know how to do any noun phrase chunking on your own.

  • en-parser-chunking.bin

As with all of the model files, it can be found at http://opennlp.sourceforge.net/models-1.5/, where the files are identified by language and component. There’s no info provided on this one, but I’m guessing that it was also trained on the CoNLL 2000 shared task data (as is en-chunker.bin, which is used for noun phrase chunking).

I use maven, so these files go into src/main/resources and are loaded with getResourceAsStream, as you’ll see below.

Parsing

So what is Parsing? The Parser page on the OpenNLP SourceForge wiki defines the Parser as:

TODO: Write an introduction for the parser.

Well that’s extremely helpful. Those guys sure do a great job over there. What it actually does is take a sentence like this:

The quick brown fox jumps over the lazy dog.

and turns it into a parse tree with part-of-speech tags that looks like this:

(TOP (NP (NP (DT The) (JJ quick) (JJ brown) (NN fox) (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))(. .)))
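If the bracket soup is hard to read, here is a minimal sketch of a reader for that notation, using only the Java standard library. It is not part of OpenNLP: the class name BracketedTree and its methods are made up for illustration, and it assumes well-formed input where every leaf looks like (TAG word).

```java
import java.util.ArrayList;
import java.util.List;

public class BracketedTree {
    public final String label;   // constituent label or POS tag, e.g. "NP" or "DT"
    public final String word;    // the token for a leaf like (DT The); null for inner nodes
    public final List<BracketedTree> children = new ArrayList<>();

    private BracketedTree(String label, String word) {
        this.label = label;
        this.word = word;
    }

    /** Reads one bracketed tree such as "(NP (DT the) (NN dog))". */
    public static BracketedTree read(String text) {
        return readNode(text, new int[] {0});
    }

    // pos is a one-element array so the cursor is shared across recursive calls
    private static BracketedTree readNode(String s, int[] pos) {
        skipSpaces(s, pos);
        expect(s, pos, '(');
        String label = readSymbol(s, pos);
        skipSpaces(s, pos);
        BracketedTree node;
        if (peek(s, pos) == '(') {
            // inner node: one or more child constituents
            node = new BracketedTree(label, null);
            while (peek(s, pos) == '(') {
                node.children.add(readNode(s, pos));
                skipSpaces(s, pos);
            }
        } else {
            // leaf: the tag is followed directly by the token
            node = new BracketedTree(label, readSymbol(s, pos));
            skipSpaces(s, pos);
        }
        expect(s, pos, ')');
        return node;
    }

    private static char peek(String s, int[] pos) {
        return pos[0] < s.length() ? s.charAt(pos[0]) : '\0';
    }

    private static void expect(String s, int[] pos, char c) {
        if (peek(s, pos) != c) {
            throw new IllegalArgumentException("expected '" + c + "' at offset " + pos[0]);
        }
        pos[0]++;
    }

    private static void skipSpaces(String s, int[] pos) {
        while (peek(s, pos) == ' ') {
            pos[0]++;
        }
    }

    private static String readSymbol(String s, int[] pos) {
        int start = pos[0];
        while (pos[0] < s.length() && " ()".indexOf(s.charAt(pos[0])) < 0) {
            pos[0]++;
        }
        if (start == pos[0]) {
            throw new IllegalArgumentException("expected a symbol at offset " + start);
        }
        return s.substring(start, pos[0]);
    }

    /** Collects the leaves in sentence order as "word/TAG". */
    public List<String> taggedTokens() {
        List<String> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(BracketedTree node, List<String> out) {
        if (node.word != null) {
            out.add(node.word + "/" + node.label);
        }
        for (BracketedTree child : node.children) {
            collect(child, out);
        }
    }
}
```

Calling BracketedTree.read(...) on the tree above and then taggedTokens() walks the leaves left to right, which makes it easy to see that "jumps" was tagged NNS rather than as a verb.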

Creating a Parse object

First, for some silly reason, you need to create your own Parse object. Yes, before parsing you create a Parse object. Strange, no?

Update: As iosu notes in the comments, all of this logic to create a Parse object could be replaced with a simple call to ParserTool.parseLine(sentence, _parser, 1) after initializing _parser as shown below.

However, I’ve noticed that the resulting parse does not have punctuation separately tokenized (i.e., in the example parse tree above, (NN dog) is now (NN dog.)) which leads to some differences during Coreference Resolution.
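The fused (NN dog.) is consistent with parseLine splitting its input on whitespace only. That is an assumption here, not something the post or the docs confirm. If it holds, one possible workaround is to run your own tokenizer first and hand parseLine the tokens joined by single spaces. A minimal sketch of that pre-processing step, with hypothetical helper names and no OpenNLP calls:

```java
public class PreTokenize {
    /** Joins already-separated tokens with single spaces, e.g. {"dog", "."} becomes "dog ." */
    public static String joinTokens(String[] tokens) {
        return String.join(" ", tokens);
    }

    /** Splits on whitespace only; this mimics the behaviour ASSUMED of parseLine. */
    public static String[] whitespaceSplit(String line) {
        return line.trim().split("\\s+");
    }
}
```

With this, whitespaceSplit("the lazy dog.") yields three tokens with the period still attached, while whitespaceSplit(joinTokens(tokens)) gives back the four tokens you started with when tokens is {"the", "lazy", "dog", "."}. Whether the resulting parse then matches the hand-built Parse closely enough for coreference resolution is untested here.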

This code uses the _tokenizer, so make sure that you’ve already tackled sentence detection and tokenization before proceeding.

No really, go read that link. I’m not fucking around.

Done? OK, here’s how to create your own Parse from an array of tokens:

Update: Thanks to a comment by Jonathan Huts, I’ve simplified the following code to use the Tokenizer’s tokenizePos method, which will save you from manually creating the individual token spans.
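
import opennlp.tools.parser.AbstractBottomUpParser;
import opennlp.tools.parser.Parse;
import opennlp.tools.util.Span;

private Parse parseSentence(final String text) {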
   final Parse p = new Parse(text,
         // a new span covering the entire text
         new Span(0, text.length()),
         // the label for the top if an incomplete node
         AbstractBottomUpParser.INC_NODE,
         // the probability of this parse...uhhh...? 
         1,
         // the token index of the head of this parse
         0);

   // make sure to initialize the _tokenizer correctly
   final Span[] spans = _tokenizer.tokenizePos(text);

   for (int idx=0; idx < spans.length; idx++) {
      final Span span = spans[idx];
      // flesh out the parse with individual token sub-parses 
      p.insert(new Parse(text,
            span,
            AbstractBottomUpParser.TOK_NODE, 
            0,
            idx));
   }

   // hand the token-level skeleton to the parser and return the finished parse
   return parse(p);
}
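In case the spans are the confusing part: each one is a pair of character offsets into the original text, start inclusive and end exclusive, and the idx passed to each token Parse is just that token's position in the array. The sketch below illustrates the idea with a crude stand-in tokenizer written for this example. It is not OpenNLP's tokenizer and will not split text the same way; TokenSpans and tokenSpans are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSpans {
    /**
     * Returns {start, end} character offsets for each token, end exclusive.
     * Runs of letters/digits form one token; any other non-space character
     * becomes a one-character token of its own.
     */
    public static List<int[]> tokenSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (Character.isLetterOrDigit(c)) {
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
            } else {
                i++; // punctuation stands alone
            }
            spans.add(new int[] {start, i});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "The lazy dog.";
        List<int[]> spans = tokenSpans(text);
        for (int idx = 0; idx < spans.size(); idx++) {
            int[] s = spans.get(idx);
            // text.substring(start, end) recovers the token the span covers
            System.out.println(idx + ": [" + s[0] + "," + s[1] + ") " + text.substring(s[0], s[1]));
        }
    }
}
```

The point to notice is that the period gets its own span, which is what lets the hand-built Parse keep punctuation as a separate token.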

Still with me? I’m impressed. Go get a refill on whatever you’re drinking (you are drinking, right?). We’re almost done!

Parsing a Parse

Now that you’ve actually created a Parse object you can…well…parse it! Watch the magic unfold:

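import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;

private Parser _parser = null;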

private Parse parse(final Parse p) {
   // lazy initializer
   if (_parser == null) {
      InputStream modelIn = null;
      try {
         // Loading the parser model
         modelIn = getClass().getResourceAsStream("/en-parser-chunking.bin");
         final ParserModel parseModel = new ParserModel(modelIn);
         modelIn.close();
         
         _parser = ParserFactory.create(parseModel);
      } catch (final IOException ioe) {
         ioe.printStackTrace();
      } finally {
         if (modelIn != null) {
            try {
               modelIn.close();
            } catch (final IOException e) {} // oh well!
         }
      }
   }
   return _parser.parse(p);
}
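One gotcha with loading models this way: getResourceAsStream returns null rather than throwing when the file isn't on the classpath, so a misspelled model name only surfaces later as a confusing failure. A minimal sketch of a guard, using only the standard library; ModelResources and openRequired are hypothetical names, not OpenNLP API:

```java
import java.io.FileNotFoundException;
import java.io.InputStream;

public class ModelResources {
    /**
     * Opens a classpath resource, failing loudly when it is absent.
     * Class.getResourceAsStream returns null (it does not throw) for a missing resource.
     */
    public static InputStream openRequired(Class<?> owner, String path)
            throws FileNotFoundException {
        InputStream in = owner.getResourceAsStream(path);
        if (in == null) {
            throw new FileNotFoundException("classpath resource not found: " + path);
        }
        return in;
    }
}
```

On Java 7 or later this pairs naturally with try-with-resources: open the stream with ModelResources.openRequired(getClass(), "/en-parser-chunking.bin") in the try header, build the ParserModel from it inside the block, and the stream is closed for you even if loading fails.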

That’s it! The actual parsing isn’t really any different from the other OpenNLP tools, but creating that initial Parse object isn’t exactly spelled out very clearly elsewhere.

Hope it helps! Drop a comment if you have any problems or just to give a shout-out!


Next Step: Making Coreference Resolution your bitch

My source code and test cases can be found at https://github.com/dpdearing/nlp


Comments

  1. Thanks for your post on OpenNLP–they are a welcome resource given the state of the project’s documentation… I’m trying to figure out what POS tagger the Parser uses by default. Do you know? I want to use the parser model which seems to work quite well, but I’d like to do my own tagging and pass in the tagged sentence to the Parser. I’m new to OpenNLP so am still trying to figure out if there is a way to do this.

  2. Hey Buddy,

    thanks for your very helpful articles on using OpenNLP.
    I’m still looking forward to the article called “Making Coreference Resolution with OpenNLP 1.5.0 your bitch” 🙂
    I noticed you didn’t blog anything in 2012… That’s pretty sad…
    Hope you will start writing your next article soon 😉

    Cheeeeeeeeeeeers

  3. Your OpenNLP examples have been really useful to me in getting started with the library, as understanding what pieces of the pipeline you need to get the results you’re after is not that clear when initially diving into the code and official docs. Though, to be fair, I think you can now piece useful code together from the examples on there, once you’re confident which pieces you need. I like to know I’m not missing a shortcut, basically.

    As far as you know, is there anything like POSTagger.topKSequences at the Parser level? If so, do you think any next-best parses would resolve “jumps” to VBZ instead of NNS? It’s curious that the top parse is a fragment when the example can be read as a full clause.

  4. Just a quick note: it’s much more efficient to use tokenizePos(), which returns a list of Span objects that have start and stop indices encoded within them, than tokenize(). That way, the calls to findTokenCharacterStart() can be eliminated.

  5. MUNI PRASHANTHI R
    January 30, 2013 - 9:01 pm

    Hi, I would like to know how to use the Apache OpenNLP software described in the developer documentation, which consists of different tools for machine learning tasks such as sentence segmentation, tokenization, and so on. Please tell me how to download it, how to install it, and how to run the commands in that application.
    I hope someone can mail me regarding this.
    thanks & regards
    MUNI PRASHANTHI R

  6. nice post! thanks for sharing!

    just one note, nowadays the Parse/span struggle can be addressed with a simple
    (opennlp.tools.cmdline.parser.ParserTool)

    ParserTool.parseLine(sentence,_parser,1)

    where sentence is.. ehm.. a sentence 🙂
    _parser is the parser created by a Factory
    and “1” is the number of parses

    cheers!

    • Thanks iosu! Never found that due to the lack of good OpenNLP documentation–at least, back when I was first attempting this. Hopefully (?) that’s changed for the better now.

      I’ve been meaning to revisit some of these posts to make a few updates. I’ll add this to my list!

      • yeah i agree, documentation was not the best part of opennlp… and still is not!
        i am struggling with coreference now but looks like it is never going to be my bitch 😉

    • iosu, I tried using parseLine as you suggested, but the resulting parse does not have punctuation separately tokenized, which leads to some changes during entity detection/coreference resolution. I’ve added an update to the post.

  7. Thanks for the post. But i need the parsed sentence as output in the form of string or something like that so that i can apply some string functions and manipulate data. Need help. !!

  8. Stumbled across this and found it very helpful. I was wondering if you knew whether it’s possible to provide the parser with your own POS tagged tokens instead of having the parser do everything?

  9. Just started with NLP (and openNLP in particular) and I found your posts really helpful!
    I would suggest another good website source: http://www.programcreek.com/2012/05/opennlp-tutorial/#parser.
    Here I found out that the parse code can be shrunk to a few lines without the need for multiple Parse objects, just like this:

    Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);

    for (Parse p : topParses)
        p.show();

  10. Just wanted to let you know I appreciate the humor, I’ve LOLed twice even before model file section! And thanks for the tutorial as well.

  11. Dear all,

    I know this might seem stupid but I need to store the shown string (from the parser p.show) in a string value as I need it to construct a syntax tree for further use.

    Regards

    Mark Spiteri

  12. hey ,
    I’m trying to parse a resume/CV. As a first step, I want to separate the different parts of my CV: personal information, education, skills, interests, and so on.
    So is it right to use the OpenNLP parse tree to make sure that the different parts are separated and that the text after each heading is its value?

    some help please .

    • I’m not sure if I really understand your question. You’d want to separate the text for each section first and then go through the NLP steps. The Parser won’t detect the sections for you.

