Making Coreference Resolution your bitch with OpenNLP 1.5.0

First things first: what is coreference resolution?

Coreference means that multiple expressions in a sentence or document refer to the same thing. OpenNLP contains a “linker” that analyzes the tokens of a sentence to identify which chunks of text refer to the same things (e.g., people, organizations, events).

Take, for example, the text “John drove to Judy’s house. He made her dinner.” Here John and He refer to one entity (John), while Judy and her refer to a second entity (Judy). Don’t expect OpenNLP to get this 100% correct. Even a simple example like this is a difficult problem.
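To make the target concrete, here is a minimal sketch of the grouping a perfect resolver would produce for that text. The chains are written out by hand and nothing here calls OpenNLP; the class and method names (CorefChainsExample, expectedChains, format) are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: hand-written chains, no OpenNLP involved.
final class CorefChainsExample {

    // The ideal grouping for "John drove to Judy's house. He made her dinner."
    static Map<String, List<String>> expectedChains() {
        Map<String, List<String>> chains = new LinkedHashMap<>();
        chains.put("John", Arrays.asList("John", "He"));
        chains.put("Judy", Arrays.asList("Judy", "her"));
        return chains;
    }

    // Renders one chain as bracketed mentions separated by " , ".
    static String format(List<String> mentions) {
        StringBuilder sb = new StringBuilder();
        for (String mention : mentions) {
            if (sb.length() > 0) {
                sb.append(" , ");
            }
            sb.append('[').append(mention).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (List<String> chain : expectedChains().values()) {
            System.out.println(format(chain));
        }
    }
}
```

Each map entry is one entity, and its list holds every mention that should be linked to it.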

Picking up where I left off once upon a time (and finally wrapping up this series), here are links to the old material:

Getting Started

Model Files

Coreference resolution uses a folder of pre-trained model files, and you will need all of them. They can be found at http://opennlp.sourceforge.net/models-1.4/english/coref/. Be careful when downloading these: on my MacBook, OS X tries to add an incorrect “.txt” extension to several of the files.

I placed these files in lib/opennlp/coref and pass that path directly to the Linker constructor, as you’ll see below.
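If you want to double-check the download, a sketch like the following can list files in the model folder that picked up a “.txt” suffix. It assumes that no file in the folder is supposed to end in “.txt”, which I have not verified against the full file list, so treat the output as a list of suspects rather than a verdict. CorefModelDirCheck and its methods are hypothetical helpers, not part of OpenNLP.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of OpenNLP.
final class CorefModelDirCheck {

    // Returns the names of regular files in dir that end in ".txt" (case-insensitive).
    static List<String> findTxtFiles(File dir) {
        List<String> suspects = new ArrayList<>();
        File[] files = dir.listFiles(); // null if dir is missing or not a directory
        if (files == null) {
            return suspects;
        }
        for (File f : files) {
            if (f.isFile() && f.getName().toLowerCase(Locale.ROOT).endsWith(".txt")) {
                suspects.add(f.getName());
            }
        }
        return suspects;
    }

    // Drops one trailing ".txt" from a file name; other names are returned unchanged.
    static String stripTxtSuffix(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".txt")
                ? name.substring(0, name.length() - 4)
                : name;
    }

    public static void main(String[] args) {
        // Default path matches the folder used in this post; pass another as the first argument.
        File dir = new File(args.length > 0 ? args[0] : "lib/opennlp/coref");
        if (!dir.isDirectory()) {
            System.err.println("Model folder not found: " + dir.getPath());
            return;
        }
        for (String name : findTxtFiles(dir)) {
            System.out.println("Suspicious: " + name + " -> " + stripTxtSuffix(name));
        }
    }
}
```

The sketch only reports names and leaves the renaming to you, since it cannot know which files were actually affected.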

WordNet Dictionary

You’ll also need the database files for the WordNet 3.0 dictionary. You can find them at http://wordnet.princeton.edu/wordnet/download/current-version

Download and extract the files for the dict folder, then specify its location as a Java VM argument:

-DWNSEARCHDIR=path/to/wordnet/dict
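Since a missing or mistyped property is easy to overlook, it may help to check it before creating the linker. The sketch below only reads the WNSEARCHDIR system property and tests whether it names a directory; it does not inspect the WordNet files themselves. WordNetDirCheck and describeProblem are hypothetical names, not OpenNLP API.

```java
import java.io.File;
import java.util.Optional;

// Hypothetical helper, not part of OpenNLP.
final class WordNetDirCheck {

    // Returns a description of the problem, or an empty Optional if the value looks usable.
    static Optional<String> describeProblem(String wnSearchDir) {
        if (wnSearchDir == null || wnSearchDir.trim().isEmpty()) {
            return Optional.of("WNSEARCHDIR is not set; pass -DWNSEARCHDIR=path/to/wordnet/dict");
        }
        if (!new File(wnSearchDir).isDirectory()) {
            return Optional.of("WNSEARCHDIR does not point to a directory: " + wnSearchDir);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Reads the same property that the -D argument above sets.
        Optional<String> problem = describeProblem(System.getProperty("WNSEARCHDIR"));
        System.out.println(problem.orElse("WNSEARCHDIR looks usable"));
    }
}
```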

Coreference Linker

Initializing the coreference Linker object is pretty straightforward. Point it to the model folder and specify the LinkerMode. I’m not sure why, but I was only able to get it to work as expected when I used LinkerMode.TEST.

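// Imports needed by this snippet
import java.io.IOException;

import opennlp.tools.coref.DefaultLinker;
import opennlp.tools.coref.Linker;
import opennlp.tools.coref.LinkerMode;

Linker _linker = null;

try {
   // coreference resolution linker
   _linker = new DefaultLinker(
         // LinkerMode should be TEST
         // Note: I tried LinkerMode.EVAL for a long time
         // before realizing that this was the problem
         "lib/opennlp/coref", LinkerMode.TEST);
} catch (final IOException ioe) {
   ioe.printStackTrace();
}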

Using the Linker to extract entity mentions is pretty tricky. I had to dig into the OpenNLP source code to get this to work.

Below is a helper function that I created to handle finding the entity mentions. It makes use of the OpenNLP Parser through the parseSentence call (see my post on the Parser for that helper method). First, each sentence parse is used to identify entity mentions. These Mention objects contain information about the entity references.

Any Mention that does not have a corresponding Parse must have one created and set before the Mention objects are passed to the _linker.getEntities method, which returns an array of DiscourseEntity objects.

Good luck:

public DiscourseEntity[] findEntityMentions(final String[] sentences,
      final String[][] tokens) {
   // tokens should correspond to sentences
   assert(sentences.length == tokens.length);
   
   // list of document mentions
   final List<Mention> document = new ArrayList<Mention>();

   for (int i=0; i < sentences.length; i++) {
      // generate the sentence parse tree
      final Parse parse = parseSentence(sentences[i], tokens[i]);
      
      final DefaultParse parseWrapper = new DefaultParse(parse, i);
      final Mention[] extents = _linker.getMentionFinder().getMentions(parseWrapper);

      //Note: taken from TreebankParser source...
      for (int ei=0, en=extents.length; ei<en; ei++) {
         // construct parses for mentions which don't have constituents
         if (extents[ei].getParse() == null) {
            // not sure how to get head index, but it doesn't seem to be used at this point
            final Parse snp = new Parse(parse.getText(), 
                  extents[ei].getSpan(), "NML", 1.0, 0);
            parse.insert(snp);
            // setting a new Parse for the current extent
            extents[ei].setParse(new DefaultParse(snp, i));
         }
      }
      document.addAll(Arrays.asList(extents));
   }

   if (!document.isEmpty()) {
      return _linker.getEntities(document.toArray(new Mention[0]));
   }
   return new DiscourseEntity[0];
}

Example

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

And here are the groups of entities that OpenNLP will extract, including (for some reason) the trailing space. Unfortunately, they aren’t very accurate:

  • [this British industrial conglomerate ]
  • [a nonexecutive director ] , [chairman ] , [former chairman ] , [a director ]
  • [Consolidated Gold Fields PLC ]
  • [55 years ]
  • [Rudolph Agnew ]
  • [Elsevier N.V. ] , [the Dutch publishing group ]
  • [Pierre Vinken ] , [Mr. Vinken ]
  • [Nov. 29 ]
  • [the board ]
  • [61 years ]

Update: If using ParserTool.parseLine (as described here) instead of the procedure described in my post on the Parser, I get the following results instead (including additional trailing punctuation in entity names and what appears to be even worse entity resolution):

  • [this British industrial conglomerate. ]
  • [chairman ]
  • [former chairman ]
  • [a director ]
  • [Consolidated Gold Fields PLC, ]
  • [55 years ]
  • [Rudolph Agnew, ]
  • [Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, ]
  • [the Dutch publishing group. ]
  • [Elsevier N.V. ]
  • [Mr. Vinken ]
  • [a nonexecutive director Nov. ]
  • [the board ]
  • [Pierre Vinken, 61 years ]
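One likely explanation for the trailing punctuation comes from a comment below: ParserTool.parseLine reportedly expects its input to be pre-tokenized, with tokens separated by spaces. The following sketch does not call OpenNLP at all. It uses a plain StringTokenizer as a stand-in for whitespace-only splitting, to show why “Agnew,” stays glued together unless the tokens are joined with spaces first. ParseLineInput and its methods are hypothetical names, and the shortened sentence is my own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration; it does not call OpenNLP.
final class ParseLineInput {

    // Joins already-tokenized words with single spaces, e.g. "Agnew , 55".
    static String toParseLineInput(String[] tokens) {
        return String.join(" ", tokens);
    }

    // Splits on whitespace only. This is an assumption about what a
    // whitespace-based splitter sees, not a copy of parseLine itself.
    static List<String> whitespaceSplit(String line) {
        List<String> pieces = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            pieces.add(st.nextToken());
        }
        return pieces;
    }

    public static void main(String[] args) {
        String raw = "Rudolph Agnew, 55 years old, was named a director.";
        String[] tokens = {"Rudolph", "Agnew", ",", "55", "years", "old", ",",
                "was", "named", "a", "director", "."};

        // Raw text: punctuation remains attached to the preceding word.
        System.out.println(whitespaceSplit(raw));
        // Pre-tokenized text: punctuation becomes its own token.
        System.out.println(whitespaceSplit(toParseLineInput(tokens)));
    }
}
```

If you already have the String[][] tokens that findEntityMentions takes, joining each row this way before handing it to parseLine seems worth trying.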

Comments

  1. Thanks very much for posting this, Dave. I have searched high and low and I think yours is the only example of the OpenNLP coref tool on the entire Internet. Even the OpenNLP website has nothing.

    I noticed a couple things with the code:
    1 – In the first code block, modelIn is never used
    2 – In the second code block, there is a call to parseSentence() that is not defined; it would be great if you could post this method too.

    I’m not surprised that you found the results to be inaccurate; I just did some testing with the Stanford coref library (which is supposedly the best), and the results were pretty horrible: about half of them were wrong. Also, the Stanford library required about 1 gig of memory.

    • Please ignore point #2; I didn’t see the link you had posted:
      http://blog.dpdearing.com/2011/12/how-to-use-the-opennlp-1-5-0-parser/

    • Good catch on not using the model. Oops! That’s a copy+paste artifact from when I initialized the other OpenNLP objects. I’ll update the post (and clarify the part about parseSentence).

      I’ve intended to dig into the Stanford NLP library for a while now, but just haven’t made the time. At least it doesn’t sound like I’m missing out on a lot! :) I’m not sure if I noted this in any of my posts, but I also upped the memory when running OpenNLP to 1GB. I’m not sure that it actually needs that much, but the default certainly doesn’t cut it.

      Thanks for the feedback, John!

  2. Would you please send your code to me by email? Thank you very much. I am working on something about coreference resolution now, but I really do not know how to get started with the tools. Looking forward to your reply.

    • Sorry, but no. You should be able to put it together from the examples on my blog (see the links at the top of this post). Pretty much all of my example code is included or described in those posts.

  3. Dave thank you very much. I have been able to get everything working with little difficulty. I am currently working on creating a UIMA annotator version similar to the ones provided in the UIMA-OpenNLP distribution. As a follow up question, what does it mean to have a DiscourseEntity with only one mention? As I understand it, a DiscourseEntity should have a set of Mentions greater than 1 where say multiple pronouns would refer to a single proper noun?

    • Hi Paul,

      My understanding is that you will get a DiscourseEntity for every entity mention, whether it is mentioned once or multiple times. If an entity is only mentioned once, it still exists in the text but doesn’t have multiple references. I think “the board” entity in the example at the end of the post fits this scenario.

  4. Hello Paul. Do you have a similar reference that explains what the OpenNLP “similarity” metric is and what it is useful for?

  5. Hi Dave,

    As we’ve discussed through StackOverflow (http://stackoverflow.com/questions/8629737/coreference-resolution-using-opennlp/13750274?noredirect=1#comment40429091_13750274),

    The ParserTool.parseLine function expects a sentence with its words tokenized, separated by space, like this “I , the king , work there ( you know ) .”
    You can check the source code of parseLine to see that it only uses StringTokenizer (and actually applies a parenthesis-separating regex, so the above sentence can just as well be written as “I , the king , work there (you know) .”)

    Regarding the use of NameFinder before coreference, it’s mentioned by one of its developers here:
    http://grokbase.com/t/opennlp/dev/126dbtq7ec/how-to-work-with-coreference-resolutions#2012061443ejxxhqzpqzpxamchhgq5ktrm
