Making Coreference Resolution your bitch with OpenNLP 1.5.0

First things first: what is coreference resolution?

Coreference means that multiple expressions in a sentence or document refer to the same thing. OpenNLP contains a “linker” that analyzes the tokens of a sentence to identify which chunks of text refer to the same things (e.g., people, organizations, events).

Take, for example, the text “John drove to Judy’s house. He made her dinner.” Here John and He refer to one entity (John), while Judy and her refer to a second entity (Judy). Don’t expect OpenNLP to get this 100% correct. Even a simple example like this is a difficult problem.
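
To make the idea concrete, here is a minimal sketch in plain Java (no OpenNLP involved) of what a resolver's answer looks like for that example: each entity is simply a group of mentions. The class name CorefChainsDemo and the hand-built groups are assumptions for illustration only. They represent the answer a perfect resolver would give, not anything OpenNLP computed.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CorefChainsDemo {

    // Hand-built chains for: "John drove to Judy's house. He made her dinner."
    static Map<String, List<String>> goldChains() {
        final Map<String, List<String>> chains = new LinkedHashMap<>();
        chains.put("John", Arrays.asList("John", "He"));
        chains.put("Judy", Arrays.asList("Judy", "her"));
        return chains;
    }

    // Returns the entity that a mention belongs to, or null if it is in no chain.
    static String entityOf(final Map<String, List<String>> chains, final String mention) {
        for (final Map.Entry<String, List<String>> entry : chains.entrySet()) {
            if (entry.getValue().contains(mention)) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(final String[] args) {
        // Print each entity followed by the mentions that refer to it
        for (final Map.Entry<String, List<String>> entry : goldChains().entrySet()) {
            System.out.println(entry.getKey() + " <- " + entry.getValue());
        }
    }
}
```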

Picking up where I left off once upon a time (and finally wrapping up this series), here are links to the old material:

Getting Started

Model Files

Coreference resolution uses a folder of pre-trained model files. You will need them all; I put mine in my local lib/opennlp/coref folder. All of the necessary model files can be found at Be careful when downloading these. On my MacBook, OS X tries to add an incorrect “.txt” extension to several of the files.

I placed these files in lib/opennlp/coref and pass that path directly to the Linker constructor, as you’ll see below.
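
Since the stray “.txt” extension is easy to miss, a quick check of the folder can save some head-scratching. This is only a sketch: the class name CorefModelDirCheck is made up, and it assumes that no file in the model folder should end in “.txt”, so it flags such files for review rather than renaming anything.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CorefModelDirCheck {

    // Returns the sorted names of files in dir that end in ".txt".
    static List<String> suspiciousTxtFiles(final Path dir) throws IOException {
        final List<String> hits = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.txt")) {
            for (final Path file : stream) {
                hits.add(file.getFileName().toString());
            }
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(final String[] args) throws IOException {
        // Default to the folder used in this post; pass another path as the first argument
        final Path dir = Paths.get(args.length > 0 ? args[0] : "lib/opennlp/coref");
        if (!Files.isDirectory(dir)) {
            System.err.println("Not a directory: " + dir);
            return;
        }
        for (final String name : suspiciousTxtFiles(dir)) {
            System.out.println("Check extension: " + name);
        }
    }
}
```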

WordNet Dictionary

You’ll also need the database files for the WordNet dictionary. You can find them at, but (as Samyak noted in the comments) the WordNet 3.0 “just database files” seems to be incomplete. Instead, get the dict folder from either the full source code and binaries or try the link for the WordNet 3.1 database files.

Once you have the necessary dict folder, specify its location as a Java VM argument:
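
As far as I know, the coreference code reads the WordNet location from the WNSEARCHDIR system property (passed as something like -DWNSEARCHDIR=lib/wordnet/dict). Treat both the property name and the path as assumptions to verify against your OpenNLP version. The sketch below, with a made-up class name WordNetDirCheck, checks that the property is set and that the files you expect are actually in the folder before the linker gets a chance to fail on them.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class WordNetDirCheck {

    // Assumed property name; verify it against your OpenNLP version
    static final String PROPERTY = "WNSEARCHDIR";

    // Returns the names from 'required' that are not regular files in dictDir.
    static List<String> missingFiles(final File dictDir, final String... required) {
        final List<String> missing = new ArrayList<>();
        for (final String name : required) {
            if (!new File(dictDir, name).isFile()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(final String[] args) {
        final String dir = System.getProperty(PROPERTY);
        if (dir == null) {
            System.err.println("-D" + PROPERTY + " is not set");
            return;
        }
        // adj.exc is one file that the incomplete WordNet 3.0 download is reported to lack;
        // extend the list with whatever other files your setup turns out to need
        System.out.println("Missing: " + missingFiles(new File(dir), "adj.exc"));
    }
}
```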


Coreference Linker

Initializing the coreference Linker object is pretty straightforward. Point it to the model folder and specify the LinkerMode. I’m not sure why, but I was only able to get it to work as expected when I used LinkerMode.TEST.

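import java.io.IOException;

import opennlp.tools.coref.DefaultLinker;
import opennlp.tools.coref.Linker;
import opennlp.tools.coref.LinkerMode;

Linker _linker = null;

try {
   // coreference resolution linker
   _linker = new DefaultLinker(
         // LinkerMode should be TEST
         // Note: I tried LinkerMode.EVAL for a long time
         // before realizing that this was the problem
         "lib/opennlp/coref", LinkerMode.TEST);
} catch (final IOException ioe) {
   // model files missing or unreadable; _linker stays null
   ioe.printStackTrace();
}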

Using the Linker to extract entity mentions is pretty tricky. I had to dig into the OpenNLP source code to get this to work.

Below is a helper function that I created to handle finding the entity mentions. It makes use of the OpenNLP Parser through the parseSentence helper method (see my post on the Parser). First, each sentence parse is used to identify entity mentions. These Mention objects contain information about the entity references.

The Mentions that do not have a corresponding Parse must have one created and set before passing those entity mention objects into the _linker.getEntities method, which returns an array of DiscourseEntity objects.

Good luck:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import opennlp.tools.coref.DiscourseEntity;
import opennlp.tools.coref.mention.DefaultParse;
import opennlp.tools.coref.mention.Mention;
import opennlp.tools.parser.Parse;

public DiscourseEntity[] findEntityMentions(final String[] sentences,
      final String[][] tokens) {
   // tokens should correspond to sentences
   assert(sentences.length == tokens.length);
   // list of document mentions
   final List<Mention> document = new ArrayList<Mention>();

   for (int i=0; i < sentences.length; i++) {
      // generate the sentence parse tree
      final Parse parse = parseSentence(sentences[i]);
      final DefaultParse parseWrapper = new DefaultParse(parse, i);
      final Mention[] extents = _linker.getMentionFinder().getMentions(parseWrapper);

      //Note: taken from TreebankParser source...
      for (int ei=0, en=extents.length; ei<en; ei++) {
         // construct parses for mentions which don't have constituents
         if (extents[ei].getParse() == null) {
            // not sure how to get head index, but it doesn't seem to be used at this point
            final Parse snp = new Parse(parse.getText(),
                  extents[ei].getSpan(), "NML", 1.0, 0);
            // setting a new Parse for the current extent
            extents[ei].setParse(new DefaultParse(snp, i));
         }
      }
      // collect this sentence's mentions; without this the document list stays empty
      document.addAll(Arrays.asList(extents));
   }

   if (!document.isEmpty()) {
      return _linker.getEntities(document.toArray(new Mention[0]));
   }
   return new DiscourseEntity[0];
}

Here is the sample text that I ran through it:

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.

And here are the groups of entities that OpenNLP will extract, including (for some reason) the trailing space. Unfortunately, they aren’t very accurate:

  • [this British industrial conglomerate ]
  • [a nonexecutive director ] , [chairman ] , [former chairman ] , [a director ]
  • [Consolidated Gold Fields PLC ]
  • [55 years ]
  • [Rudolph Agnew ]
  • [Elsevier N.V. ] , [the Dutch publishing group ]
  • [Pierre Vinken ] , [Mr. Vinken ]
  • [Nov. 29 ]
  • [the board ]
  • [61 years ]
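
If the trailing space bothers you, the mention text can be trimmed before printing. This is a rough sketch that assumes each entity's mentions have already been collected as plain strings; pulling them out of the DiscourseEntity objects is not shown here, and the class name MentionGroupFormatter is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MentionGroupFormatter {

    // Formats one entity's mentions as "[a], [b]", trimming surrounding whitespace.
    static String formatGroup(final List<String> mentions) {
        final List<String> parts = new ArrayList<>();
        for (final String mention : mentions) {
            parts.add("[" + mention.trim() + "]");
        }
        return String.join(", ", parts);
    }

    public static void main(final String[] args) {
        // Mention strings copied from the sample output, trailing spaces included
        final List<String> vinken = Arrays.asList("Pierre Vinken ", "Mr. Vinken ");
        System.out.println(formatGroup(vinken)); // prints "[Pierre Vinken], [Mr. Vinken]"
    }
}
```
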
Update: I get different results when using ParserTool.parseLine(..), as described here, instead of the procedure described in my post on the Parser. Entity names include additional trailing punctuation, and entity resolution appears to be even worse:

  • [this British industrial conglomerate. ]
  • [chairman ]
  • [former chairman ]
  • [a director ]
  • [Consolidated Gold Fields PLC, ]
  • [55 years ]
  • [Rudolph Agnew, ]
  • [Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, ]
  • [the Dutch publishing group. ]
  • [Elsevier N.V. ]
  • [Mr. Vinken ]
  • [a nonexecutive director Nov. ]
  • [the board ]
  • [Pierre Vinken, 61 years ]

My source code and test cases can be found at


  1. Thanks very much for posting this, Dave. I have searched high and low and I think yours is the only example of the OpenNLP coref tool on the entire Internet. Even the OpenNLP website has nothing.

    I noticed a couple things with the code:
    1 – In the first code block, modelIn is never used
    2 – In the second code block, there is a call to parseSentence() that is not defined; it would be great if you could post this method too.

    I’m not surprised that you found the results to be inaccurate; I just did some testing with the Stanford coref library (which is supposedly the best), and the results were pretty horrible: about half of them were wrong. Also, the Stanford library required about 1 gig of memory.

    • Please ignore point #2; I didn’t see the link you had posted:

    • Good catch on not using the model. Oops! That’s a copy+paste artifact from when I initialized the other OpenNLP objects. I’ll update the post (and clarify the part about parseSentence).

      I’ve intended to dig into the Stanford NLP library for a while now, but just haven’t made the time. At least it doesn’t sound like I’m missing out on a lot! 🙂 I’m not sure if I noted this in any of my posts, but I also upped the memory when running OpenNLP to 1GB. I’m not sure that it actually needs that much, but the default certainly doesn’t cut it.

      Thanks for the feedback, John!

  2. Would you please send your code to me by email? Thank you very much. I am working on coreference resolution now, but I really do not know how to get started with the tools. Looking forward to your reply.

    • Sorry, but no. You should be able to put it together from the examples on my blog (see the links at the top of this post). Pretty much all of my example code is included or described in those posts.

  3. Dave thank you very much. I have been able to get everything working with little difficulty. I am currently working on creating a UIMA annotator version similar to the ones provided in the UIMA-OpenNLP distribution. As a follow up question, what does it mean to have a DiscourseEntity with only one mention? As I understand it, a DiscourseEntity should have a set of Mentions greater than 1 where say multiple pronouns would refer to a single proper noun?

    • Hi Paul,

      My understanding is that you will get a DiscourseEntity for every entity mention, whether it is mentioned once or multiple times. If an entity is only mentioned once, it still exists in the text but doesn’t have multiple references. I think “the board” entity in the example at the end of the post fits this scenario.

  4. Hello Paul. Do you have a similar reference that explains what the OpenNLP “similarity” metric is and what it is useful for?

  5. Hi Dave,

    As we’ve discussed on StackOverflow:

    The ParserTool.parseLine function expects a sentence with its words already tokenized and separated by spaces, like this: “I , the king , work there ( you know ) .”
    You can check the source code of parseLine to see that it only uses a StringTokenizer (and actually applies a parenthesis-separating regex, so the above sentence can just as well be written as “I , the king , work there (you know) .”)

    Regarding the use of NameFinder before coreference, it’s mentioned by one of its developers here:

    • Personally, I don’t feel like creating a tokenized sentence string for ParserTool.parseLine is easier than my approach above. However, thanks for pointing out the problem. I’ll update the post when I have a chance.

      I’ll also take a look at that link regarding the NameFinder. Thanks! It’s hard to believe there is so much hidden information scattered out there about OpenNLP.

    • Aldrian,

      I tried changing my sample to use NameFinder as described in the link you pointed to, but so far it doesn’t seem to make a difference in the coreference output. I’ll have to dig in a little more to see if that’s really the case (and to make sure that I’m doing it correctly).

      Or maybe the sample above isn’t complicated enough to take advantage of it.

      • It might be the case that someone (like me) already has the code to tokenize a sentence and just wants to parse the sentence. This is where `ParserTool.parseLine` is useful.

        Regarding the use of `NameFinder`, yes, it might not be apparent in some cases, but it does affect the result. For example this one:

        Edward is here.
        Mary is there.
        He met her.

        With NameFinder, the result is:


        Without NameFinder, the result is:


  6. Hello,
    can you send me the implementation of the method below, please?
    parseSentence(sentences[i], tokens[i])

    • Hi Jamilson.

      As noted in the post, that helper method can be found in my post on the OpenNLP Parser.

      • Hello, I implemented all the steps in the tutorials, but I could not get my code to run

        I am using the following model: en-parser-chunking.bin
        Not found: en-parser-chunker.bin

        Your method parseSentence(final String text) only takes one parameter.
        In the last post, you have parseSentence(sentences[i], tokens[i]). Is the tokens[i] parameter not used?

        If you have working code, could you send it to me by email, please?

        • Thanks for catching that! Those are typos that I missed after updating these posts a few times. You are right about en-parser-chunking and not needing tokens[i].

  7. Dave,
    I’m trying to get these co-references to work. The linker and co-ref are 2 things I didn’t understand well.
    Is it possible to email you the details?

  8. Is there a working sample anywhere? I couldn’t get this to work.

  9. Hi Dave,
    Nice blog post by you here – the best I have found on OpenNLP across the internet.
    I am getting “ ./dictionary/adj.exc (No such file or directory)”,
    I tried the absolute path but it didn’t help either.
    On checking the dictionary folder (which contains the database files for the WordNet 3.0 dictionary) I realized it doesn’t have any adj.exc file either.
    Can you please help me out?

    Thanks 🙂

    • You’re right! I don’t know if something has changed, but the WordNet 3.0 “just the database files” download looks incomplete.

      If you instead go to you can either get the full WordNet 3.0 source code and binaries, which has a `dict` folder containing all of the files, or it looks like there is a new “WordNet 3.1 DATABASE FILES ONLY” section that has the additional files and should work (but I haven’t tried them myself)

      I’ll update the link in the post to remove the direct download link

  10. can we get same code using c#

  11. Hi Dave,

    I am wondering can we differentiate a sentence whether it is a Question or a Statement using openNLP ? If so can you give me some pointers how to achieve it.

    Thanks in advance.

  12. Hi,

    I get the error below while running entity co-referencing:

    Exception in thread "main" java.util.NoSuchElementException
    	at java.util.StringTokenizer.nextToken(Unknown Source)
    	at net.didion.jwnl.dictionary.FileBackedDictionary.parseAndCacheIndexWordLine(
    	at net.didion.jwnl.dictionary.FileBackedDictionary.getIndexWord(
    	at net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor.lookupNextBaseForm(
    	at net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor.lookupAllBaseForms(
  13. Thanks, Dave, for the OpenNLP article.
    While performing entity co-referencing using OpenNLP, for some of the statements I get the error “Couldn’t find parse for ”. What could be the reason? The whole execution stops. If I set reportFailedParse = false, the error is masked but the program still stops. What causes this?

    The statements look normal to me. Kindly help.


    • If there is no text after the “Couldn’t find parse for:” message, then I would guess that the Parse object for your text is empty.

      You should probably step debug and step through the code in findEntityMentions to see what’s going on in your specific example.

  14. Hi Dave, I will debug as you suggested; I guess I need to attach the OpenNLP source code. However, the error occurs in the middle of the input, and there is text after the error message. Since the error occurs partway through, the entire execution stops. I’m just wondering whether I need to train the “chunk parser” or the “linker”? It actually looks like a normal sentence to me. I will send details shortly.

    Actually there is some related link

    The error says this happens if POS tagging fails for the sentence (top 10 tags). Should I increase this value?

    • Hi Dave, Found where this error comes.
      findEntityMentions -> parseSentence -> parse: _parser.parse(p) fails for the sample sentence below. Perhaps you can check.
      This actually invokes

      “More recently has mutated acceptance more supple encompassing duty act fairly ( significantly derived from Lord Reid ’ s speech [ 1964 ] AC 40 , particularly 1989 ( 4 ) SA 731 ( A ) more recently , [ 1993 ] 3 All ER 92 ( HL ) 106d – h .’”

  15. Thanks, got it up and running with this!
    I downloaded the WordNet 3.1 files and apparently the java code expected the file names in a different format, so I had to do some renaming (e.g. data.adj to adj.dat, index.noun to noun.idx, etc.), but other than that got through it without any problems, so thanks!

    • Good point peter. Now that you mention it I remember having to rename the files for different platforms. I’m assuming that the adj.dat naming was needed on Windows?

  16. Am able to see the package only till version 1.5.3. Some of the JIRA issues (e.g. OPENNLP-36 resolved on 16-Jan-2017) are mentioning, “Development Stopped”. Am not able to Google additional details. If you know, could you please throw some light. Thanks.

    • All of these posts are about OpenNLP 1.5, which works with the 1.4 core models. I haven’t used anything newer than 1.5.3, but it’s possible that the coreference component is no longer under development. Sorry that I don’t have more information for you!

