Making Coreference Resolution your bitch with OpenNLP 1.5.0

First things first: what is coreference resolution?

Coreference means that multiple expressions in a sentence or document refer to the same thing. OpenNLP contains a “linker” that analyzes the tokens of a sentence to identify which chunks of text refer to the same things (e.g., people, organizations, events).

Take, for example, the text “John drove to Judy’s house. He made her dinner.” Here John and He refer to one entity (John), while Judy and her refer to a second entity (Judy). Don’t expect OpenNLP to get this 100% correct. Even a simple example like this is a difficult problem.
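To make “refer to the same entity” concrete, here is a minimal sketch of the result we are hoping for: mentions grouped into one chain per entity. It does not touch OpenNLP at all. The class name `CorefChainsSketch`, the method `groupByEntity`, and the hand-assigned entity ids are all made up for illustration; a real linker has to work those ids out on its own, which is the hard part.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this is NOT OpenNLP's linker or its API.
class CorefChainsSketch {

    // Group mention strings by the entity id each one refers to.
    // The ids are assigned by hand here; a coreference resolver would predict them.
    static Map<Integer, List<String>> groupByEntity(String[] mentions, int[] entityIds) {
        Map<Integer, List<String>> chains = new LinkedHashMap<Integer, List<String>>();
        for (int i = 0; i < mentions.length; i++) {
            List<String> chain = chains.get(entityIds[i]);
            if (chain == null) {
                chain = new ArrayList<String>();
                chains.put(entityIds[i], chain);
            }
            chain.add(mentions[i]);
        }
        return chains;
    }

    public static void main(String[] args) {
        // Mentions in the order they appear in the example text.
        String[] mentions = {"John", "Judy", "He", "her"};
        // Entity 0 is John, entity 1 is Judy (assumed, hand-labeled).
        int[] entityIds = {0, 1, 0, 1};
        System.out.println(groupByEntity(mentions, entityIds));
    }
}
```

Each list in the map is one “chain” of mentions that point at the same entity, which is roughly the shape of answer you want back from any coreference tool.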

Picking up where I left off once upon a time (and finally wrapping up this series), here are links to the old material:

  • How to use the OpenNLP 1.5.0 Parser
  • Making Coreference Resolution your bitch with OpenNLP 1.5.0 (you’re reading it!)

    OpenNLP Part-of-Speech (POS) Tags: Penn English Treebank

    In the comments on my post about part-of-speech tagging, Manu asks

    Can you post a legend what the pos tags stand for? At the moment I’m working on a project where I use this and I dont know at the moment how much tags there are and what e.g. “JJ”, “IN” and the rest of them means. This would be very helpful.

    Ask and you shall receive!

    These are the Penn English Treebank POS tags. Here’s the list that I found in an answer at StackOverflow, but you’re on your own for finding out what each of these really means:

    1. CC Coordinating conjunction
    2. CD Cardinal number
    3. DT Determiner
    4. EX Existential there
    5. FW Foreign word
    6. IN Preposition or subordinating conjunction
    7. JJ Adjective
    8. JJR Adjective, comparative
    9. JJS Adjective, superlative
    10. LS List item marker
    11. MD Modal
    12. NN Noun, singular or mass
    13. NNS Noun, plural
    14. NNP Proper noun, singular
    15. NNPS Proper noun, plural
    16. PDT Predeterminer
    17. POS Possessive ending
    18. PRP Personal pronoun
    19. PRP$ Possessive pronoun
    20. RB Adverb
    21. RBR Adverb, comparative
    22. RBS Adverb, superlative
    23. RP Particle
    24. SYM Symbol
    25. TO to
    26. UH Interjection
    27. VB Verb, base form
    28. VBD Verb, past tense
    29. VBG Verb, gerund or present participle
    30. VBN Verb, past participle
    31. VBP Verb, non-3rd person singular present
    32. VBZ Verb, 3rd person singular present
    33. WDT Wh-determiner
    34. WP Wh-pronoun
    35. WP$ Possessive wh-pronoun
    36. WRB Wh-adverb
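If you want the legend in your code rather than in a browser tab, a plain lookup table does the job. This is a minimal sketch, not part of OpenNLP: the class name `PennTagLegend` and the method `describe` are my own inventions, and the map only holds a handful of the tags listed above (fill in the rest the same way).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not an OpenNLP class: maps Penn Treebank tags to descriptions.
class PennTagLegend {

    private static final Map<String, String> LEGEND = new HashMap<String, String>();

    static {
        // Partial legend, copied from the list above.
        LEGEND.put("DT", "Determiner");
        LEGEND.put("IN", "Preposition or subordinating conjunction");
        LEGEND.put("JJ", "Adjective");
        LEGEND.put("NN", "Noun, singular or mass");
        LEGEND.put("NNP", "Proper noun, singular");
        LEGEND.put("PRP", "Personal pronoun");
        LEGEND.put("VBD", "Verb, past tense");
    }

    // Return the description for a tag, or a placeholder if the tag is not in the map.
    static String describe(String tag) {
        String description = LEGEND.get(tag);
        return description != null ? description : "(not in this partial legend)";
    }

    public static void main(String[] args) {
        // The two tags Manu asked about, plus one that is missing from the map.
        String[] tags = {"JJ", "IN", "WRB"};
        for (String tag : tags) {
            System.out.println(tag + " = " + describe(tag));
        }
    }
}
```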

    How to use the OpenNLP 1.5.0 Parser

    After a brief (*cough*cough*) delay, I’m back to figure out how in the world to use this OpenNLP Parser. First, a quick refresher:

  • How to use the OpenNLP 1.5.0 Parser (surprise, you’re reading it)
  • Making Coreference Resolution your bitch with OpenNLP 1.5.0
  • Getting Started

    I’m only going to warn you once: this is a long post. Go grab a beer or a glass of wine or some coffee before starting. It’s long. Now I’ve warned you twice.


    Part-of-Speech (POS) Tagging with OpenNLP 1.5.0

    Continuing from where I left off, I’m going to quickly touch on part-of-speech tagging before moving on. It’s actually pretty straightforward once you’re set up to run OpenNLP. This all assumes that you’ve already done sentence detection and tokenization. If you haven’t, go back to the beginning. Here are the links to the rest of my posts:

  • How to use the OpenNLP 1.5.0 Parser
  • Making Coreference Resolution your bitch with OpenNLP 1.5.0
  • Getting Started

    model files

    Only one additional model file is needed for part-of-speech tagging.


    Getting started with OpenNLP 1.5.0 – Sentence Detection and Tokenizing

    OpenNLP is a poorly documented pain in the ass to figure out. There are various scattered resources on the internet, none of which are particularly thorough, accurate, or up to date.

    The most useful one I’ve found is a blog post called Getting started with OpenNLP (Natural Language Processing), but it is over 4 years old and refers to version 1.4.3 (1.5.x is what I’ll discuss here). That post is quite helpful, but I still had to dig into the source code to figure out the beast that is coreference resolution.

    Here’s to hoping that I can add a few posts to the conversation and help both myself and, perhaps, others…

  • How to use the OpenNLP 1.5.0 Parser
  • Making Coreference Resolution your bitch with OpenNLP 1.5.0

    Most (if not all) of the more advanced OpenNLP components rely on text that is broken into sentences and/or tokens, so I’m starting with those…
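To show what “broken into sentences and/or tokens” looks like in practice, here is a rough stand-in that uses only the JDK’s `java.text.BreakIterator`. To be clear, this is not OpenNLP’s sentence detector or tokenizer, and it will not split text the same way OpenNLP’s trained models do; the class name `SplitSketch` and its methods are hypothetical. It only illustrates the order of operations: sentences first, then tokens within each sentence.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in sketch: uses the JDK's BreakIterator, NOT OpenNLP's models.
class SplitSketch {

    // Step 1: split raw text into trimmed sentences.
    static List<String> sentences(String text) {
        return pieces(BreakIterator.getSentenceInstance(Locale.US), text);
    }

    // Step 2: split a single sentence into word and punctuation tokens.
    static List<String> tokens(String sentence) {
        return pieces(BreakIterator.getWordInstance(Locale.US), sentence);
    }

    // Walk the iterator's boundaries and keep every piece that is not just whitespace.
    static List<String> pieces(BreakIterator iterator, String text) {
        List<String> out = new ArrayList<String>();
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "John drove to Judy's house. He made her dinner.";
        for (String sentence : sentences(text)) {
            System.out.println(sentence + " -> " + tokens(sentence));
        }
    }
}
```

The later steps (tagging, parsing, coreference) all take per-sentence token arrays like these as input, which is why the splitting has to happen first.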