Part-of-Speech (POS) Tagging with OpenNLP 1.5.0

Continuing from where I left off, I’ll quickly touch on part-of-speech tagging before moving on. It’s pretty straightforward once you’re set up to run OpenNLP. This assumes that you’ve already done sentence detection and tokenization; if you haven’t, go back to the beginning. Here are the links to the rest of my posts:

  • How to use the OpenNLP 1.5.0 Parser
  • Making Coreference Resolution your bitch with OpenNLP 1.5.0
  • Getting Started

    Model files

    Only one additional model file is needed for part-of-speech tagging.


    The Lemur Toolkit

    I was reading Learning Similarity Metrics for Event Identification in Social Media (pdf) and caught a mention of the Lemur Toolkit, which I hadn’t previously heard about.

    They used it to index the text representation of documents and, apparently, to handle stemming and stop-words and to compute tf-idf vectors. I’ll have to look into this the next time I work with term vectors to see how easy it is to use.
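    For reference, here is a minimal sketch of what a tf-idf vector is, in plain Java. This is not Lemur’s API and not necessarily the weighting the paper used; it assumes one common variant (raw term count times ln(N / document frequency)), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Lemur Toolkit.
class TfIdfSketch {
    // Returns one term -> weight map per document.
    // Assumed weighting: tf = raw count in the document, idf = ln(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Term frequency: raw count of each term in this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                vector.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Documents are assumed to be tokenized already (stemming and stop-words not handled here).
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("concert", "tonight", "concert"),
                Arrays.asList("concert", "photos"));
        System.out.println(tfIdf(docs));
    }
}
```

    A term that shows up in every document gets a weight of zero under this variant, which is the point: it doesn’t help tell documents apart.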

    The toolkit doesn’t appear to be active (final version from June 2010), but can be found at

    Unable to locate the Javac Compiler with Maven and Eclipse

    Unable to locate the Javac Compiler in:
    C:\Program Files\Java\jre6\..\lib\tools.jar
    Please ensure you are using JDK 1.4 or above and
    not a JRE (the class is required).
    In most cases you can change the location of your Java
    installation by setting the JAVA_HOME environment variable.

    The solution that worked for me (tested on both 32- and 64-bit Eclipse/Java) was not to change eclipse.ini, but instead to open the Run/Debug Configuration dialog and, on the JRE tab, set the Runtime JRE to the appropriate JDK, either as the “Workspace default JRE” or as the “Alternate JRE”.
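    If you want to check whether a given runtime is a JDK or only a JRE, a small sketch like the following may help. It relies on the standard javax.tools.ToolProvider.getSystemJavaCompiler() call (available since Java 6), which returns null when no compiler is present; the JdkCheck class and its method names are made up for illustration.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical diagnostic helper; not part of Maven or Eclipse.
class JdkCheck {
    // True when the running JVM exposes a system Java compiler,
    // which is typically the case for a JDK but not for a bare JRE.
    static boolean hasCompiler() {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        return compiler != null;
    }

    // Builds a one-line report; kept separate from the lookup so it is easy to test.
    static String describe(boolean compilerPresent, String javaHome) {
        if (compilerPresent) {
            return "JDK detected (compiler available), java.home=" + javaHome;
        }
        return "JRE only (no compiler found), java.home=" + javaHome;
    }

    public static void main(String[] args) {
        System.out.println(describe(hasCompiler(), System.getProperty("java.home")));
    }
}
```

    Running it from Eclipse with the same JRE setting as the failing Maven configuration should show which installation is actually being picked up.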

    Does that work for anybody else?