OpenNLP Part-of-Speech (POS) Tags: Penn English Treebank

In the comments on my post about part-of-speech tagging, Manu asks:

Can you post a legend what the pos tags stand for? At the moment I’m working on a project where I use this and I dont know at the moment how much tags there are and what e.g. “JJ”, “IN” and the rest of them means. This would be very helpful.

Ask and you shall receive!

These are the Penn English Treebank POS tags. Here’s the list, which I found in an answer on Stack Overflow; beyond these one-line descriptions, you’re on your own for finding out what each of them really means:

  1. CC Coordinating conjunction
  2. CD Cardinal number
  3. DT Determiner
  4. EX Existential there
  5. FW Foreign word
  6. IN Preposition or subordinating conjunction
  7. JJ Adjective
  8. JJR Adjective, comparative
  9. JJS Adjective, superlative
  10. LS List item marker
  11. MD Modal
  12. NN Noun, singular or mass
  13. NNS Noun, plural
  14. NNP Proper noun, singular
  15. NNPS Proper noun, plural
  16. PDT Predeterminer
  17. POS Possessive ending
  18. PRP Personal pronoun
  19. PRP$ Possessive pronoun
  20. RB Adverb
  21. RBR Adverb, comparative
  22. RBS Adverb, superlative
  23. RP Particle
  24. SYM Symbol
  25. TO to
  26. UH Interjection
  27. VB Verb, base form
  28. VBD Verb, past tense
  29. VBG Verb, gerund or present participle
  30. VBN Verb, past participle
  31. VBP Verb, non-3rd person singular present
  32. VBZ Verb, 3rd person singular present
  33. WDT Wh-determiner
  34. WP Wh-pronoun
  35. WP$ Possessive wh-pronoun
  36. WRB Wh-adverb
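If you would rather look the tags up in code than scroll through a list, here is a minimal sketch of a lookup table in plain Java. The class name `PosTagLegend` and its methods are my own invention for illustration and are not part of OpenNLP. The descriptions are copied from the list above, and any tag outside it (punctuation tags, for example) falls through to "Unknown tag".

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of OpenNLP): Penn Treebank tag -> description. */
final class PosTagLegend {
    // The 36 tags from the list above, in the same order.
    private static final String[][] TAGS = {
        {"CC", "Coordinating conjunction"}, {"CD", "Cardinal number"},
        {"DT", "Determiner"}, {"EX", "Existential there"},
        {"FW", "Foreign word"}, {"IN", "Preposition or subordinating conjunction"},
        {"JJ", "Adjective"}, {"JJR", "Adjective, comparative"},
        {"JJS", "Adjective, superlative"}, {"LS", "List item marker"},
        {"MD", "Modal"}, {"NN", "Noun, singular or mass"},
        {"NNS", "Noun, plural"}, {"NNP", "Proper noun, singular"},
        {"NNPS", "Proper noun, plural"}, {"PDT", "Predeterminer"},
        {"POS", "Possessive ending"}, {"PRP", "Personal pronoun"},
        {"PRP$", "Possessive pronoun"}, {"RB", "Adverb"},
        {"RBR", "Adverb, comparative"}, {"RBS", "Adverb, superlative"},
        {"RP", "Particle"}, {"SYM", "Symbol"},
        {"TO", "to"}, {"UH", "Interjection"},
        {"VB", "Verb, base form"}, {"VBD", "Verb, past tense"},
        {"VBG", "Verb, gerund or present participle"}, {"VBN", "Verb, past participle"},
        {"VBP", "Verb, non-3rd person singular present"},
        {"VBZ", "Verb, 3rd person singular present"},
        {"WDT", "Wh-determiner"}, {"WP", "Wh-pronoun"},
        {"WP$", "Possessive wh-pronoun"}, {"WRB", "Wh-adverb"}
    };

    private static final Map<String, String> LEGEND = new LinkedHashMap<>();

    static {
        for (String[] row : TAGS) {
            LEGEND.put(row[0], row[1]);
        }
    }

    /** Returns the description for a tag, or "Unknown tag" if it is not in the list. */
    static String describe(String tag) {
        return LEGEND.getOrDefault(tag, "Unknown tag");
    }

    /** Number of tags in the legend. */
    static int size() {
        return LEGEND.size();
    }

    public static void main(String[] args) {
        // Example: explain the tags Manu asked about.
        for (String tag : new String[] {"JJ", "IN"}) {
            System.out.println(tag + " = " + describe(tag));
        }
    }
}
```

Feed it the strings your tagger returns and print `describe(tag)` next to each token.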

How to use the OpenNLP 1.5.0 Parser

After a brief (*cough*cough*) delay, I’m back to figure out how in the world to use this OpenNLP parser. First, a quick refresher:

  • How to use the OpenNLP 1.5.0 Parser (surprise, you’re reading it)
  • Making Coreference Resolution your bitch with OpenNLP 1.5.0
  • Getting Started

    I’m only going to warn you once: this is a long post. Go grab a beer or a glass of wine or some coffee before starting. It’s long. Now I’ve warned you twice.


    Part-of-Speech (POS) Tagging with OpenNLP 1.5.0

    Continuing from where I left off, I’m going to quickly touch on part-of-speech tagging before moving on. It’s actually pretty straightforward once you’re set up to run OpenNLP. This all assumes that you’ve already done sentence detection and tokenization. If you haven’t, go back to the beginning. Here are the links to the rest of my posts:

  • How to use the OpenNLP 1.5.0 Parser
  • Making Coreference Resolution your bitch with OpenNLP 1.5.0
  • Getting Started

    model files

    Only one additional model file is needed for part-of-speech tagging.


    The Lemur Toolkit

    I was reading Learning Similarity Metrics for Event Identification in Social Media (pdf) and caught a mention of the Lemur Toolkit, which I hadn’t previously heard about.

    They used it for indexing the text representation of documents and, apparently, handling stemming, stop-words, and computing tf-idf vectors. I’ll have to look into this in the future when working with term vectors to see how easy it is to use.

    The toolkit doesn’t appear to be active (the final version is from June 2010), but it can be found at http://www.lemurproject.org/lemur.php.

    Unable to locate the Javac Compiler with Maven and Eclipse

    Unable to locate the Javac Compiler in:
    C:\Program Files\Java\jre6\..\lib\tools.jar
    Please ensure you are using JDK 1.4 or above and
    not a JRE (the com.sun.tools.javac.Main class is required).
    In most cases you can change the location of your Java
    installation by setting the JAVA_HOME environment variable.

    The solution that worked for me (tested on both 32- and 64-bit Eclipse/Java) was not to change eclipse.ini, but instead to set the Runtime JRE on the JRE tab of the Run/Debug Configuration dialog to the appropriate JDK, either as the “Workspace default JRE” or as the “Alternate JRE”.

    Does that work for anybody else?
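If you want to confirm which kind of runtime a given configuration actually picks up, here is a minimal sketch that asks the standard `javax.tools.ToolProvider` API (available since Java 6) whether a compiler is present. `getSystemJavaCompiler()` returns null when none is available, which is typically the case on a plain JRE like the one in the error above. The class name `JdkCheck` is made up for this example.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

/** Hypothetical diagnostic: reports whether the running Java has a compiler. */
final class JdkCheck {

    /** True when the current runtime ships a Java compiler (i.e. it is a JDK). */
    static boolean hasCompiler() {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        return compiler != null;
    }

    /** One-line summary including the java.home the JVM was started from. */
    static String report() {
        String javaHome = System.getProperty("java.home");
        String kind = hasCompiler() ? "JDK (compiler available)" : "JRE only (no compiler)";
        return kind + ": " + javaHome;
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Run it from Eclipse with the same Runtime JRE setting as the failing Maven launch. If it reports a JRE, that configuration is still not pointing at a JDK.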