The Lemur Toolkit

I was reading Learning Similarity Metrics for Event Identification in Social Media (pdf) and came across a mention of the Lemur Toolkit, which I hadn’t heard of before.

The authors used it to index the text representation of documents and, apparently, to handle stemming, stop-word removal, and the computation of tf-idf vectors. I’ll have to look into it the next time I work with term vectors to see how easy it is to use.
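
To remind myself what that pipeline roughly involves, here is a minimal sketch in plain Python. It is not Lemur's API and not the paper's setup: the stop-word list, the `tokenize` and `tfidf_vectors` helpers, the sample documents, and the `tf * log(N / df)` weighting are all my own assumptions for illustration, and stemming is left out entirely.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list Lemur uses).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop-words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weighting by tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors


# Made-up sample documents, purely for illustration.
docs = [
    "The concert in the park",
    "A concert on the beach",
    "Traffic on the bridge",
]
vectors = tfidf_vectors(docs)
```

With this weighting, a term that appears in fewer documents gets a higher weight, so "park" ends up weighted above "concert" in the first document. A real toolkit would also stem terms and likely use a smoothed or normalized variant of the formula.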

The toolkit doesn’t appear to be active (final version from June 2010), but can be found at

