Software and data

See also the software and data that accompany the Publications and Theses of the group.


  • Evaluation Measures for Hierarchical Classification: The software that accompanies our paper "Evaluation Measures for Hierarchical Classification: A Unified View and Novel Approaches". Download
  • Greek part-of-speech tagger. The tagger attempts to automatically determine the part of speech (e.g., noun, adjective, verb) of each word occurrence in Greek texts. It can also tag each word occurrence with additional information, such as the gender, number, and case of each noun, or the voice, tense, and number of each verb.
    • Download (version 2.2 alpha): Minor bug fixes.
    • Download (version 2.1 alpha): This version uses Stanford's Maximum Entropy Classifier. It performs better than version 1 and provides an API, but it does not yet provide a GUI or active learning facilities.
    • Download (version 1): This version uses a k-nearest neighbour classifier. It includes a GUI and active learning facilities, but no API.
  • NaiveBayesSpamDetector: an experimental e-mail spam filter that uses various forms of the Naive Bayes classifier.
  • Named-entity recognizer for Greek texts.
  • NaturalOWL: a natural language generator for OWL ontologies that supports English and Greek; it can be used within Protégé.
  • NLITDB: A prototype natural language interface for temporal databases. Download
  • Sentence compression software: the software of our HLT-NAACL 2010 paper. Download


  • AspectTermSimilarities: manually specified similarities between aspect terms of English restaurant and laptop reviews, as used in our EACL 2014 paper "Multi-Granular Aspect Aggregation in Aspect-Based Sentiment Analysis". Download
  • Biomedical word embeddings: English word embeddings pre-trained on biomedical texts from MEDLINE®/PubMed® using the Word2Vec implementation of the gensim toolkit. [Readme] [Embeddings-200D] [Embeddings-400D]
  • Contracts dataset: the dataset of our ICAIL 2017 paper "Extracting Contract Elements". Download
  • EURLEX57K dataset: the dataset of our NLLP Workshop 2019 paper "Extreme Multi-Label Legal Text Classification: A case study in EU Legislation". Download
  • Enron-Spam: contains ham e-mail messages from the Enron corpus and spam messages. Download
  • Gazzetta dataset: The dataset of our papers "Deep Learning for User Comment Moderation" (ACL 2017 workshop "Abusive Content Online") and "Deeper Attention to Abusive User Content Moderation" (EMNLP 2017). Download
  • Ling-Spam: contains ham e-mail messages from a mailing list and spam messages. Download
  • Paraphrases: a collection of sentences and manually scored candidate paraphrases, as used in our EMNLP 2011 paper "A Generate and Rank Approach to Sentence Paraphrasing". Download
  • PU: contains ham e-mail messages (in encoded form) and spam messages. Download