Software and data

See also the software and data that accompany the Publications and Theses of the group. We also provide software and data on GitHub.

Software

  • COVID-19 search engine: An experimental document and snippet retrieval search engine for the CORD-19 (COVID-19) dataset, based on our best BioASQ7 system. [online demo] [paper] [paper] [code]
  • edgar-crawler: An NLP toolkit to download, clean, and extract textual data from financial reports in the United States.
  • Evaluation Measures for Hierarchical Classification: The software that accompanies our paper "Evaluation Measures for Hierarchical Classification: A Unified View and Novel Approaches". Download
  • gr-nlp-toolkit: A natural language processing pipeline for Greek based on pre-trained Transformers. The toolkit supports named entity recognition, part-of-speech tagging, morphological tagging, and dependency parsing. Consult the BSc theses of C. Dikonimaki and N. Smyrnioudis (2021) for more information.
  • Greek BERT: Greek version of BERT, pre-trained on Greek corpora.
  • Greek part-of-speech tagger and dependency parser: implemented as part of the MSc thesis of M. Kyriakakis. [thesis] [code]
  • Greek part-of-speech tagger. The tagger attempts to automatically determine the part of speech (e.g., noun, adjective, verb) of each word occurrence in Greek texts. It can also tag each word occurrence with additional information, such as the gender, number, and case of each noun, or the voice, tense, and number of each verb.
    • Download (version 2.2 alpha): Minor bug fixes.
    • Download (version 2.1 alpha): This version uses Stanford's Maximum Entropy Classifier (see http://nlp.stanford.edu/software/), performs better than version 1, and provides an API. However, it does not yet provide a GUI or active learning facilities.
    • Download (version 1): This version uses a k-nearest neighbour classifier. It includes a GUI and active learning facilities, but no API.
  • NaiveBayesSpamDetector: an experimental e-mail spam filter that uses various forms of the Naive Bayes classifier.
  • Named-entity recognizer for Greek texts.
  • NaturalOWL: a natural language generator for OWL ontologies that supports English and Greek; it can be used within Protégé.
  • NLITDB: A prototype natural language interface for temporal databases. Download
  • Sentence compression software: the software of our HLT-NAACL 2010 paper. Download

Data

  • EU/UK RegIR dataset: The dataset of our EACL 2021 paper "Regulatory Compliance through Doc2Doc Information Retrieval: A case study in EU/UK legislation where text similarity has limitations." Download
  • AspectTermSimilarities: manually specified similarities between aspect terms of English restaurant and laptop reviews, as used in our EACL 2014 paper "Multi-Granular Aspect Aggregation in Aspect-Based Sentiment Analysis". Download
  • Biomedical word embeddings: English word embeddings pre-trained on biomedical texts from MEDLINE®/PubMed® using the Word2Vec implementation of the gensim toolkit. [Readme] [Embeddings-200D] [Embeddings-400D]
  • Contracts dataset: the dataset of our ICAIL 2017 paper "Extracting Contract Elements". Download
  • EURLEX57K dataset: the dataset of our NLLP Workshop 2019 paper "Extreme Multi-Label Legal Text Classification: A case study in EU Legislation". Download
  • Enron-Spam: contains ham e-mail messages from the Enron corpus and spam messages. Download
  • Gazzetta dataset: The dataset of our papers "Deep Learning for User Comment Moderation" (ACL 2017 workshop "Abusive Content Online") and "Deeper Attention to Abusive User Content Moderation" (EMNLP 2017). Download
  • Ling-Spam: contains ham e-mail messages from a mailing list and spam messages. Download
  • Paraphrases: a collection of sentences and manually scored candidate paraphrases, as used in our EMNLP 2011 paper "A Generate and Rank Approach to Sentence Paraphrasing". Download
  • PU: contains ham e-mail messages (in encoded form) and spam messages. Download