AUEB's NLP Group

Software and data

See also the software and data that accompany the Publications and Theses of the group. We also provide software and data on GitHub.

Software

GR-NLP-TOOLKIT: An open-source Natural Language Processing toolkit for Modern Greek. Provides part-of-speech/morphological tagging, dependency parsing, named entity recognition, and Greeklish-to-Greek conversion. [paper] [code]
COVID-19 search engine: An experimental document and snippet retrieval search engine for the CORD-19 (COVID-19) dataset, based on our best BioASQ7 system. [paper] [paper] [code]
edgar-crawler: An NLP toolkit to download, clean, and extract textual data from financial reports in the United States.
Evaluation Measures for Hierarchical Classification: The software that accompanies our paper "Evaluation Measures for Hierarchical Classification: A Unified View and Novel Approaches". Download
Greek BERT: Greek version of BERT, pre-trained on Greek corpora.
Greek part-of-speech tagger and dependency parser. Implemented during the MSc thesis of M. Kyriakakis. [thesis] [code] (OLD - see the more recent GR-NLP-TOOLKIT)
Greek part-of-speech tagger. The tagger attempts to automatically determine the part of speech (e.g., noun, adjective, verb) of each word occurrence in Greek texts. It can also tag each word occurrence with additional information, such as the gender, number, and case of each noun, or the voice, tense, and number of each verb.
- Download (version 2.2 alpha): Minor bug fixes.
- Download (version 2.1 alpha): This version uses Stanford's Maximum Entropy Classifier (see http://nlp.stanford.edu/software/). It performs better than version 1 and provides an API, but it does not yet provide a GUI or active learning facilities.
- Download (version 1): This version uses a k-nearest neighbour classifier. It includes a GUI and active learning facilities, but no API.
(OLD - see the more recent GR-NLP-TOOLKIT)
NaiveBayesSpamDetector: an experimental e-mail spam filter that uses various forms of the Naive Bayes classifier.
- Download (version 1.3)
Named-entity recognizer for Greek texts.
- Download (version 2): Recognizes temporal expressions, person names, and organization names.
- Download (version 1): Recognizes temporal expressions and person names.
(OLD - see the more recent GR-NLP-TOOLKIT)
NaturalOWL: a natural language generator for OWL ontologies that supports English and Greek; it can be used within Protégé.
- Download (version 2.0): This is the version used in our JAIR article of 2013.
- Download (version 1.1): Minor bug fixes concerning compatibility with Linux and Mac computers.
- Download (version 1.0): NaturalOWL's first release.
NLITDB: A prototype natural language interface for temporal databases. Download
Sentence compression software: the software of our HLT-NAACL 2010 paper. Download

Data

EU/UK RegIR dataset: The dataset of our EACL 2021 paper "Regulatory Compliance through Doc2Doc Information Retrieval: A case study in EU/UK legislation where text similarity has limitations." Download
AspectTermSimilarities: manually specified similarities between aspect terms of English restaurant and laptop reviews, as used in our EACL 2014 paper "Multi-Granular Aspect Aggregation in Aspect-Based Sentiment Analysis". Download
Biomedical word embeddings: English word embeddings pre-trained on biomedical texts from MEDLINE®/PubMed® using the Word2Vec implementation of the gensim toolkit. [Readme] [Embeddings-200D] [Embeddings-400D]
Contracts dataset: the dataset of our ICAIL 2017 paper "Extracting Contract Elements". Download
EURLEX57K dataset: the dataset of our NLLP Workshop 2019 paper "Extreme Multi-Label Legal Text Classification: A case study in EU Legislation". Download
Enron-Spam: contains ham e-mail messages from the Enron corpus and spam messages. Download
Gazzetta dataset: The dataset of our papers "Deep Learning for User Comment Moderation" (ACL 2017 workshop "Abusive Content Online") and "Deeper Attention to Abusive User Content Moderation" (EMNLP 2017). Download
Ling-Spam: contains ham e-mail messages from a mailing list and spam messages. Download
Paraphrases: a collection of sentences and manually scored candidate paraphrases, as used in our EMNLP 2011 paper "A Generate and Rank Approach to Sentence Paraphrasing". Download
PU: contains ham e-mail messages (in encoded form) and spam messages. Download

Natural Language Processing Group

Department of Informatics - Athens University of Economics and Business

Pages developed by: D. Galanis
Pages maintained by: Ch. Vlachos, N. Nikolaidis, S. Eleftheriou
Last update: 25-10-2025
