Software and Data

See also the software and data that accompany the Publications and Theses of the group. We also provide software and data on GitHub.

Software
  • SynDisco: A lightweight, simple and specialized framework used for creating, storing, annotating and analyzing synthetic discussions between LLMs. [Webpage] [Code]
  • Apunim: A statistical, quantifiable metric that attributes polarization to annotator subgroups. Used to surface minority viewpoints in sensitive NLP tasks. Includes a p-value test. [Webpage] [Code]
  • GR-NLP-TOOLKIT: An open-source Natural Language Processing toolkit for Modern Greek. Provides part-of-speech/morphological tagging, dependency parsing, named entity recognition, and Greeklish-to-Greek conversion. [Paper] [Code]
  • SEC-BERT: a family of BERT models for the financial domain (English), pre-trained on U.S. SEC EDGAR filings and released with our FiNER paper (ACL 2022). [Paper] [SEC-BERT-BASE] [SEC-BERT-NUM] [SEC-BERT-SHAPE]
  • COVID-19 search engine: An experimental document and snippet retrieval search engine for the CORD-19 (COVID-19) dataset, based on our best BioASQ7 system. [Paper] [Paper] [Code]
  • EDGAR-CRAWLER: An open-source toolkit that converts raw, unstructured U.S. SEC EDGAR filings into clean, section-level structured JSON data. Presented at WWW 2025. [Paper] [Code]
  • Evaluation Measures for Hierarchical Classification: The software that accompanies our paper "Evaluation Measures for Hierarchical Classification: A Unified View and Novel Approaches". Download
  • Greek BERT: Greek version of BERT, pre-trained on Greek corpora.
  • NaiveBayesSpamDetector: an experimental e-mail spam filter that uses various forms of the Naive Bayes classifier. Download
  • NaturalOWL: a natural language generator for OWL ontologies that supports English and Greek; it can be used within Protégé.
  • NLITDB: A prototype natural language interface for temporal databases. Download

Datasets
  • PEFK: The "Prosocial and Effective Facilitation in Konversations" dataset. An aggregated and standardized dataset composed of the important facilitation datasets presented in the social science literature. Unfortunately, we cannot provide direct downloads due to licensing. [Code]
  • FiNER-139: the dataset of our ACL 2022 paper "FiNER: Financial Numeric Entity Recognition for XBRL Tagging"; 1.1M sentences from SEC filings annotated with XBRL tags. [Dataset] [Paper] [Code]
  • EDGAR-CORPUS: the dataset of our ECONLP 2021 paper "EDGAR-CORPUS: Billions of Tokens Make the World Go Round"; a large-scale financial NLP corpus of annual reports (10-K filings) from the U.S. SEC EDGAR system, generated with EDGAR-CRAWLER. [Dataset] [Paper]
  • EU/UK RegIR dataset: The dataset of our EACL 2021 paper "Regulatory Compliance through Doc2Doc Information Retrieval: A case study in EU/UK legislation where text similarity has limitations." Download
  • AspectTermSimilarities: manually specified similarities between aspect terms of English restaurant and laptop reviews, as used in our EACL 2014 paper "Multi-Granular Aspect Aggregation in Aspect-Based Sentiment Analysis". Download
  • Biomedical word embeddings: English word embeddings pre-trained on biomedical texts from MEDLINE®/PubMed® using the Word2Vec implementation of the gensim toolkit. [Readme] [Embeddings-200D] [Embeddings-400D]
  • Contracts dataset: the dataset of our ICAIL 2017 paper "Extracting Contract Elements". Download
  • EURLEX57K dataset: the dataset of our NLLP Workshop 2019 paper "Extreme Multi-Label Legal Text Classification: A case study in EU Legislation". Download
  • Enron-Spam: contains ham e-mail messages from the Enron corpus and spam messages. Download
  • Gazzetta dataset: The dataset of our papers "Deep Learning for User Comment Moderation" (ACL 2017 workshop "Abusive Content Online") and "Deeper Attention to Abusive User Content Moderation" (EMNLP 2017). Download
  • Ling-Spam: contains ham e-mail messages from a mailing list and spam messages. Download
  • Paraphrases: a collection of sentences and manually scored candidate paraphrases, as used in our EMNLP 2011 paper "A Generate and Rank Approach to Sentence Paraphrasing". Download
  • PU: contains ham e-mail messages (in encoded form) and spam messages. Download