Tools

XL-LEXEME

You can use XL-LEXEME to encode the meaning of words in specific contexts. It represents the state of the art in semantic change detection. Described in "XL-LEXEME: WiC Pretrained Model for Cross-Lingual LEXical sEMantic changE".

Available here: https://github.com/pierluigic/xl-lexeme
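
Once each usage of a word has been encoded as a vector, one common way to quantify semantic change is to compare the vectors from two time periods. The sketch below is illustrative only: it uses toy 3-dimensional vectors in place of real XL-LEXEME embeddings, and the average pairwise cosine distance it computes is an assumed scoring choice, not something this page prescribes. The function names are hypothetical and not part of the XL-LEXEME repository.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(period_a, period_b):
    """Mean cosine distance over every cross-period pair of usage vectors."""
    distances = [cosine_distance(u, v) for u, v in product(period_a, period_b)]
    return sum(distances) / len(distances)


# Toy vectors standing in for contextual embeddings of one target word
# (assumption: real embeddings would come from the encoder, not be typed in).
early = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
late = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]

stable = average_pairwise_distance(early, early)   # same period vs. itself
shifted = average_pairwise_distance(early, late)   # early vs. late usages
print(f"stable={stable:.3f} shifted={shifted:.3f}")
```

A higher score for the early/late comparison than for the within-period comparison suggests that the word is used differently in the two periods.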

Janus

With Janus you can generate sentences in which a word appears with a meaning you choose, written in the style of the historical period you prefer (1700-2010). Described in Sense-specific Historical Word Usage Generation.

Available here: https://huggingface.co/ChangeIsKey/llama3-janus

Datasets

WikIPA

WikIPA is a multilingual benchmark dataset designed for speech-to-IPA (STIPA) transcription, linking spoken audio with International Phonetic Alphabet (IPA) transcriptions.

The dataset integrates two large-scale community-driven resources:

  • WikiPron — human-curated IPA pronunciations extracted from Wiktionary
  • Lingua Libre — crowdsourced recordings of spoken lexical items

By connecting these two resources, WikIPA provides a dataset that links speech audio to phonetic representations, enabling evaluation of models that transcribe speech directly into IPA.

The dataset supports both:

  • Broad (phonemic) IPA transcriptions
  • Narrow (phonetic) IPA transcriptions

The dataset contains 289,694 audio–IPA pairs across 78 languages.

Available here: https://huggingface.co/datasets/pierluigic/WikIPA
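
To illustrate how a speech-to-IPA model might be scored against reference transcriptions like those in WikIPA, the sketch below computes an edit-distance-based error rate between a reference IPA string and a predicted one. This is an assumed metric chosen for illustration; the page does not state which metric the benchmark uses. The function names and the example strings are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            cost = 0 if ref_symbol == hyp_symbol else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def symbol_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length.

    Note: strings are compared code point by code point, so a combining
    diacritic counts as its own symbol. Segmenting into phones first is
    a possible refinement that this sketch does not attempt.
    """
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical reference and model output for one audio clip.
rate = symbol_error_rate("kæt", "kat")
print(f"symbol error rate: {rate:.2f}")
```

Averaging this rate over all audio-IPA pairs in a language gives one number per language, which makes results comparable across the broad and narrow transcription settings.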

GWSD

The GWSD Dataset (Graded Word Sense Disambiguation Dataset) is a sense-annotated dataset designed for studying diachronic word usage and semantic change. It contains:

  • 2584 word usages from the Oxford English Dictionary (OED) and

  • 2584 automatically generated word usage examples.

In particular, the automatically generated word usages are obtained using Janus, a fine-tuned language model trained on the Oxford English Dictionary (OED), which allows for temporally aligned and sense-specific word usage examples spanning the period from 1700 to 2010. Described in Sense-specific Historical Word Usage Generation.

Available here: https://zenodo.org/records/14974455

DWUG IT: Diachronic Word Usage Graphs for Italian

DWUG IT provides sense annotations for up to 50 sentences for each time period (1948–1960 and 1970–1990) and for each of the 18 target words. The annotations were performed on the Italian corpus L’Unità.

Available here: https://ceur-ws.org/Vol-3878/22_main_long.pdf