Tools and Datasets
Tools
XL-LEXEME

You can use XL-LEXEME to encode the meaning of words in specific contexts. It represents the state of the art in semantic change detection.
Described in WiC Pretrained Model for Cross-Lingual LEXical sEMantic changE
Available here: https://github.com/pierluigic/xl-lexeme
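One common way to turn contextual word embeddings into a semantic change score is the average pairwise distance between usages from two time periods. The sketch below illustrates that idea only; it is not taken from the XL-LEXEME repository. The toy 3-dimensional vectors stand in for real embeddings, and the function names cosine_distance and average_pairwise_distance are hypothetical.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(period_a, period_b):
    """Mean cosine distance over all cross-period pairs of usage vectors."""
    pairs = list(product(period_a, period_b))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Assumption: toy vectors standing in for embeddings of one target word,
# one vector per usage (sentence) in each time period.
old_usages = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
new_usages = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]

change_score = average_pairwise_distance(old_usages, new_usages)
stable_score = average_pairwise_distance(old_usages, old_usages)
print(f"change={change_score:.3f} stable={stable_score:.3f}")
```

A higher score suggests that the word is used in more different contexts across the two periods. With real data, each vector would be the embedding of the target word in one sentence.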
Janus
With Janus you can generate sentences in which a word appears with a meaning you choose, written in the style of the historical period you prefer (1700–2010).
Described in Sense-specific Historical Word Usage Generation
Available here: https://huggingface.co/ChangeIsKey/llama3-janus
Datasets
WikIPA
WikIPA is a multilingual benchmark dataset designed for speech-to-IPA (STIPA) transcription, linking spoken audio with International Phonetic Alphabet (IPA) transcriptions.
The dataset integrates two large-scale community-driven resources:
- WikiPron — human-curated IPA pronunciations extracted from Wiktionary
- Lingua Libre — crowdsourced recordings of spoken lexical items
By connecting these two resources, WikIPA provides a dataset that links speech audio to phonetic representations, enabling evaluation of models that transcribe speech directly into IPA.
The dataset supports both:
- Broad (phonemic) IPA transcriptions
- Narrow (phonetic) IPA transcriptions
The dataset contains 289,694 audio–IPA pairs across 78 languages.
Available here: https://huggingface.co/datasets/pierluigic/WikIPA
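A speech-to-IPA model is often scored by comparing its predicted transcription with the reference one using an edit-distance-based error rate. The sketch below shows one such metric. It is an assumption, not the official WikIPA evaluation protocol: it treats each Unicode code point as one symbol, which is a simplification for IPA strings that use combining diacritics. The function names edit_distance and error_rate are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Assumption: a made-up reference/prediction pair, not taken from the dataset.
reference = "kæt"
prediction = "kat"
print(f"error rate: {error_rate(reference, prediction):.3f}")
```

Broad and narrow transcriptions would be scored separately, since a narrow reference contains symbols that a broad prediction is not expected to produce.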
GWSD
The GWSD Dataset (Graded Word Sense Disambiguation Dataset) is a sense-annotated dataset designed for studying diachronic word usage and semantic change. It contains:
- 2584 word usages from the Oxford English Dictionary (OED) and
- 2584 automatically generated word usage examples.
In particular, the automatically generated word usages are obtained using Janus, a fine-tuned language model trained on the Oxford English Dictionary (OED), which allows for temporally aligned and sense-specific word usage examples spanning historical periods from 1700 to 2010.
Described in Sense-specific Historical Word Usage Generation
Available here: https://zenodo.org/records/14974455
DWUG IT: Diachronic Word Usage Graphs for Italian
DWUG IT provides sense annotations for up to 50 sentences for each time period (1948–1960 and 1970–1990) and for each of the 18 target words. The annotations were performed on the Italian corpus L’Unità.
Available here: https://ceur-ws.org/Vol-3878/22_main_long.pdf