Arca Verborum
A Multi-Source Lexical Database for Computational Historical Linguistics
What is Arca Verborum?
Arca Verborum is a project providing analysis-ready lexical databases for computational historical linguistics. The project integrates data from multiple sources, structured for immediate use in research and education. An informal announcement and explanation is available in a blog post by the author.
Data Series
Arca Verborum organizes data into distinct series, each derived from different sources but sharing a common structure:
- Series A – Comparative wordlists from the Lexibank initiative: 149 datasets, 2.9M forms across 9,700+ languages
- Series B (planned) – Wiktionary-derived etymological and lexical data
Why Series A?
While the normalized structure of CLDF (Cross-Linguistic Data Formats) is excellent for data integrity, it requires significant preprocessing before analysis. Series A provides denormalized, pre-joined CSV files so you can start working immediately. It is intended for rapid method development and prototyping, student projects, and teaching computational linguistics, and will be used to bootstrap and validate other series.
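To illustrate what "denormalized, pre-joined" means in practice, the sketch below contrasts the two layouts with toy tables. The column names (`ID`, `Language_ID`, `Parameter_ID`, `Form`) follow general CLDF conventions and are illustrative only, not the exact Series A schema:

```python
import pandas as pd

# Normalized CLDF-style layout: three separate tables linked by IDs.
forms = pd.DataFrame({
    'ID': ['f1', 'f2'],
    'Language_ID': ['lang1', 'lang2'],
    'Parameter_ID': ['HAND', 'HAND'],
    'Form': ['mano', 'main'],
})
languages = pd.DataFrame({'ID': ['lang1', 'lang2'],
                          'Name': ['Spanish', 'French']})
parameters = pd.DataFrame({'ID': ['HAND'],
                           'Concepticon_Gloss': ['HAND']})

# Before analysis, a CLDF user must join the tables themselves:
joined = (forms
          .merge(languages, left_on='Language_ID', right_on='ID',
                 suffixes=('', '_language'))
          .merge(parameters, left_on='Parameter_ID', right_on='ID',
                 suffixes=('', '_parameter')))

# Series A ships the equivalent of `joined` as a single wide forms.csv,
# so each row already carries its language and concept metadata.
print(joined[['Form', 'Name', 'Concepticon_Gloss']])
```

The pre-joined layout trades some redundancy (language metadata repeated on every row) for zero setup cost at analysis time.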
Collections
Series A is available in three collections tailored for different use cases:
| | Full | Core | CoreCog |
|---|---|---|---|
| Datasets | 149 | 13 | 58 |
| Forms | 2,915,377 | 255,451 | 451,921 |
| Languages | 9,738 | 1,748 | 2,462 |
| Concepts | 168,263 | 4,379 | 27,794 |
Full Collection: Complete dataset with all 149 repositories.
Core Collection: 13 curated datasets for teaching, with global coverage.
CoreCog Collection: 58 datasets with expert cognate judgments for cognate research.
All collections share a single DOI: https://doi.org/10.5281/zenodo.17294927
Quick Start
Python
```python
import csv

# Load the data (example with Core collection)
with open('arcaverborum-A-core-20251008/forms.csv', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    forms = list(reader)

# Forms with cognate judgments
cognate_forms = [f for f in forms if f['Cognacy']]
```
Python / Pandas
```python
import pandas as pd

# Load the data (example with Core collection)
forms = pd.read_csv('arcaverborum-A-core-20251008/forms.csv')

# Forms with cognate judgments
cognate_forms = forms[forms['Cognacy'].notna()]
```
R
```r
library(tidyverse)

# Load the data (example with Core collection)
forms <- read_csv('arcaverborum-A-core-20251008/forms.csv')

# Forms with Concepticon mapping
forms %>% filter(!is.na(Concepticon_Gloss))
```
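A common next step after loading is grouping cognate-coded forms into cognate sets per concept. The following is a minimal pandas sketch using the `Cognacy` and `Concepticon_Gloss` columns from the snippets above, with a toy DataFrame standing in for `forms.csv`; the exact semantics of cognate-set identifiers vary by dataset, so treat this as illustrative:

```python
import pandas as pd

# Toy stand-in for forms.csv, with the columns used in the Quick Start.
forms = pd.DataFrame({
    'Form': ['mano', 'main', 'hand', 'agua', 'eau'],
    'Concepticon_Gloss': ['HAND', 'HAND', 'HAND', 'WATER', 'WATER'],
    'Cognacy': ['1', '1', '2', '7', '7'],
})

# Keep only forms with a cognate judgment, then group them into
# cognate sets keyed by (concept, cognate-set ID).
coded = forms[forms['Cognacy'].notna()]
cognate_sets = (coded
                .groupby(['Concepticon_Gloss', 'Cognacy'])['Form']
                .apply(list))
print(cognate_sets)
```

On the real data, cognate-set identifiers are generally only comparable within a dataset, so grouping should usually also include a dataset identifier.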
Citation
If you use this dataset, please cite both Arca Verborum and Lexibank:
Arca Verborum
```bibtex
@dataset{tresoldi2025_arcaverborum,
  author    = {Tresoldi, Tiago},
  title     = {Arca Verborum: A Global Lexical Database for
               Computational Historical Linguistics},
  year      = {2025},
  publisher = {Zenodo},
  version   = {A.20251008},
  doi       = {10.5281/zenodo.17294927},
  url       = {https://doi.org/10.5281/zenodo.17294927}
}
```
Lexibank
```bibtex
@article{list2022lexibank,
  author  = {List, Johann-Mattis and Forkel, Robert and
             Greenhill, Simon J. and Rzymski, Christoph and
             Englisch, Johannes and Gray, Russell D.},
  title   = {Lexibank, a public repository of standardized
             wordlists with computed phonological and lexical
             features},
  journal = {Scientific Data},
  volume  = {9},
  number  = {316},
  year    = {2022},
  doi     = {10.1038/s41597-022-01432-0},
  url     = {https://doi.org/10.1038/s41597-022-01432-0}
}
```
Data Quality
Full collection coverage metrics:
- Glottolog: 95.2% (language identification)
- Concepticon: 84.7% (concept standardization)
- Cognate data: 27.8% of forms
- Segmentation: 77.7% of forms
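Coverage figures like those above can be recomputed from `forms.csv` by counting non-empty values per column. In this sketch, `Cognacy` and `Concepticon_Gloss` appear in the Quick Start examples, while `Glottocode` and `Segments` are assumed column names for the Glottolog and segmentation fields; verify them against the actual file header. A toy DataFrame stands in for the real file:

```python
import pandas as pd

# Toy stand-in for forms.csv; Glottocode and Segments are assumed
# column names and should be checked against the real header.
forms = pd.DataFrame({
    'Glottocode': ['stan1293', 'stan1290', None, 'stan1293'],
    'Concepticon_Gloss': ['HAND', 'HAND', 'WATER', None],
    'Cognacy': ['1', None, None, '7'],
    'Segments': ['m a n o', 'm ɛ̃', None, 'a g w a'],
})

# Share of forms with a non-empty value in each annotation column.
coverage = {col: forms[col].notna().mean() for col in forms.columns}
for col, share in coverage.items():
    print(f'{col}: {share:.1%}')
```

Running the same loop on the Full collection should reproduce the percentages listed above.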
Archive Contents: Each collection includes forms.csv, languages.csv, parameters.csv, metadata.csv, sources.bib, and validation_report.json with complete documentation.
Contact
Tiago Tresoldi
Department of Linguistics and Philology, Uppsala University
tiago.tresoldi@lingfil.uu.se