Arca Verborum
A Multi-Source Lexical Database for Computational Historical Linguistics
What is Arca Verborum?
Arca Verborum is a project providing analysis-ready lexical databases for computational historical linguistics. The project integrates data from multiple sources, structured for immediate use in research and education. An informal announcement and explanation is available in a blog post by the author.
Data Series
Arca Verborum organizes data into distinct series, each derived from different sources but sharing a common structure:
- Series A – Comparative wordlists from the Lexibank initiative: 149 datasets, 2.9M forms across 9,700+ languages
- Series B (planned) – Wiktionary-derived etymological and lexical data
Why Series A?
While the normalized structure of CLDF (Cross-Linguistic Data Formats) is excellent for data integrity, it requires significant preprocessing before analysis. Series A provides denormalized, pre-joined CSV files so you can start working immediately. It is intended for rapid method development and prototyping, student projects, and teaching computational linguistics, and will be used to bootstrap and validate other series.
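To illustrate what "denormalized, pre-joined" means in practice, the sketch below contrasts the two layouts with toy tables. The column names (`ID`, `Language_ID`, `Parameter_ID`, `Form`) follow general CLDF conventions and are illustrative only, not the exact Series A schema:

```python
import pandas as pd

# Normalized CLDF-style layout: three separate tables linked by IDs.
forms = pd.DataFrame({
    'ID': ['f1', 'f2'],
    'Language_ID': ['lang1', 'lang2'],
    'Parameter_ID': ['HAND', 'HAND'],
    'Form': ['mano', 'main'],
})
languages = pd.DataFrame({'ID': ['lang1', 'lang2'],
                          'Name': ['Spanish', 'French']})
parameters = pd.DataFrame({'ID': ['HAND'],
                           'Concepticon_Gloss': ['HAND']})

# Before analysis, a CLDF user must join the tables themselves:
joined = (forms
          .merge(languages, left_on='Language_ID', right_on='ID',
                 suffixes=('', '_language'))
          .merge(parameters, left_on='Parameter_ID', right_on='ID',
                 suffixes=('', '_parameter')))

# Series A ships the equivalent of `joined` as a single wide forms.csv,
# so each row already carries its language and concept metadata.
print(joined[['Form', 'Name', 'Concepticon_Gloss']])
```

The pre-joined layout trades some redundancy (language metadata repeated on every row) for zero setup cost at analysis time.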
Collections
Series A is available in three collections tailored for different use cases:
| | Full | Core | CoreCog |
|---|---|---|---|
| Datasets | 149 | 13 | 58 |
| Forms | 2,915,377 | 255,451 | 451,921 |
| Languages | 9,738 | 1,748 | 2,462 |
| Concepts | 168,263 | 4,379 | 27,794 |
Full Collection: Complete dataset with all 149 repositories.
Core Collection: 13 curated datasets for teaching, with global coverage.
CoreCog Collection: 58 datasets with expert cognate judgments for cognate research.
All collections share a single DOI: https://doi.org/10.5281/zenodo.17294927
Quick Start
Python
```python
import csv

# Load the data (example with Core collection)
with open('arcaverborum-A-core-20251008/forms.csv', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    forms = list(reader)

# Forms with cognate judgments
cognate_forms = [f for f in forms if f['Cognacy']]
```
Python / Pandas
```python
import pandas as pd

# Load the data (example with Core collection)
forms = pd.read_csv('arcaverborum-A-core-20251008/forms.csv')

# Forms with cognate judgments
cognate_forms = forms[forms['Cognacy'].notna()]
```
R
```r
library(tidyverse)

# Load the data (example with Core collection)
forms <- read_csv('arcaverborum-A-core-20251008/forms.csv')

# Forms with Concepticon mapping
forms %>% filter(!is.na(Concepticon_Gloss))
```
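A common next step after loading is grouping cognate-coded forms into cognate sets per concept. The following is a minimal pandas sketch using the `Cognacy` and `Concepticon_Gloss` columns from the snippets above, with a toy DataFrame standing in for `forms.csv`; the exact semantics of cognate-set identifiers vary by dataset, so treat this as illustrative:

```python
import pandas as pd

# Toy stand-in for forms.csv, with the columns used in the Quick Start.
forms = pd.DataFrame({
    'Form': ['mano', 'main', 'hand', 'agua', 'eau'],
    'Concepticon_Gloss': ['HAND', 'HAND', 'HAND', 'WATER', 'WATER'],
    'Cognacy': ['1', '1', '2', '7', '7'],
})

# Keep only forms with a cognate judgment, then group them into
# cognate sets keyed by (concept, cognate-set ID).
coded = forms[forms['Cognacy'].notna()]
cognate_sets = (coded
                .groupby(['Concepticon_Gloss', 'Cognacy'])['Form']
                .apply(list))
print(cognate_sets)
```

On the real data, cognate-set identifiers are generally only comparable within a dataset, so grouping should usually also include a dataset identifier.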
Citation
If you use this dataset, please cite both Arca Verborum and Lexibank:
Arca Verborum
```bibtex
@dataset{tresoldi2025_arcaverborum,
  author    = {Tresoldi, Tiago},
  title     = {Arca Verborum: A Global Lexical Database for
               Computational Historical Linguistics},
  year      = {2025},
  publisher = {Zenodo},
  version   = {A.20251008},
  doi       = {10.5281/zenodo.17294927},
  url       = {https://doi.org/10.5281/zenodo.17294927}
}
```
Lexibank
```bibtex
@article{list2022lexibank,
  author  = {List, Johann-Mattis and Forkel, Robert and
             Greenhill, Simon J. and Rzymski, Christoph and
             Englisch, Johannes and Gray, Russell D.},
  title   = {Lexibank, a public repository of standardized
             wordlists with computed phonological and lexical
             features},
  journal = {Scientific Data},
  volume  = {9},
  number  = {316},
  year    = {2022},
  doi     = {10.1038/s41597-022-01432-0},
  url     = {https://doi.org/10.1038/s41597-022-01432-0}
}
```
Data Quality
Full collection coverage metrics:
- Glottolog: 95.2% (language identification)
- Concepticon: 84.7% (concept standardization)
- Cognate data: 27.8% of forms
- Segmentation: 77.7% of forms
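Coverage figures like those above can be recomputed from `forms.csv` by counting non-empty values per column. In this sketch, `Cognacy` and `Concepticon_Gloss` appear in the Quick Start examples, while `Glottocode` and `Segments` are assumed column names for the Glottolog and segmentation fields; verify them against the actual file header. A toy DataFrame stands in for the real file:

```python
import pandas as pd

# Toy stand-in for forms.csv; Glottocode and Segments are assumed
# column names and should be checked against the real header.
forms = pd.DataFrame({
    'Glottocode': ['stan1293', 'stan1290', None, 'stan1293'],
    'Concepticon_Gloss': ['HAND', 'HAND', 'WATER', None],
    'Cognacy': ['1', None, None, '7'],
    'Segments': ['m a n o', 'm ɛ̃', None, 'a g w a'],
})

# Share of forms with a non-empty value in each annotation column.
coverage = {col: forms[col].notna().mean() for col in forms.columns}
for col, share in coverage.items():
    print(f'{col}: {share:.1%}')
```

Running the same loop on the Full collection should reproduce the percentages listed above.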
Archive Contents: Each collection includes forms.csv, languages.csv, parameters.csv, metadata.csv, sources.bib, and validation_report.json with complete documentation.
Contact
Tiago Tresoldi
Department of Linguistics and Philology, Uppsala University
tiago.tresoldi@lingfil.uu.se