Arca Verborum

A Multi-Source Lexical Database for Computational Historical Linguistics

What is Arca Verborum?

Arca Verborum is a project providing analysis-ready lexical databases for computational historical linguistics. The project integrates data from multiple sources, structured for immediate use in research and education. An informal announcement and explanation is available in a blog post by the author.

Data Series

Arca Verborum organizes data into distinct series, each derived from different sources but sharing a common structure.

Why Series A?

While CLDF's normalized structure is excellent for data integrity, it requires significant preprocessing before analysis. Series A provides denormalized, pre-joined CSV files so you can start working immediately. It is intended for rapid method development and prototyping, student projects, and teaching computational linguistics, and will be used to bootstrap and validate other series.

Collections

Series A is available in three collections tailored for different use cases:

                 Full      Core   CoreCog
Datasets          149        13        58
Forms       2,915,377   255,451   451,921
Languages       9,738     1,748     2,462
Concepts      168,263     4,379    27,794

Full Collection: Complete dataset with all 149 repositories.

Core Collection: 13 curated datasets for teaching, with global coverage.

CoreCog Collection: 58 datasets with expert cognate judgments for cognate research.


All collections share a single DOI: https://doi.org/10.5281/zenodo.17294927

Quick Start

Python

import csv

# Load the data (example with Core collection)
with open('arcaverborum-A-core-20251008/forms.csv', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    forms = list(reader)

# Forms with cognate judgments (csv yields strings, so empty
# Cognacy fields are falsy and filtered out)
cognate_forms = [row for row in forms if row['Cognacy']]
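Once the rows are loaded, cognate judgments can be grouped into cognate sets by their shared Cognacy identifier. A minimal sketch, using synthetic rows in place of forms.csv; the 'Cognacy' column is shown in the example above, while 'Language_ID' and 'Form' are assumed here from standard CLDF naming:

```python
from collections import defaultdict

# Synthetic stand-ins for rows read from forms.csv.
forms = [
    {'Language_ID': 'lang_a', 'Form': 'aqua', 'Cognacy': '1'},
    {'Language_ID': 'lang_b', 'Form': 'eau', 'Cognacy': '1'},
    {'Language_ID': 'lang_c', 'Form': 'water', 'Cognacy': '2'},
    {'Language_ID': 'lang_d', 'Form': 'wasser', 'Cognacy': ''},
]

# Group forms into cognate sets keyed by the Cognacy identifier.
cognate_sets = defaultdict(list)
for row in forms:
    if row['Cognacy']:  # skip forms without a cognate judgment
        cognate_sets[row['Cognacy']].append(row['Form'])

print(dict(cognate_sets))  # {'1': ['aqua', 'eau'], '2': ['water']}
```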

Python / Pandas

import pandas as pd

# Load the data (example with Core collection)
forms = pd.read_csv('arcaverborum-A-core-20251008/forms.csv')

# Forms with cognate judgments
cognate_forms = forms[forms['Cognacy'].notna()]
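With pandas, the same columns support quick coverage summaries, e.g. the share of forms per language that carry a cognate judgment. A sketch with a synthetic DataFrame standing in for forms.csv; 'Cognacy' appears in the example above, while 'Language_ID' is an assumed (CLDF-style) column name:

```python
import pandas as pd

# Synthetic stand-in for forms.csv.
forms = pd.DataFrame({
    'Language_ID': ['lang_a', 'lang_a', 'lang_b', 'lang_b'],
    'Cognacy': ['1', None, '1', '2'],
})

# Fraction of forms with a cognate judgment, per language.
coverage = (
    forms.assign(has_cognacy=forms['Cognacy'].notna())
         .groupby('Language_ID')['has_cognacy']
         .mean()
)
print(coverage)  # lang_a: 0.5, lang_b: 1.0
```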

R

library(tidyverse)

# Load the data (example with Core collection)
forms <- read_csv('arcaverborum-A-core-20251008/forms.csv')

# Forms with Concepticon mapping
forms %>% filter(!is.na(Concepticon_Gloss))

Citation

If you use this dataset, please cite both Arca Verborum and Lexibank:

Arca Verborum

@dataset{tresoldi2025_arcaverborum,
  author = {Tresoldi, Tiago},
  title = {Arca Verborum: A Global Lexical Database for
           Computational Historical Linguistics},
  year = 2025,
  publisher = {Zenodo},
  version = {A.20251008},
  doi = {10.5281/zenodo.17294927},
  url = {https://doi.org/10.5281/zenodo.17294927}
}

Lexibank

@article{list2022lexibank,
  author = {List, Johann-Mattis and Forkel, Robert and
            Greenhill, Simon J. and Rzymski, Christoph and
            Englisch, Johannes and Gray, Russell D.},
  title = {Lexibank, a public repository of standardized
           wordlists with computed phonological and lexical
           features},
  journal = {Scientific Data},
  volume = {9},
  number = {316},
  year = {2022},
  doi = {10.1038/s41597-022-01432-0},
  url = {https://doi.org/10.1038/s41597-022-01432-0}
}

Data Quality

Full collection coverage metrics:

Archive Contents: Each collection includes forms.csv, languages.csv, parameters.csv, metadata.csv, sources.bib, and validation_report.json with complete documentation.

Contact

Tiago Tresoldi
Department of Linguistics and Philology, Uppsala University
tiago.tresoldi@lingfil.uu.se

Related Work