Linguistics Resources

Electronic resources supporting teaching and research in linguistics. Included are indexes to articles in scholarly journals, online full-text sources, encyclopedias, dictionaries, and other resources.

Frequently Accessed Corpora

A corpus is a searchable database of language samples for linguistic research. A corpus may be based on written or spoken language. Some corpora are tagged or annotated by part of speech; other corpora are plain text.

Linguistic Data Consortium
600+ corpora of text and spoken language. U-M Library provides access to LDC corpora from 2017 onward to those currently affiliated with U-M. Many earlier corpora are also available. To begin, create an account on the LDC website; the U-M administrator for LDC will then contact you. For more information contact cness@umich.edu.
English-Corpora.org
Previously known as the "BYU Corpora", these are some of the best-known corpora of American and British English. Includes COCA, COHA, and the British National Corpus. This link is to the U-M institutional account, with higher search limits for U-M researchers. Corpora can be searched using the website's built-in tool, but data cannot be downloaded from the website. GloWbE, COHA, and the COCA word-frequency lists are available to U-M affiliates for download from the library catalog; use the links below or contact the linguistics librarian. For downloadable full-text access to other corpora, please ask the linguistics librarian to assist with a purchase. An overview of how to search the corpora on the website can be found under the Overview tab.
COHA: The Corpus of Historical American English
Part of the English-Corpora.org collection. The Corpus of Historical American English (COHA) is a 400-million-word corpus of text published from the 1810s to the 2000s. It is the largest structured corpus of historical English (or any language, for that matter). COHA is related to other large corpora, including the Corpus of Contemporary American English (COCA), the 100-million-word TIME Corpus (1920s-2000s), and the British National Corpus. This link is to the U-M institutional account, with higher search limits for U-M researchers. A COHA sample is available to U-M affiliates for download. For downloadable full-text access, please ask the linguistics librarian to assist with a purchase.
COCA: The Corpus of Contemporary American English
Part of the English-Corpora.org collection. The Corpus of Contemporary American English (COCA) is the largest freely available corpus of American English, with over 1 billion words, and the only large, genre-balanced corpus of American English. COCA was released in 2008. Texts are from 1990-2019. This link is to the U-M institutional account, with higher search limits for U-M researchers. Sample COCA high-frequency word lists are also available for download. For downloadable full-text access, please ask the linguistics librarian to assist with a purchase.

Available Corpora for Download

Corpus del Español
Not yet archived; link forthcoming. To access the corpus, please contact the linguistics librarian. The corpora from Corpus del Español provide billions of words of recent data from 21 Spanish-speaking countries, and they allow researchers, students, and teachers to gain insight into Spanish.
COHA: Corpus of Historical American English Full-Text (2024)
Not yet archived; link forthcoming. To access the corpus, please contact the linguistics librarian.
COHA: Corpus of Historical American English (2020)
The Corpus of Historical American English (COHA) is the largest structured corpus of historical English. COHA contains more than 475 million words of text from the 1820s-2010s.
COCA: Corpus of Contemporary American English (100,000 word-frequency data, 2018)
The Corpus of Contemporary American English (COCA) is the only large and "balanced" corpus of American English. The corpus contains more than one billion words of text from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, TV and movie subtitles, blogs, and other web pages.
GloWbE: Corpus of Global Web-based English (2019)
The Corpus of Global Web-based English (GloWbE; pronounced "globe") allows you to compare different varieties of English. GloWbE contains about 1.9 billion words of text from twenty different countries.

Other Corpora and Text Collections

American English Dialect Recordings
Collection of 350 audio recordings documenting North American English dialects. Recordings were collected 1941-1984. Recordings include speech samples, linguistic interviews, oral histories, conversations, and excerpts from public speeches. Some recordings have transcripts. They were drawn from various archives and from the private collections of fifty collectors, including linguists, dialectologists, and folklorists.
BNC: British National Corpus
BYU-BNC is a 100-million-word corpus of contemporary British English, based on written and spoken samples. It was created by Oxford University Press in the 1980s to early 1990s. This link is to the U-M institutional account, with higher search limits for U-M researchers.
Buckeye Speech Corpus
Corpus of conversational speech with a northern U.S. accent. Transcribed and phonetically labeled. Audio and text files for use with speech analysis software. Free for noncommercial use but requires individual registration.
Child Language Data Exchange System (CHILDES)
Database of transcribed audio recordings of conversations with children. Samples are in English and in 25 other languages. Transcriptions and media can be downloaded to CLAN software. Data is transcribed in CHAT format.
Corpus Inventory
List of LDC and non-LDC corpora held at Stanford University.
Early English Books Online (EEBO)
Page images of nearly all books published in England, Ireland, Scotland, Wales and British North America, and works in English printed elsewhere, from 1473-1700. SGML text encoding provided by U-M's Text Creation Partnership for over 44,000 of the texts.
ELAR (Endangered Languages Archive)
A digital repository preserving and publishing endangered language documentation materials from around the world. Free, but registration is required to access files. ELAR was originally funded by Arcadia and is part of the Library of SOAS University of London.
Ethnologue Global Dataset
GIS data derived from the Ethnologue database, 24th edition, 2001. Contains the selected, raw linguistic data used to create Ethnologue. Files are in the standard tab-delimited format, which can be loaded into virtually any spreadsheet, database, or other data analysis tool. Note that this dataset does not include commentary, but rather, data fields with simple values that can be submitted to statistical analysis.
Google Books Ngram Viewer
Search 500 billion words of text to see changes in the frequency of words and phrases. When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., "British English", "English Fiction", "French") over the selected years.

HathiTrust Digital Library
Massive digital library bringing together materials digitized by Google, the Internet Archive and libraries at 90 partner institutions. HathiTrust is open to all for searching and reading books. Only those from member institutions can download books, where available.

HTRC Analytics
HathiTrust Research Center. Supports large-scale computational analysis of the works in the HathiTrust Digital Library. Available to researchers affiliated with member libraries.

Helsinki Corpus
A selection of texts covering the Old, Middle, and Early Modern English periods.

International Corpus of English-GB
Documenting English as it is used globally, 24 teams in countries around the world are compiling a corpus of one million words of English as it is spoken in their country. Each corpus has been grammatically analyzed and has its own website. This is the ICE-Great Britain corpus, from University College London, the British component of the International Corpus of English.
Linguistic Atlas Project (LAP)
Information about English as it is spoken in the United States. Most of the projects included are from survey research carried out 1930-1980; some are more recent.

Michigan Corpus of Academic Spoken English (MICASE)
Collection of nearly 1.8 million words of transcribed speech (almost 200 hours of recordings) from the University of Michigan in Ann Arbor, created by researchers and students at the U-M English Language Institute (ELI). MICASE contains data from a wide range of speech events (including lectures, classroom discussions, lab sections, seminars, and advising sessions) and locations across the university. You can download entire transcripts in XML format.

OLAC: Language Resource Catalog
The Open Language Archives Community. Combined catalog of 60 language archives, both digital and traditional, including many that focus on endangered languages. Includes text collections, audio recordings, dictionaries, and software.
Speech Accent Archive
Close to 2,000 transcribed audio files of English speakers with various accents reading the same paragraph.
TS Corpus: online corpus of Turkish
Project to build corpora of Turkish. Contains over a billion POS-tagged tokens. TS Corpus is constructed and maintained by Taner and Turker Sezer.
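
Some of the resources listed above are distributed as plain tab-delimited text (the Ethnologue Global Dataset, for example). The following is a minimal sketch of how such a file might be loaded and summarized in Python using only the standard library. The sample data and the column names (iso_code, language, country) are invented for illustration and are not the actual fields of any dataset listed here.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded tab-delimited file.
# The column names (iso_code, language, country) are assumptions made
# for illustration only; check the real file's header row before use.
SAMPLE = (
    "iso_code\tlanguage\tcountry\n"
    "aaa\tLanguage A\tCountry X\n"
    "bbb\tLanguage B\tCountry X\n"
    "ccc\tLanguage C\tCountry Y\n"
)


def read_tab_delimited(handle):
    """Return the rows of a tab-delimited file as dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def count_by(rows, column):
    """Count how many rows share each value in the given column."""
    return Counter(row[column] for row in rows)


rows = read_tab_delimited(io.StringIO(SAMPLE))
per_country = count_by(rows, "country")
print(per_country.most_common())
```

For a real download, the in-memory sample would be replaced by a file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the file was saved.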

Last Updated: Sep 30, 2025 1:34 PM
