Digital Scholarship

Introduction to digital scholarship resources, tools, and projects

Text Analysis

Text analysis and text mining are processes that derive information from texts such as novels, monographs, articles, and web pages. You can use text analysis tools to quickly search through a large corpus, generate word clouds, or find word frequencies. You can also perform more complex tasks, such as identifying patterns in parts of speech or detecting sentiments, moods, and emotions in a corpus.
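As a rough illustration of the simplest task mentioned above, finding word frequencies, here is a minimal Python sketch that uses only the standard library. The two-sentence corpus and the function names `tokenize` and `word_frequencies` are hypothetical and exist only for this example; a real project would load its own texts and would likely need a more careful tokenizer.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; replace with your own texts.
corpus = [
    "The whale surfaced and the crew watched the whale.",
    "The crew feared the sea, and the sea was calm.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

frequencies = word_frequencies(corpus)

# Print the three most frequent tokens with their counts.
for word, count in frequencies.most_common(3):
    print(word, count)
```

Very common words such as "the" tend to dominate raw counts, so many analyses remove a stop-word list before counting.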
 

Topic Modeling

Topic modeling is an unsupervised machine learning technique that scans a set of documents (a text-based corpus), detects word and phrase patterns within them, and automatically clusters the word groups and similar expressions that best characterize the corpus. If you are interested in learning about tools, you can reach out to U-M Metadata Engagement Librarian and Text Analysis expert Matt Carruthers or email the digital scholarship team (library-ds@umich.edu).
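To make the idea of topic modeling more concrete, the sketch below implements a toy version of latent Dirichlet allocation (LDA) with collapsed Gibbs sampling, using only the Python standard library. LDA is one common topic-modeling algorithm; this guide does not name a specific one, so treat the choice as an assumption. The function name `fit_lda`, its parameters, and the four tiny documents are hypothetical, and a real project would normally rely on an established library rather than code like this.

```python
import random
from collections import Counter

def fit_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]        # tokens per topic in each doc
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            t = rng.randrange(num_topics)
            doc_assign.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word] += 1
            topic_total[t] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][word] -= 1
                topic_total[t] -= 1
                # Weight each topic by how well it fits this document and word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                # Resample the topic and add the token back to the counts.
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][word] += 1
                topic_total[t] += 1
    return topic_word, doc_topic

# Hypothetical pre-tokenized documents on two loosely distinct themes.
docs = [
    "whale ship sea captain whale".split(),
    "sea ship whale harpoon sea".split(),
    "garden rose tulip bloom garden".split(),
    "rose bloom garden petal tulip".split(),
]
topic_word, doc_topic = fit_lda(docs)

# Show the most frequent words assigned to each topic.
for k, counts in enumerate(topic_word):
    top_words = [w for w, c in counts.most_common(3) if c > 0]
    print(f"Topic {k}: {top_words}")
```

The sampler is random, so the word groupings depend on the seed and on the number of iterations, and a corpus this small will not always separate cleanly. The number of topics is chosen by the researcher, not discovered by the model.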

Text Analysis Tools

  • AntConc is a free desktop corpus analysis toolkit for visualizing concordances and text analysis. AntConc is useful for finding clusters (frequency patterns of word sequences) or n-grams (sequences of n words within your corpus or document). 
  • Voyant is a web-based tool for reading and analyzing your digital texts. It can use texts in a variety of formats including plain text, HTML, XML, PDF, RTF, and MS Word. You can also use it to perform lexical analysis including the study of frequency and distribution data. Learn more about Voyant and how to work with the tool by looking at the Intro to Voyant workshop facilitated by the digital scholarship team.
  • Google Ngram Viewer is an online search engine that charts the frequencies of any set of search strings, using yearly counts of n-grams found in published printed sources.
  • HathiTrust Bookworm is a tool that visualizes language usage trends in repositories of digitized texts. 
  • HathiTrust Analytics supports large-scale computational text analysis, including metadata and word counts, on user-created collections of volumes from the HathiTrust Digital Library.
  • ProQuest TDM Studio is a web-based collaborative platform that allows you to access and analyze large amounts of text data. Using content retrieved from ProQuest databases, you can build your corpus and conduct data analysis, text mining, and visualization to uncover relationships, patterns, and connections within and between datasets.
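
The n-grams and concordances mentioned in the AntConc entry above can be illustrated with a short Python sketch. This is not how AntConc or any other listed tool is implemented; it is a minimal standard-library example, and the sample sentence and the function names `tokenize`, `ngrams`, and `concordance` are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, keyword, window=2):
    """Return keyword-in-context lines with `window` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Hypothetical sample text; replace with your own document.
text = "The white whale dived. The crew chased the white whale all day."
tokens = tokenize(text)

# Count two-word sequences (bigrams) and show the most frequent ones.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))

# Show each occurrence of "whale" with its surrounding words.
for line in concordance(tokens, "whale"):
    print(line)
```

Because the tokenizer discards punctuation, n-grams here can span sentence boundaries; splitting the text into sentences first would avoid that.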

Additional Resources

Below are additional resources for text analysis, including access to datasets, corpora, and collections of packages for data science.


  • Tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
  • Pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
  • TAPoR, or the Text Analysis Portal for Research, is an online portal where users can store and keep track of texts they wish to study, learn about and experiment with different tools, and use those tools to analyze text.


If you have further questions on tools, resources or text analysis methods, you can contact U-M Metadata Engagement Librarian and Text Analysis expert Matt Carruthers or email the digital scholarship team at library-ds@umich.edu to schedule a consultation.