
Text Data Mining: TDM Tools

Voyant

Voyant is a tool that allows for lightweight text analytics.

Constellate

Constellate is a platform for learning and performing text analysis, building datasets, and sharing analytics course materials from JSTOR and Portico.

HathiTrust Research Center (HTRC)

HathiTrust Research Center (HTRC) facilitates text and data mining of the HathiTrust corpus, which contains over 18 million items digitized by partner libraries.

TEI

The Text Encoding Initiative (TEI) is a consortium that collectively develops and maintains a standard for the representation of texts in digital form. The standard is particularly useful for handwritten texts that predate the printing press.

Topic Modeling

"Topic modeling is a form of text mining, a way of identifying patterns in a corpus. You take your corpus and run it through a tool which groups words across the corpus into 'topics'."

Concordance

A concordance is a listing of each word in a text (corpus) and the words that occur near it. "Key Word In Context" (KWIC) is a type of concordance.
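To make the idea concrete, here is a minimal Python sketch of a KWIC concordance. The function name `kwic`, the `window` parameter, and the sample sentence are illustrative assumptions, not part of any tool listed on this page, and the simple tokenizer (lowercasing, splitting on word characters) is only one of many possible choices.

```python
import re


def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence.

    `window` is the number of words kept on each side of the keyword.
    """
    # Naive tokenization: lowercase the text and keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Made-up sample text, for illustration only.
sample = "The whale surfaced. The crew saw the whale and the whale saw the crew."

for left, key, right in kwic(sample, "whale", window=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} [{key}] {right}")
```

Lining up the keyword in a single column, as the final loop does, is what makes a KWIC display easy to scan for recurring patterns around a word.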

Ngram

An n-gram is a sequence of n items (such as characters or words) from a given sample of text or speech.
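As a rough illustration, the Python sketch below extracts n-grams from a list of words or from the characters of a string. The helper name `ngrams` and the sample phrase are assumptions made for this example; text analysis libraries offer their own equivalents.

```python
from collections import Counter


def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    items = list(items)
    # Slide a window of length n across the sequence, one step at a time.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "to be or not to be".split()

bigrams = ngrams(words, 2)       # word-level 2-grams
char_trigrams = ngrams("text", 3)  # character-level 3-grams

# Counting n-grams shows which sequences repeat in the sample.
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(1))
```

The same function works for words and characters because it only assumes an ordered sequence of items, which matches the definition above.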

Text analysis glossary

API

An Application Programming Interface, or API, is a software interface that allows two or more computer programs to communicate. APIs can be used to download large amounts of data from a website automatically, without a person clicking through each page. Using an API requires some technical or programming knowledge.
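The Python sketch below suggests what this can look like in practice. Everything specific in it is hypothetical: the address `https://api.example.org/v1/search`, the parameter names `q`, `page`, and `size`, and the assumption that the service returns JSON. A real API's documentation will specify its own addresses, parameters, rate limits, and any required access keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.org/v1/search"


def build_page_url(query, page, page_size=100):
    """Build the request URL for one page of results (parameter names are assumed)."""
    params = {"q": query, "page": page, "size": page_size}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_page(url):
    """Download one page and parse it as JSON (assumes the API returns JSON)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Build the URLs for the first three pages. No request is sent here;
# calling fetch_page(url) on a real endpoint would perform the download.
urls = [build_page_url("topic modeling", page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Requesting results page by page in a loop is what lets a program collect a large dataset without a person repeating the search by hand.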

Data Cleaning