Research Guides: Search for Text Data Sets: Articles & Books

Research Articles & Books from Databases

JSTOR

JSTOR: Text analysis support
Get access to the metadata and full-text of available JSTOR journals, books, research reports, and pamphlets for text analysis and digital humanities research.

Wiley

Wiley
Wiley databases contain research articles and books related to social science and the sciences.
Crossref
Wiley's preferred access solution for TDM is the Crossref Text and Data Mining Service. Academic subscribers can register with Crossref and will then be able to access subscribed content once they have accepted the Wiley click-through TDM license and received an API token.
Wiley Text and Data Mining Policy
Academic subscribers can perform TDM under license on subscribed content for non-commercial purposes at no extra cost. Wiley prefers that access to content for TDM purposes take place through an approved API service.
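As a rough illustration of the Crossref route described above, the sketch below filters a Crossref work record for links flagged for text mining. It assumes the record follows the Crossref REST API shape (a `message` object with a `link` list whose entries carry `URL`, `content-type`, and `intended-application`). The sample record is hypothetical, and the step of sending the Wiley API token with the download request is not shown.

```python
from typing import Any


def text_mining_links(work: dict[str, Any]) -> list[dict[str, str]]:
    """Return the full-text links that a Crossref work record flags for text mining."""
    # Accept either the full API response or just its "message" payload.
    links = work.get("message", work).get("link", [])
    return [
        {"url": link["URL"], "content_type": link.get("content-type", "unspecified")}
        for link in links
        if link.get("intended-application") == "text-mining"
    ]


# Hypothetical record, trimmed to the fields this sketch reads.
sample_work = {
    "message": {
        "DOI": "10.1002/example",
        "link": [
            {
                "URL": "https://example.org/fulltext.pdf",
                "content-type": "application/pdf",
                "intended-application": "text-mining",
            },
            {
                "URL": "https://example.org/landing",
                "content-type": "text/html",
                "intended-application": "similarity-checking",
            },
        ],
    }
}

for link in text_mining_links(sample_work):
    print(link["content_type"], link["url"])
```

Only the links returned by this filter are candidates for TDM downloads; whether a given link is accessible still depends on the subscription and the accepted license.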

Elsevier

ScienceDirect
Full-text science database with articles and books

Elsevier API information
To mine full-text content hosted on ScienceDirect, you will need to download it through Elsevier's API, which is designed for text mining. You can access the API via Elsevier's developer portal.
Elsevier text and data mining policy

JAMA

JAMA Network Text and Data Mining Services
JAMA Network provides individuals who have access to its content the ability to download aggregated metadata. Note: Data access is restricted to the JAMA Network journals that SMU subscribes to. To see what SMU subscribes to, search the library catalog for items with JAMA in the title.
Download the JAMA TDM Terms and Conditions here

OA Publishers and Indexes

arXiv
An open-access archive for scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
arXiv API (metadata)
Data Mining Instructions: Use the arXiv API to access arXiv data, search, and linking facilities. The API can only be used to download metadata, not full text. No key is required. Data delivered in Atom XML format.
arXiv API Bulk Data Access - Amazon S3
Full-text articles are available in bulk from arXiv's Amazon S3 buckets. The buckets are configured as requester-pays, so the researcher pays Amazon's data transfer charges for each download.
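A minimal sketch of working with the arXiv metadata API mentioned above, assuming the public query endpoint at `export.arxiv.org/api/query` and its `search_query`, `start`, and `max_results` parameters. The sample feed is a hypothetical stand-in for a real response, so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def arxiv_query_url(search_query: str, start: int = 0, max_results: int = 10) -> str:
    """Build a metadata query URL; no API key is needed."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_arxiv_feed(atom_xml: str) -> list[dict[str, str]]:
    """Extract id, title, and abstract from each entry of an Atom feed."""
    root = ET.fromstring(atom_xml)
    records = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        records.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM).strip(),
            # Titles and abstracts are often wrapped across lines; collapse whitespace.
            "title": " ".join(title.split()),
            "summary": " ".join(summary.split()),
        })
    return records


# Hypothetical feed with placeholder values.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A placeholder
      title</title>
    <summary>Placeholder abstract.</summary>
  </entry>
</feed>"""

print(arxiv_query_url("all:electron", max_results=5))
print(parse_arxiv_feed(SAMPLE_FEED))
```

In practice the URL would be fetched with a library such as `urllib.request` and the response body passed to `parse_arxiv_feed`.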

PLOS ONE (Public Library of Science)
Science & medicine articles

PLoS Text Mining
PLoS offers two APIs for data retrieval. The Article-Level Metrics API retrieves data regarding an article’s usage statistics to demonstrate its reach. The Search API provides the ability to query PLoS content across their journals. Data delivered in XML or JSON format. An API key is required to access either API.
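The Search API described above can be sketched as follows. This is an illustration only: it assumes the endpoint `https://api.plos.org/search`, Solr-style parameters (`q`, `fl`, `rows`, `wt`), an `api_key` parameter, and a response whose documents sit under `response.docs`. The sample response is hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode

PLOS_SEARCH = "https://api.plos.org/search"  # assumed endpoint


def plos_search_url(query: str, fields=("id", "title"), rows: int = 10,
                    api_key: Optional[str] = None) -> str:
    """Build a search URL that asks for JSON and only the listed fields."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    if api_key:
        params["api_key"] = api_key  # assumed parameter name
    return f"{PLOS_SEARCH}?{urlencode(params)}"


def plos_docs(response_text: str) -> list[dict]:
    """Return the list of matching documents from a JSON response body."""
    return json.loads(response_text).get("response", {}).get("docs", [])


# Hypothetical response body with placeholder values.
SAMPLE_RESPONSE = json.dumps({
    "response": {
        "numFound": 1,
        "docs": [{"id": "10.1371/journal.pone.0000000", "title": "Placeholder title"}],
    }
})

print(plos_search_url('title:"text mining"', rows=5))
for doc in plos_docs(SAMPLE_RESPONSE):
    print(doc["id"], doc["title"])
```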

PubMed
Biomedical articles from the 1950s to the present

Europe PubMed Central
A RESTful web service giving you access to publications and related information in the Europe PubMed Central database.
PubMed Central (PMC) Open Access Subset
The PMC Open Access Subset includes millions of journal articles and preprints that are made available under license terms that allow reuse. Not all articles in PMC are available for text mining or other reuse; many are under copyright.
PubMed text mining tools
List of web applications and software that can be used on PubMed.
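For the Europe PubMed Central web service listed above, a search request might be assembled as in the sketch below. It assumes the REST search endpoint under `www.ebi.ac.uk/europepmc/webservices/rest/search`, the `OPEN_ACCESS:y` search field for restricting results to reusable articles, and a JSON response with `hitCount` and `resultList.result`. The sample response is hypothetical.

```python
from urllib.parse import urlencode

EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def europe_pmc_search_url(query: str, page_size: int = 25,
                          open_access_only: bool = True) -> str:
    """Build a JSON search URL, optionally limited to open access articles."""
    if open_access_only:
        query = f"({query}) AND OPEN_ACCESS:y"
    params = {"query": query, "format": "json", "resultType": "lite", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def summarize_results(response: dict) -> tuple[int, list[tuple[str, str]]]:
    """Return the total hit count and (id, title) pairs for the current page."""
    results = response.get("resultList", {}).get("result", [])
    pairs = [(r.get("id", ""), r.get("title", "")) for r in results]
    return response.get("hitCount", 0), pairs


# Hypothetical response with placeholder values.
sample_response = {
    "hitCount": 1,
    "resultList": {"result": [{"id": "00000000", "title": "Placeholder title."}]},
}

print(europe_pmc_search_url("malaria", page_size=10))
print(summarize_results(sample_response))
```

Restricting the query to open access records matters because, as noted above, many articles in PMC remain under copyright and are not available for reuse.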
Core
CORE is an initiative in the UK to harvest and maintain metadata and full-text content from Open Access journals and repositories across the world.
CORE API
Data delivered in JSON format.
Biodiversity Heritage Library
The Biodiversity Heritage Library is an online collection of scientific texts focused on natural history, biology, botany, and other natural sciences. It contains both scholarly journal articles and books.
Biodiversity Heritage Library API
Request an API key to access the BHL API. Data delivered in JSON or XML format.
(BioMed Central) BMC API
Their public API is a RESTful API for retrieving open access content published by BMC. Resources are represented in JSON and Prism Aggregate (PAM) formats.

Books From Digital Libraries

Libraries and archives make some digitized content available for text analysis. Due to copyright restrictions, the available texts were primarily created before the early twentieth century.

HathiTrust

HathiTrust Digital Library
For digital humanities research, includes text mining tools

Current SMU students and employees with documented print disabilities are eligible to access additional materials in HathiTrust. Contact libraryaccess@smu.edu for more information.

(HTRC) HathiTrust Research Center Analytics
Provides some computational analysis tools for HathiTrust and contains the portal to access the Data Capsule.


The Data Capsule is a secure virtual environment for non-consumptive text analysis of HathiTrust Digital Library content, meaning that the text itself cannot be reproduced from the results. At the end of the analysis, the researcher requests an extraction of the data, and HathiTrust strips out any features that would allow the text to be reconstructed. The Extracted Features datasets are completed examples of this method and are freely available. To get started, create an HTRC Analytics account, then sign up for the Data Capsule.
HathiTrust APIs
HathiTrust offers a few different tools. The Bibliographic API can retrieve small amounts of bibliographic records. The Data API can retrieve content such as page scans and OCR text. In-copyright works are available under special contract; otherwise, only public domain works can be retrieved with the Data API. Data delivered in JSON or XML format.
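A small sketch of a Bibliographic API lookup, under stated assumptions: the URL pattern `catalog.hathitrust.org/api/volumes/{brief|full}/{id type}/{id}.json`, and a response with an `items` list whose entries carry `htid` and `usRightsString`. The field names and the "Full view" rights label should be checked against HathiTrust's documentation; the sample response is hypothetical.

```python
BIB_API = "https://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type: str, value: str, detail: str = "brief") -> str:
    """Build a lookup URL for a single identifier (for example an OCLC number)."""
    if detail not in {"brief", "full"}:
        raise ValueError("detail must be 'brief' or 'full'")
    return f"{BIB_API}/{detail}/{id_type}/{value}.json"


def full_view_ids(response: dict) -> list[str]:
    """Return volume ids whose rights string indicates full view (assumed label)."""
    return [
        item["htid"]
        for item in response.get("items", [])
        if item.get("usRightsString") == "Full view"
    ]


# Hypothetical response with placeholder identifiers.
sample_response = {
    "records": {"000000000": {"titles": ["Placeholder title"]}},
    "items": [
        {"htid": "mdp.000000000001", "usRightsString": "Full view"},
        {"htid": "mdp.000000000002", "usRightsString": "Limited (search-only)"},
    ],
}

print(bib_api_url("oclc", "424023"))
print(full_view_ids(sample_response))
```

Filtering on rights before requesting content mirrors the restriction above: without a special contract, only public domain works can be retrieved with the Data API.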

Internet Archive

Internet Archive (IA)
The Internet Archive provides digitized print resources and born-digital content, with a special focus on web pages and digitized books.
IA Downloading in bulk
IA suggests using wget to download files from their site in bulk. There is no download limit, but they recommend downloading only 10,000 items per query to prevent errors. Data delivered in XML or JSON format.
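One way to respect the 10,000-item recommendation above is to split a long identifier list into batches before handing each batch to wget. The sketch below is an assumption-laden illustration: the identifiers are hypothetical, and the `https://archive.org/download/<identifier>` URL pattern is assumed for item downloads.

```python
from itertools import islice
from typing import Iterable, Iterator

MAX_ITEMS_PER_QUERY = 10_000  # recommendation quoted in the guide


def batched(identifiers: Iterable[str], size: int = MAX_ITEMS_PER_QUERY) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` identifiers."""
    iterator = iter(identifiers)
    while batch := list(islice(iterator, size)):
        yield batch


def wget_input(batch: list[str]) -> str:
    """Render one batch as a newline-separated URL list for `wget -i`."""
    return "\n".join(f"https://archive.org/download/{identifier}" for identifier in batch)


# Hypothetical identifiers standing in for the results of a search.
identifiers = [f"example-item-{n:05d}" for n in range(25_000)]

for number, batch in enumerate(batched(identifiers), start=1):
    url_list = wget_input(batch)
    print(f"batch {number}: {len(batch)} items, first URL {url_list.splitlines()[0]}")
```

Each rendered list can be saved to a text file and passed to wget with its `-i` option, one batch at a time.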

Project Gutenberg

Project Gutenberg
Project Gutenberg offers free ebooks for public use. Works are available in many languages, but most are in English. All of its ebooks are in the public domain in the United States, generally because the copyright has expired, so most titles were originally published in the early twentieth century or earlier.
Project Gutenberg Robot Access to Our Pages
Data Mining Instructions: Project Gutenberg states that they will block any perceived use of automated tools to access their site, with some exceptions. This link has information on how bulk downloads are allowed.


To download all eBook files, set up a personal mirror site.
To download only some eBook files, use wget software.
You may also download a complete catalog data file in RDF/XML directly from their site for metadata research purposes. Data delivered in RDF/XML or a compressed folder.
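For metadata research with the RDF/XML catalog mentioned above, a parser along these lines may be a starting point. It assumes each book is a `pgterms:ebook` element with an `rdf:about` attribute, a `dcterms:title` child, and language codes stored in `rdf:value` elements under `dcterms:language`. The namespace URIs and the trimmed sample record are assumptions to verify against a real catalog file.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the catalog files.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"
PGTERMS = "http://www.gutenberg.org/2009/pgterms/"


def parse_ebooks(rdf_xml: str) -> list[dict]:
    """Extract the ebook number, title, and language codes from catalog RDF/XML."""
    root = ET.fromstring(rdf_xml)
    books = []
    for ebook in root.iter(f"{{{PGTERMS}}}ebook"):
        about = ebook.get(f"{{{RDF}}}about", "")
        languages = [
            value.text
            for language in ebook.findall(f"{{{DCTERMS}}}language")
            for value in language.iter(f"{{{RDF}}}value")
            if value.text
        ]
        books.append({
            "id": about.rsplit("/", 1)[-1],  # "ebooks/1342" -> "1342"
            "title": ebook.findtext(f"{{{DCTERMS}}}title", default=""),
            "languages": languages,
        })
    return books


# Trimmed sample record in the assumed layout.
SAMPLE_RDF = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:dcterms="{DCTERMS}" xmlns:pgterms="{PGTERMS}">
  <pgterms:ebook rdf:about="ebooks/1342">
    <dcterms:title>Pride and Prejudice</dcterms:title>
    <dcterms:language>
      <rdf:Description><rdf:value>en</rdf:value></rdf:Description>
    </dcterms:language>
  </pgterms:ebook>
</rdf:RDF>"""

print(parse_ebooks(SAMPLE_RDF))
```

Because this works on the downloaded catalog file rather than the website, it avoids the automated page access that Project Gutenberg blocks.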

Digital Public Library of America

Digital Public Library of America (DPLA)
DPLA connects people to America’s libraries, archives, museums, and other cultural heritage institutions. Materials found through DPLA—photographs, books, maps, news footage, oral histories, personal letters, museum objects, artwork, government documents, and more—are free and immediately available in digital format.
DPLA API Codex
Data Mining Instructions: Request an API key to gain access to the DPLA API. Data delivered in JSON-LD format.
Post45 Data Collective
Collections of book data: HathiTrust Fiction, Iowa Writers' Workshop, NYT Hardcover Fiction Bestsellers
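The DPLA API request described above can be sketched as follows, assuming the items endpoint at `https://api.dp.la/v2/items`, the `q`, `page_size`, `page`, and `api_key` parameters, and JSON-LD results listed under `docs` with titles in `sourceResource.title`. The API key and sample response are placeholders.

```python
from urllib.parse import urlencode

DPLA_ITEMS = "https://api.dp.la/v2/items"


def dpla_items_url(query: str, api_key: str, page_size: int = 50, page: int = 1) -> str:
    """Build an item search URL; the key comes from DPLA's key request process."""
    params = {"q": query, "page_size": page_size, "page": page, "api_key": api_key}
    return f"{DPLA_ITEMS}?{urlencode(params)}"


def item_titles(response: dict) -> list[str]:
    """Collect titles from a results page; a title may be a string or a list."""
    titles = []
    for doc in response.get("docs", []):
        title = doc.get("sourceResource", {}).get("title")
        if isinstance(title, str):
            titles.append(title)
        elif isinstance(title, list):
            titles.extend(str(t) for t in title)
    return titles


# Hypothetical results page with placeholder values.
sample_response = {
    "count": 2,
    "docs": [
        {"sourceResource": {"title": "Placeholder letter"}},
        {"sourceResource": {"title": ["Placeholder map", "Alternate title"]}},
    ],
}

print(dpla_items_url("weasels", api_key="YOUR_API_KEY", page_size=10))
print(item_titles(sample_response))
```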