JSTOR is a collection of research articles and books dating back to the earliest publications in humanities fields, especially language, literature, history, and philosophy. JSTOR allows for the creation and download of large amounts of text content.
Wiley's preferred access solution for TDM is the Crossref Text and Data Mining Service. Academic subscribers can register with Crossref and will then be able to access subscribed content once they have accepted the Wiley click-through TDM license and received an API token.
Academic subscribers can perform TDM under license on subscribed content for non-commercial purposes at no extra cost. Wiley prefers that access to content for TDM purposes take place through an approved API service.
To mine full-text content hosted on ScienceDirect, you will need to use Elsevier's API to download content in a format specialized for text mining. You can access the API via Elsevier's developer portal.
JAMA Network provides individuals who have access to our content the ability to download aggregated metadata. Note: Data access is restricted to the JAMA Network journals that SMU subscribes to. To see what we subscribe to, search in our catalog for items with JAMA in the title.
arXiv is an open-access archive for scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
Data Mining Instructions: Use the arXiv API to access arXiv data, search, and linking facilities. The API can only be used to download metadata, not full-text. No key is required.
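As a minimal sketch of a metadata request, the Python below builds an arXiv API query URL and parses an Atom feed of the kind the API returns. The function names and the sample feed are hypothetical and for illustration only. The endpoint and the search_query, start, and max_results parameters follow the public arXiv API documentation as we understand it, so check that documentation before relying on them.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the arXiv API query endpoint, as described in arXiv's API docs.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used in the feed


def build_query_url(search_query, start=0, max_results=10):
    """Build a metadata query URL (hypothetical helper; no API key is needed)."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def fetch(url):
    """Download the Atom feed for a query URL. Not called here; needs network access."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_entries(atom_xml):
    """Extract id, title, and author names from each <entry> in an Atom feed."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        entries.append({
            "id": entry.findtext(ATOM + "id", default="").strip(),
            # Titles may be wrapped across lines, so collapse the whitespace.
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
            "authors": [
                author.findtext(ATOM + "name", default="").strip()
                for author in entry.findall(ATOM + "author")
            ],
        })
    return entries


# Made-up sample feed so the parser can be tried without network access.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A Made-Up
      Example Title</title>
    <author><name>Ada Example</name></author>
    <author><name>Bo Sample</name></author>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(build_query_url("all:text mining", max_results=5))
    print(parse_entries(SAMPLE_FEED))
```

To run a live query, pass the result of build_query_url to fetch and then to parse_entries. Keep requests infrequent and small, in line with arXiv's terms of use.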
PLoS offers two APIs for data retrieval. The Article-Level Metrics API retrieves data regarding an article’s usage statistics to demonstrate its reach. The Search API provides the ability to query PLoS content across their journals. Data delivered in XML or JSON format. An API key is required to access either API.
The PMC Open Access Subset includes millions of journal articles and preprints that are made available under license terms that allow reuse. Not all articles in PMC are available for text mining or other reuse; many are under copyright.
The Biodiversity Heritage Library is an online collection of scientific texts focused on natural history, biology, botany, and other natural sciences. It contains both scholarly journal articles and books.
Their public API is a RESTful API for retrieving open access content published by BMC. Resources are represented in JSON and Prism Aggregate (PAM) formats.
Books From Digital Libraries
Libraries and archives make some digitized content available that can be used in text analysis. Due to copyright restrictions, the texts available are primarily texts created before the early twentieth century.
For digital humanities research, includes text mining tools
Current SMU students and employees with documented print disabilities are eligible to access additional materials in HathiTrust. Contact libraryaccess@smu.edu for more information.
The Data Capsule is a secure virtual environment that can be used for non-consumptive text analysis of HathiTrust Digital Library content, meaning that the text would not be able to be reproduced. When using the Data Capsule, the researcher requests an extraction of data at the end of the analysis. HathiTrust will strip the data of features that would allow the text to be reproduced. The extracted features datasets are completed examples of this method, and are freely available. Create an HTRC Analytics account, then sign up for the Data Capsule.
HathiTrust offers a few different tools. The Bibliographic API can retrieve small amounts of bibliographic records. The Data API can retrieve content such as page scans and OCR text. In-copyright works are available under special contract; otherwise, only public domain works can be retrieved with the Data API. Data delivered in JSON or XML format.
IA suggests using wget to download files from their site in bulk. There is no download limit, but they recommend downloading only 10,000 items per query to prevent errors. Data delivered in XML or JSON format.
Project Gutenberg offers free ebooks for public use. They offer works in many languages, but most are in English. All their ebooks are in the public domain, meaning the copyright has expired; the newest title was originally published in 1923.
Data Mining Instructions: Project Gutenberg states that they will block any perceived use of automated tools to access their site, with some exceptions. This link has information on how bulk downloads are allowed.
To download all eBook files, set up a personal mirror site.
To download only some eBook files, use wget software.
You may also download a complete catalog data file in RDF/XML directly from their site for metadata research purposes. Data delivered in RDF/XML or a compressed folder.
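For metadata research, a record from the RDF/XML catalog can be read with Python's standard library. The sketch below is illustrative only: the sample record is made up, and the element names and namespaces (pgterms:ebook, dcterms:title, and so on) are assumptions about the catalog layout that should be checked against an actual downloaded file.

```python
import xml.etree.ElementTree as ET

# Assumption: namespaces used in the Project Gutenberg RDF/XML catalog.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Made-up record shaped like a catalog entry, for illustration only.
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/0">
    <dcterms:title>Example Title</dcterms:title>
    <dcterms:creator>
      <pgterms:agent>
        <pgterms:name>Example, Author</pgterms:name>
      </pgterms:agent>
    </dcterms:creator>
    <dcterms:language>
      <rdf:Description>
        <rdf:value>en</rdf:value>
      </rdf:Description>
    </dcterms:language>
  </pgterms:ebook>
</rdf:RDF>"""


def parse_ebook_record(rdf_xml):
    """Return basic metadata from one RDF/XML record (hypothetical helper)."""
    root = ET.fromstring(rdf_xml)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        return None
    about_key = "{" + NS["rdf"] + "}about"  # the rdf:about attribute holds the ebook id
    return {
        "id": ebook.get(about_key),
        "title": ebook.findtext("dcterms:title", default="", namespaces=NS),
        "authors": [
            name.text
            for name in ebook.findall("dcterms:creator/pgterms:agent/pgterms:name", NS)
        ],
        "languages": [
            value.text
            for value in ebook.findall("dcterms:language/rdf:Description/rdf:value", NS)
        ],
    }


if __name__ == "__main__":
    print(parse_ebook_record(SAMPLE_RDF))
```

With a downloaded catalog, the same function could be applied to the text of each RDF file after extracting the compressed folder.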
DPLA connects people to America’s libraries, archives, museums, and other cultural heritage institutions. Materials found through DPLA—photographs, books, maps, news footage, oral histories, personal letters, museum objects, artwork, government documents, and more—are free and immediately available in digital format.