Research Guides: Research Data Management: Process & Analyze

Processing Data

Data must be processed before it can be analyzed. This can involve verifying, organizing, transforming, integrating, or extracting the data from its current form. The process phase is where problems with the data are identified and corrected.
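As a rough illustration of the verifying, organizing, and transforming steps described above, here is a minimal Python sketch. The column names (site, date, temp_c), the sample rows, and the process function are all hypothetical and stand in for whatever your own data looks like.

```python
import csv
import io

# Hypothetical raw data: column names and values are illustrative only.
RAW_CSV = """site,date,temp_c
A,2023-01-05,4.2
B,2023-01-05,
A,2023-01-06,41x
 c ,2023-01-06,3.9
"""

def process(raw_text):
    """Return (cleaned_records, problems) for a small CSV table."""
    cleaned, problems = [], []
    reader = csv.DictReader(io.StringIO(raw_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        site = row["site"].strip().upper()  # organize: normalize site codes
        value = row["temp_c"].strip()
        if not value:  # verify: flag missing measurements
            problems.append((line_no, "missing temp_c"))
            continue
        try:
            temp = float(value)  # transform: text to number
        except ValueError:
            problems.append((line_no, f"non-numeric temp_c: {value!r}"))
            continue
        cleaned.append({"site": site, "date": row["date"], "temp_c": temp})
    return cleaned, problems

cleaned, problems = process(RAW_CSV)
for line_no, problem in problems:
    print(f"line {line_no}: {problem}")
```

Keeping the problem rows in a separate list, rather than silently dropping them, makes it easier to decide later whether each one should be corrected at the source or excluded.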

Document your processing methods so that you can reuse your data and others can use it as well. Well-documented data is identifiable and usable, and research results based on it are more likely to be replicated and verified.
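One possible way to document processing as you go is to keep a simple log of each step. The sketch below is only an assumption about how such a log might look; the log_step function and its fields (description, rows_in, rows_out) are hypothetical names, not part of any standard.

```python
from datetime import datetime, timezone

processing_log = []  # one entry per processing step, in order

def log_step(description, rows_in, rows_out):
    """Record what a processing step did and how many records it affected."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "description": description,
        "rows_in": rows_in,
        "rows_out": rows_out,
    }
    processing_log.append(entry)
    return entry

# Hypothetical example step: drop records with a missing measurement.
values = [4.2, None, 3.9, None, 5.1]
kept = [v for v in values if v is not None]
log_step("Dropped records with missing temperature", len(values), len(kept))
```

A log like this can be saved alongside the processed files so that the path from raw data to analysis-ready data can be retraced.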

Analyzing Data

Analysis of data helps you to describe facts, detect patterns, develop explanations, and test hypotheses. It can also mean reviewing and evaluating whether data that has been created or acquired can be saved for long-term access and preservation. The process includes data quality assurance, statistical data analysis, modeling, and interpretation of analysis results.

Different techniques are used for data analysis, depending on the field of research. Some institutions use High Performance Computing systems to analyze huge volumes of data.

Data mining and data visualization are important techniques in this process, and there are various tools that are used. R and Python are among the most popular languages used for data analysis.
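As a small example of what descriptive analysis can look like in Python, the sketch below summarizes measurements by group using only the standard library. The observations and the summarize function are hypothetical; real projects often use dedicated analysis libraries instead.

```python
import statistics
from collections import defaultdict

# Hypothetical measurements as (site, temperature) pairs; values are illustrative only.
observations = [("A", 4.0), ("A", 5.0), ("A", 6.0), ("B", 10.0), ("B", 14.0)]

def summarize(rows):
    """Group values by site and compute count, mean, and sample standard deviation."""
    groups = defaultdict(list)
    for site, temp in rows:
        groups[site].append(temp)
    return {
        site: {
            "n": len(temps),
            "mean": statistics.mean(temps),
            "stdev": statistics.stdev(temps),  # needs at least two values per group
        }
        for site, temps in groups.items()
    }

summary = summarize(observations)
for site, stats in summary.items():
    print(site, stats)
```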

Documenting and Describing Data

Data must be properly documented in order to be understood, reused, and cited. Metadata is the term for the information that documents and describes data. Basic information that needs to be recorded includes:

Data collection: who, when, and why
Data interpretation information: experimental conditions, statistical sampling, calibration information
Data rights and responsibilities, including licensing (if the data is shared) or conditions of access (if access is restricted)

Metadata can also be created at the project level (broader) and at the dataset level (narrower). Examples of these are:

Project-level Documentation - the “who, what, where, when, how and why” of the dataset, context for understanding why the data were collected and how data were used.

Name of project
Principal investigator and collaborators
Context of data collection (geographic location, date of collection, etc.)
Data collection methods
Structure, organization of data files
Data sources used
Data validation, quality assurance
Transformations of data from the raw data through analysis
Information on confidentiality, access & use conditions
Project sponsor (if any)
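The project-level items above could be captured in a machine-readable record and checked for completeness. This is a minimal sketch under assumptions: the field names, the placeholder values, and the missing_fields function are all hypothetical and do not follow any particular metadata standard.

```python
import json

# Hypothetical field names mirroring the project-level list above.
REQUIRED_FIELDS = [
    "project_name", "principal_investigator", "collection_context",
    "collection_methods", "file_structure", "data_sources",
    "quality_assurance", "transformations", "access_conditions",
]
OPTIONAL_FIELDS = ["collaborators", "sponsor"]  # may be absent or empty

# Placeholder values for illustration only.
project_metadata = {
    "project_name": "Example stream temperature survey",
    "principal_investigator": "Name of PI",
    "collaborators": [],
    "collection_context": "Example watershed, January 2023",
    "collection_methods": "Handheld probe, two readings per site",
    "file_structure": "One CSV file per sampling date in data/raw/",
    "data_sources": "Field measurements only",
    "quality_assurance": "Out-of-range and missing values flagged",
    "transformations": "Site codes normalized; text converted to numbers",
    "access_conditions": "No confidential data; open license",
}

def missing_fields(record):
    """Return required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]

print("Missing:", missing_fields(project_metadata))
readme_text = json.dumps(project_metadata, indent=2)  # could be saved with the data
```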

Dataset documentation - more detail about the data and the dataset

Variable names and descriptions
Explanation of codes and classification schemes used
Algorithms used to transform data
Data acquisition details
File format and software (including version) used
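Dataset-level documentation is often kept as a data dictionary (sometimes called a codebook). The sketch below assumes a hypothetical dictionary structure and a hypothetical undocumented_columns function that checks whether every column in a CSV file has an entry; the variable names and codes are illustrative only.

```python
import csv
import io

# Hypothetical data dictionary: one entry per variable in the dataset.
data_dictionary = {
    "site": {
        "description": "Sampling site code",
        "codes": {"A": "Upstream", "B": "Downstream"},
    },
    "date": {"description": "Collection date (YYYY-MM-DD)"},
    "temp_c": {"description": "Water temperature", "unit": "degrees Celsius"},
}

def undocumented_columns(csv_text, dictionary):
    """Return column names from the CSV header that have no dictionary entry."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if name not in dictionary]

# Illustrative file contents with one column missing from the dictionary.
sample = "site,date,temp_c,notes\nA,2023-01-05,4.2,ok\n"
undocumented = undocumented_columns(sample, data_dictionary)
print("Undocumented columns:", undocumented)
```

Running a check like this whenever the data files change is one way to keep the documentation and the data in step.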

Open Source Option

OpenRefine
This is a free, open source tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data.
OpenRefine Demo
OpenRefine Step-by-Step Tutorial