Manage

Overview

Note that properly managing data does not necessarily equate to sharing or publishing those data, but it is a good idea to complete the data lifecycle by sharing and publishing them.

Types of Data

Research Data Defined

One definition of research data is “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings” (OMB Circular A-110). Across all agencies, including the NSF, the definition of research data does not mean summary statistics or tables; rather, it means the data on which summary statistics and tables are based.


Research Data Examples

  • Documents (text, Word), spreadsheets, printouts
  • Laboratory notebooks, field notebooks, diaries
  • Questionnaires, transcripts, codebooks
  • Audio, video
  • Photographs, films, x-rays, negatives
  • Protein or genetic sequences
  • Spectra, spectroscope data
  • Test responses
  • Slides, artifacts, specimens, samples
  • Collection of digital objects acquired and generated during the process of research
  • Database contents (video, audio, text, images)
  • Models, algorithms, scripts, code, software
  • Contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
  • Methodologies and workflows
  • Standard operating procedures and protocols
  • Computers and computer data storage devices
  • Synthetic compounds
  • Organisms, cell lines, viruses, cell products
  • Cloned coordinates, plants, animals

Exclusions

Some kinds of data might not be sharable due to the nature of the items themselves, or to ethical and privacy concerns. As defined by the OMB, research data do not include:

  • Preliminary analyses
  • Drafts of scientific papers
  • Plans for future research
  • Peer reviews
  • Communications with colleagues
  • Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law
  • Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study

Additionally, research data management is not records management for projects or university business data, and therefore does not cover such items as:

  • Correspondence (electronic mail and paper-based correspondence)
  • Project files
  • Grant applications
  • Ethics applications
  • Technical reports
  • Research reports
  • Signed consent forms
  • Results of compliance reviews (export controls and human subjects)
  • Software to read proprietary vendor data formats

File Naming and Versioning

Overview

Plan the directory structure and file naming conventions before creating your data. Plan for version tracking of datasets and documents. Use project-specific conventions or disciplinary standards or best practices. The following are general best practices.


Organizational Tips

  • Decide upon a convention and stick to it. Always include the same information.
  • Consider organizing directories or folders by date, date/time, place, instrument, project, type of data, variable name or a combination of these using a hierarchical directory structure
  • The same applies to filenames. If you organize directories by place and date/time, then filenames might be organized by type of data and variable name
  • Test your organizational structure on team members before implementing. Does it make sense to them too or is there confusion?
  • Consider organizational structures that will help you later decide which data are the most important to deposit and make publicly accessible to others
  • Consider what structure will make it easier to programmatically walk through your data
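
As a rough illustration of why a consistent hierarchy helps with programmatic access, the sketch below walks a tree and turns each file's location into a record. The layout <root>/<place>/<date>/<filename> and the function name index_data_files are assumptions made for this example only; adapt them to whatever convention your project adopts.

```python
from pathlib import Path


def index_data_files(root):
    """Yield one record per file stored as <root>/<place>/<date>/<filename>.

    The three-level layout is a hypothetical convention chosen for this
    sketch, not a requirement of any repository.
    """
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if len(parts) != 3:
            continue  # skip files that sit outside the assumed layout
        place, date, filename = parts
        yield {"place": place, "date": date, "file": filename}
```

Because every file follows the same pattern, the place and date can be read from the path itself rather than from a separate lookup table.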

Directory and File Naming Conventions

  • When using date information, use the YYYY-MM-DD format over other formats
  • Keep file and folder names shorter than 32 characters
  • Include relevant information like unique identifiers, project name, grant numbers or research data names
  • Try to name runs of an experiment sequentially
  • Use software application-specific 3-letter file extensions and lowercase them: mov, tif, wrl
  • When using sequential numbering, make sure to use leading zeros to allow for multi-digit versions. For example, a sequence of 1-10 should be numbered 01-10; a sequence of 1-100 should be numbered 001-100 (001, 010, 100).
  • Do not use special characters: & , * % # ; ( ) ! @ $ ^ ~ ‘ { } [ ] ? < > –
  • Use only one period, placed before the file extension (e.g. name_paper.doc and NOT name.paper.doc OR name_paper..doc)
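
The conventions above can be checked mechanically. The following is a minimal sketch, not an official tool: the helper names build_name and is_valid_name, the project_date_runNNN pattern, and the choice to allow only letters, digits, underscores and hyphens are all assumptions made for illustration.

```python
import re
from datetime import date

# Letters, digits, underscore and hyphen, then exactly one period
# followed by a lowercase 3-character extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[a-z0-9]{3}")
MAX_LENGTH = 31  # names should be shorter than 32 characters


def build_name(project, when, run, total_runs, ext):
    """Return e.g. 'soilph_2024-03-05_run007.tif' (hypothetical pattern)."""
    width = len(str(total_runs))  # leading zeros sized to the largest run number
    return f"{project}_{when.isoformat()}_run{run:0{width}d}.{ext.lower()}"


def is_valid_name(name):
    """Check the length limit, the allowed characters, and the single period."""
    return len(name) <= MAX_LENGTH and SAFE_NAME.fullmatch(name) is not None
```

For example, build_name("soilph", date(2024, 3, 5), 7, 100, "TIF") gives soilph_2024-03-05_run007.tif, while is_valid_name rejects both name.paper.doc and name_paper..doc.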

Storage and Backup

Overview

In general, it is the responsibility of the researcher to include proposal monies for computing resources, storage, backup, preservation, high performance computing or use of the Mines Institutional Repository. Below are backup and storage options available to Mines researchers.


Active Project Storage Provided by CCIT

ADIT drives
  • Suitable for day-to-day office and student work
  • Suitable for some research data needs, but storage is typically on the order of ~50 gigabytes (GB) per user
  • Generally accessible on the servers of Hornet and Talon (Windows file servers) and Fermat and Isengard (Linux file servers)
  • Drives are backed up

OreBits Storage Service
  • Suitable for most research data needs
  • Accessible to those having a Mines Multipass ID
  • Granular permissions
  • Purchased in increments of a terabyte
  • Customizable options based on need (single-copy, replicates, backup)

Backup Services (for lab computers, laptops)

Currently, the best backup option for your research lab computers or laptops is to use built-in Mac and Windows software. Use external drives/disks designed for backups as opposed to portable media like USB drives, DVDs or CDs. These external drives should be at least 50% larger than the data/disk you want to back up.

  • Mac: use the built-in Time Machine. Create both a backup disk and offsite backup disk.
  • Windows: use Windows backup. Save to your hard drive and another medium.
  • UNIX/Linux: use rsync with an external USB disk
  • Purchase of cloud computing services in accordance with the Mines Cloud Computing Policy (under development)
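
The sizing guideline for external backup drives (at least 50% larger than the data) is simple arithmetic. The sketch below shows one way to check it; the function names are hypothetical, and comparing against the used space on the source volume is an assumption about how to measure "the data/disk you want to back up".

```python
import shutil


def min_backup_capacity(data_bytes):
    """Smallest drive size in bytes that is at least 50% larger than the data."""
    return (3 * data_bytes + 1) // 2  # 1.5 x data, rounded up


def drive_is_large_enough(source_path, backup_path):
    """Compare used space on the source volume with the backup drive's capacity."""
    source = shutil.disk_usage(source_path)
    backup = shutil.disk_usage(backup_path)
    return backup.total >= min_backup_capacity(source.used)
```

For 2 TB of data, this rule calls for a backup drive of at least 3 TB.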

End-of-Project Storage (Mines Institutional Repository)

  • Suitable for end-of-project deposit of materials in order to meet requirements of making data publicly accessible
  • Suitable for small to medium sized datasets
  • Space is limited; select the highest priority items
  • Cost is $1000 per terabyte (TB)
  • Free for deposits under 10 gigabytes (GB)
  • Creates backup copies

High-Performance Computing (HPC)

  • Suitable for active projects only; not available as permanent storage
  • Either purchase your own node or use the supercomputer
  • See HPC for more information
Metadata Standards

Overview

Metadata is structured and descriptive information about an item or object. It is a standardized way to explain the who, what, where, when and how of data creation and methods. Metadata and other documentation enable the researcher to understand their data in detail and enable other researchers to discover, use and properly cite the item or object. Metadata standards have been created to facilitate the description of research data using a defined set of elements, and some disciplines have preferred metadata standards.

Data repositories may have specific metadata standard requirements that must be met in order to deposit data. If you intend to deposit your data in a subject- or discipline-specific repository, check their deposit and metadata requirements before including the repository in your data management plan.


Common Metadata Standards

  • Dublin Core (used by the Mines institutional repository): a general standard, can be adapted for specific disciplines
  • FGDC (Federal Geographic Data Committee): used by many Federal agencies for geospatial data; some tools are available
  • MODS (Metadata Object Description Schema): richer than Dublin Core and can be used for a variety of purposes
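  • PREMIS (Preservation Metadata: Implementation Strategies): preservation metadata
  • METS (Metadata Encoding and Transmission Standard): includes descriptive, technical and rights metadata, plus some preservation fields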
  • DIF (Directory Interchange Format): for earth science data
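
To give a sense of what a record in a "defined set of elements" looks like, here is a small sketch that builds a Dublin Core description with Python's standard library. The element names (title, creator, subject, date, type, language) and the namespace come from the Dublin Core Metadata Element Set; the wrapping metadata element, the function name and all field values are placeholders invented for this example, and this is not the ingest format of any particular repository.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a <metadata> element with one dc:* child per (element, value) pair."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Placeholder values for illustration only.
record = dublin_core_record([
    ("title", "Example soil pH survey"),
    ("creator", "Doe, Jane"),
    ("subject", "soil chemistry"),
    ("date", "2024-03-05"),
    ("type", "Dataset"),
    ("language", "en"),
])
xml_text = ET.tostring(record, encoding="unicode")
```

Each pair becomes an element such as dc:title, so the same description can be read by any tool that understands Dublin Core.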

Metadata Creation

Overview

As the custodian of the primary data, the researcher should ensure project data are properly documented in order to facilitate current use and enable future discovery and sharing. As early as you can, document your data and your data organization protocol, even before data collection begins; doing so will make data documentation easier and reduce the likelihood that you will forget aspects of your data later in the research project.

The following is a list of elements and aspects of your research project and data that should be documented, regardless of discipline. At minimum, this information should be stored in a readme.txt file or the equivalent, together with the data. The Mines Research Support Services group uses this documentation to create the required metadata for the Mines institutional repository.


– Elements marked with an * are required by the Mines institutional repository.
– See the Deposit with Mines page to understand the submittal process


General Information

  • *Title: name of the dataset or research project that produced it
  • *Creator: names and addresses of the organization or people who created the data
  • Identifier: number used to identify the data, even if it is just an internal project reference number
  • *Researcher identifier: a unique and persistent digital identifier that distinguishes you from every other researcher or author; requires registration with ResearcherID or ORCID
  • *Abstract: a concise description or summary of the dataset
  • *Subject: keywords or phrases describing the subject or content of the data; these are additional search terms that are not listed in the abstract
  • *Funders: name of the organizations or agencies who funded the research
  • *Award: the grant number(s) if the data was generated from work on a grant
  • *Rights: any known intellectual property rights held for the data (copyright)
  • Publication citations: any citations that describe or use the data

Data Characteristics

  • *Access information: if you deposited the data in a repository external to Mines, describe where and how the data can be accessed by other researchers
  • *Access restrictions: if there are restrictions on making the data openly accessible, indicate the nature of the restrictions and how long they need to be in place
  • *Language: language(s) of the intellectual content
  • *Dates: key dates (and times) associated with the data, including: funding period; project start and end date; release date; time period covered by the data (coverage); and other dates associated with the data lifespan, e.g., maintenance cycle, update schedule, date of last update
  • *Date of publication: the date the data was made available, created or compiled as an entity for use by others
  • *Location: spatial coverage of the data or sampling site information
  • Methodology: how the data was generated, including equipment or software used, experimental protocol, other things one might include in a lab notebook
  • *Data processing: during your research, record information on how the data has been altered or processed
  • Sources: citations to material for data derived from other sources, including details of where the source data is held and how it was accessed
  • Unit of analysis: the major entity that is being analyzed in the study
  • *Type: the dominant kinds of data; choose from Collection, Event, Image, Moving Image, Physical Object, Software, Sound, Text

File Characteristics

  • Count: total number of files
  • *Size: how much space the dataset requires on a computer server
  • *File names: list of all data files associated with the project, with their names and file extensions (e.g. ‘NWPalaceTR.WRL’, ‘stone.mov’)
  • *File formats: format(s) of the data, e.g. FITS, SPSS, HTML, JPEG, and any software required to read the data
  • *File structure: organization of the data file(s) and the layout of the variables, when applicable
  • *Variable list: list of variables in the data files, when applicable
  • *Code lists: explanation of codes or abbreviations used in either filenames or the variables in the data files (e.g. ‘999 indicates a missing value in the data’)
  • *Versions: date/time stamp for each file, and use a separate ID for each version
  • Checksums: to test if the files have changed over time
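
Several of the file characteristics above (count, size, file names, checksums) can be gathered automatically and pasted into a readme.txt. The following is a minimal sketch; the function names, the output layout and the choice of SHA-256 as the checksum algorithm are assumptions, since the guidance above does not prescribe any of them.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_manifest(root):
    """Return readme-style text listing count, total size, names and checksums."""
    root = Path(root)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    lines = [
        f"Count: {len(files)}",
        f"Size: {sum(p.stat().st_size for p in files)} bytes",
        "File names (path, bytes, SHA-256):",
    ]
    for p in files:
        rel = p.relative_to(root).as_posix()
        lines.append(f"  {rel}  {p.stat().st_size}  {sha256_of(p)}")
    return "\n".join(lines)
```

Re-running the function later and comparing the checksums is one way to test whether any file has changed over time.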

Metadata Examples

Overview

To deposit in the Mines institutional repository, metadata (a description of the item/object) is required. The following example is for a dataset and is typical of the information that needs to be gathered in order to make a deposit.


Author: Lauenroth, William K.

Title: SGS-LTER Bouteloua gracilis Removal Experiment Vegetation Data (ARS #155) on the Central Plains Experimental Range, Nunn, Colorado, USA 1997-2008

Keywords: populations ; blue grama ; population dynamics ; density ; plants ; disturbance

Abstract: Six sites approximately 6 km apart were selected at the Central Plains Experimental Range in 1997. Within each site, there was a pair of adjacent ungrazed and moderately summer grazed (40-60% removal of annual above ground production by cattle) locations. Grazed locations had been grazed from 1939 to present and ungrazed locations had been protected from 1991 to present by the establishment of exclosures. Within grazed and ungrazed locations, all tillers and root crowns of B. gracilis were removed from two treatment plots (3 m x 3 m) with all other vegetation undisturbed. Two control plots were established adjacent to the treatment plots. Plant density was measured annually by species in a fixed 1m x 1m quadrant in the center of treatment and control plots. For clonal species, an individual plant was defined as a group of tillers connected by a crown (Coffin & Lauenroth 1988, Fair et al. 1999). Seedlings were counted as separate individuals. In the same quadrant, basal cover by species, bare soil, and litter were estimated annually using a point frame. A total of 40 points were read from four locations halfway between the center point and corners of the 1m x 1m quadrant. Density was measured from 1998 to 2005 and cover from 1997 to 2006. All measurements were taken in late June/early July.

Award: NSF Grant Number DEB-0823405.

Publisher: Colorado State University. Libraries

Date: 1997 – 2008

Type: Dataset ; Text ; Still image ; Metadata

Language: English

Spatial: The Short Grass Steppe site encompasses a large portion of the Colorado Piedmont Section of the western Great Plains. The extent is defined as the boundaries of the Central Plains Experimental Range (CPER). The CPER has a single ownership and land use (livestock grazing). The PNG is characterized by a mosaic of ownership and land use. Ownership includes federal, state or private and land use consists of livestock grazing or row-crops. There are NGO conservation groups that exert influence over the area, particularly on federal lands.

Referenced by: Munson, Seth M. (2009), Plant community and ecosystem change on conservation reserve program lands in northeastern Colorado. (Unpublished doctoral dissertation). Colorado State University. http://hdl.handle.net/10217/76822

Referenced by: Munson, S. M. and Lauenroth, W. K. (2009), Plant population and community responses to removal of dominant species in the shortgrass steppe. Journal of Vegetation Science, 20: 224–232. http://dx.doi.org/10.1111/j.1654-1103.2009.05556.x

Contributor: University of Wyoming. Dept. of Botany.

Rights: Data sets are open. Please include tag line in report or manuscript: Data sets were provided by the Shortgrass Steppe Long Term Ecological Research group, a partnership between Colorado State University, United States Department of Agriculture, Agricultural Research Service, and the U.S. Forest Service Pawnee National Grassland. Significant funding for these data was provided by the National Science Foundation Long Term Ecological Research program.