Sustainable Data Formats


Overview

The file format in which data are stored and archived is a primary factor in whether the data can be used in the future. As the custodian of the primary data, the researcher should adopt an orderly system of data organization and communicate it to all members of the research group and, where applicable, to the appropriate administrative personnel.
Standard file formats and standardized file naming are necessary to ensure that data can be uniquely identified and remain accessible for future use.

When selecting tools for storing your data and preparing it for archiving, pay special attention to the output formats of your data. Data stored in a proprietary or obsolete format may be unusable to other researchers.


Accessible Formats

Formats more likely to be accessible in the future are:

  • Non-proprietary
  • Open, documented standard
  • Common usage by research community
  • Standard representation (ASCII, Unicode)
  • Unencrypted
  • Uncompressed
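
One rough way to test the "standard representation" criterion is to check whether a file's bytes decode cleanly as ASCII or UTF-8. The following is a minimal sketch, not a definitive test: the function name `detect_text_encoding` is hypothetical, and a clean decode does not prove that the file was actually written in that encoding.

```python
def detect_text_encoding(raw: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown' for a byte string.

    Hypothetical helper: it only reports whether the bytes decode
    without error, trying the stricter encoding (ASCII) first.
    """
    for encoding in ("ascii", "utf-8"):
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "unknown"
```

A file reported as "unknown" may be binary or may use a legacy character set, and deserves a closer look before archiving.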

Preferred Formats (general)

  • PDF/A - not Microsoft Word
  • ASCII - not Microsoft Excel
  • MPEG-4 - not QuickTime
  • TIFF or JPEG 2000 - not GIF or JPG
  • XML or RDF - not RDBMS
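
The general list above can be turned into a quick check of a project folder before deposit. The sketch below is illustrative only: the extension-to-format mapping is an assumption derived from the list (the extensions themselves are not given in this guide), and `flag_discouraged` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed mapping: discouraged file extension -> preferred format
# from the general list. Adjust the extensions to match your own files.
PREFERRED_ALTERNATIVE = {
    ".doc": "PDF/A",
    ".docx": "PDF/A",
    ".xls": "ASCII (delimited text)",
    ".xlsx": "ASCII (delimited text)",
    ".mov": "MPEG-4",
    ".gif": "TIFF or JPEG 2000",
    ".jpg": "TIFF or JPEG 2000",
}


def flag_discouraged(filenames):
    """Return {filename: suggested format} for files in discouraged formats."""
    flagged = {}
    for name in filenames:
        suffix = Path(name).suffix.lower()  # compare case-insensitively
        if suffix in PREFERRED_ALTERNATIVE:
            flagged[name] = PREFERRED_ALTERNATIVE[suffix]
    return flagged
```

Files that are not flagged are not necessarily suitable for preservation; the detailed tables below list more cases than this sketch covers.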

Preferred Formats (detailed)

The tables below list different data types and the preferred formats for long-term preservation of each. Other acceptable formats are also listed, but they may not ensure long-term preservation of the data. Not all of these formats are accepted in the Mines institutional repository.


Digital Audio Data
Preferred Formats
  • Free Lossless Audio Codec (FLAC) (.flac)
  • Waveform Audio Format (WAV) (.wav)
  • MPEG-1 Audio Layer 3 (.mp3) - spoken word audio only
Other Acceptable Formats
  • MPEG-1 Audio Layer 3 (.mp3)
  • Audio Interchange File Format (AIFF) (.aif)

 


Digital Image Data
Preferred Formats
  • TIFF version 6 uncompressed (.tif)

Viewers: OMERO, for converting, viewing, and managing metadata of biological microscope slides and other TIFF files.

Other Acceptable Formats
  • JPEG (.jpeg, .jpg), but only if created in this format
  • TIFF (other versions) (.tif, .tiff)
  • JPEG 2000 (.jp2, .jpm)
  • Adobe Portable Document Format (PDF/A, PDF) (.pdf)
  • Photoshop files (.psd)
  • Standard applicable RAW image (.raw)

 


Digital Video Data
Preferred Formats Other Acceptable Formats
  • MPEG-4 High Profile (.mp4)
  • Motion JPEG 2000 (.mj2)
  • JPEG 2000 (.jp2, .jpm)

 


Chemistry Data: spectroscopy data; plots with contours, peak position and intensity
Preferred Formats

Convert NMR, IR, Raman, UV, and mass spectrometry files to JCAMP format for ease of sharing.

JCAMP file viewers: JSpecView, ChemDoodle

 


Geospatial Data: vector and raster
Preferred Formats Other Acceptable Formats
  • ESRI Shapefile (.shp, .shx, .dbf; optional: .prj, .sbx, .sbn)
  • geo-referenced TIFF (.tif, .tfw)
  • CAD data (.dwg)
  • tabular GIS attribute data
  • Keyhole Mark-up Language (KML) (.kml)
  • ESRI Geodatabase format (.mdb)
  • MapInfo Interchange Format (.mif) for vector data
  • Adobe Illustrator (.ai)
  • CAD data (.dxf, .svg)
  • Binary formats of GIS and CAD packages
     

 


Qualitative Data: textual
Preferred Formats Other Acceptable Formats
  • eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xsd)
  • Rich Text Format (.rtf)
  • plain text data, UTF-8 (unicode) (.txt)
  • plain text data, ASCII (.txt)
  • Hypertext Mark-up Language (HTML) (.html)
  • widely-used proprietary formats, e.g. MS Word (.doc/.docx)
  • LaTeX (.tex)
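
Before archiving XML text, it can help to confirm at least that each file is well-formed. The sketch below uses only the Python standard library and checks well-formedness, not validity: validating against a DTD or an .xsd schema needs a separate tool, which is outside this example. The helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise.

    This does NOT validate against a DTD or schema; it only checks
    that tags are properly nested and closed.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```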

 


Quantitative Data: tabular data with extensive metadata
Preferred Formats Other Acceptable Formats
  • SPSS portable format (.por)
  • delimited text and command ('setup') file (SPSS, Stata, SAS, etc.) containing metadata information
  • structured text or mark-up file containing metadata information, e.g. DDI XML file
  • delimited text of given character set (only characters not present in the data should be used as delimiters) (.txt)
  • MS Access (.mdb/.accdb)
  • proprietary formats of statistical packages (e.g. SPSS .sav or Stata .dta)

In this case, the table contains the matrix of data plus metadata that has labels for variables, code labels and defined missing values.
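
As a rough illustration of keeping the data matrix and its metadata together, the sketch below writes a delimited-text table and a small XML file holding variable labels, code labels, and missing values. The XML element names (`codebook`, `variable`, `code`, `missing`) are hypothetical and are not DDI; a real deposit would follow a documented schema such as DDI.

```python
import csv
import io
import xml.etree.ElementTree as ET


def build_metadata_xml(variables):
    """Build an XML string describing each variable.

    `variables` maps a variable name to a dict with 'label', optional
    'codes' (value -> code label) and optional 'missing' (list of values).
    Element names are illustrative only, not a DDI structure.
    """
    root = ET.Element("codebook")
    for name, info in variables.items():
        var = ET.SubElement(root, "variable", {"name": name})
        ET.SubElement(var, "label").text = info["label"]
        for value, code_label in info.get("codes", {}).items():
            ET.SubElement(var, "code", {"value": str(value)}).text = code_label
        for value in info.get("missing", []):
            ET.SubElement(var, "missing", {"value": str(value)})
    return ET.tostring(root, encoding="unicode")


def build_delimited_text(header, rows, delimiter="\t"):
    """Return the data matrix as delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example data, for illustration only.
example_variables = {
    "sex": {
        "label": "Sex of respondent",
        "codes": {1: "female", 2: "male"},
        "missing": [-9],
    }
}
example_xml = build_metadata_xml(example_variables)
example_table = build_delimited_text(["id", "sex"], [[1, 2], [2, -9]])
```

Storing the two outputs side by side (for example as a .txt file and an .xml file with the same base name) keeps the labels and missing-value definitions readable without the original statistical package.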


Quantitative Data: tabular data with minimal metadata
Preferred Formats Other Acceptable Formats
  • comma-separated values (CSV) file (.csv)
  • tab-delimited file (.tab) including delimited text of a given character set with SQL data definition statements where appropriate
  • eXtensible Mark-up Language (.xml) according to appropriate Document Type Definition (DTD) or schema (.xsd)
  • Rich Text Format (.rtf)
  • Plain text data, ASCII (.txt)
  • delimited text of given character set (only characters not present in the data should be used as delimiters) (.txt)
  • MS Word (.doc/.docx)
  • MS Access (.mdb/.accdb)
  • MS Excel (.xls/.xlsx)
  • OpenDocument Spreadsheet (.ods)
  • dBase (.dbf)

In this case, the table contains the matrix of data but may or may not have column headings or variable names, and probably has no other metadata or labeling.
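
The delimiter rule in the lists above (use only characters that do not occur in the data) can be checked mechanically. The following is a minimal sketch under stated assumptions: the function name `pick_delimiter` and the candidate list are hypothetical choices, not part of any standard.

```python
def pick_delimiter(rows, candidates=(",", "\t", ";", "|")):
    """Return the first candidate character that appears in no cell of `rows`.

    Raises ValueError if every candidate occurs somewhere in the data.
    """
    cells = [str(cell) for row in rows for cell in row]
    for candidate in candidates:
        if not any(candidate in cell for cell in cells):
            return candidate
    raise ValueError("every candidate delimiter occurs in the data")
```

For example, data containing commas in one cell and semicolons in another would fall through to the tab character. If no candidate is safe, quoting the fields (as CSV writers do) is the usual alternative.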


Scripts and Computer Code
Preferred Formats
Work directly with Research Support Services for the latest information.

 


Documentation
Preferred Formats Other Acceptable Formats
  • Open Document Text (.odt)
  • Rich Text Format (.rtf)
  • HTML (.htm, .html)
  • PDF/A or PDF (.pdf)
  • plain text (.txt)
  • widely-used proprietary formats, e.g. MS Word (.doc/.docx) or MS Excel (.xls/.xlsx)
  • eXtensible Mark-up Language (.xml) according to appropriate Document Type Definition (DTD) or schema (.xsd)

 


 

Above tables adapted from: