Research data management

Preparing data for sharing

Data selection

The first step in the process of preparing research data for sharing is data selection.

If a project has produced a large volume of data, the question may arise which data should be made openly available, given the project's limited budget. There are a few important points to consider.

What data are we required to make available?
Such obligations most often stem from the policies of the research institution, the funding body or a journal. We may also have committed to sharing specific resources in a data management plan.

What data must not be shared?
Restrictions may result from applicable law or from a contract we have concluded.

What data can we not afford to share?
If it is beyond our budget to adequately process and document the data, it will not be made available.

Circumstances may change: data that cannot be made available now may become shareable in the future. It is therefore worth preserving the data and its existing documentation. When deciding what to preserve, consider the scientific or historical value of the collected data, its uniqueness, whether it could be collected again, and the cost of re-creating it.

File formats

It is necessary to ensure that files with data and documentation are made available in a suitable format.

Most repositories – including the UW Research Data Repository – recommend open file formats, which can be opened and analysed with open software. Because the specifications of such formats are openly documented, the files should remain convertible to newer formats, including ones that do not yet exist, even after many years.

Sharing data in open formats does not preclude day-to-day work in closed (commercial) formats, which are often very popular. What matters is that data being prepared for sharing is converted to open formats and made available as such.

It is also acceptable to deposit data in two formats: closed (but often popular) and open. This arrangement makes it easier to use the data, both for those who prefer the popular closed formats and have the appropriate software, and for those who may find it easier to work with open formats.

File format recommendations

Recommended file formats and guidelines for the preparation of tabular data

Type of data | Formats | Tips
Text
  • txt
  • odt
  • html
  • xml
  • programming language native formats
If a file contains code that relies on paid libraries, convert it to a version that does not depend on them (a "vanilla" version). If this is impossible, list the libraries used in the dataset description. A copy of the file saved as plain text (.txt) can also be added.
Image
  • png
  • jpeg2000
  • tiff
Audio
  • wav
Video
  • mkv
  • ogg
  • ogv
  • mp4 (acceptable)
  • mov (acceptable)
Video data is usually compressed, and this is desirable. If video data must be kept uncompressed, proprietary formats are normally used.
Archives
  • zip
CAD
  • step
The SLDPRT and IGS formats can be converted to STEP.
Tabular data
  • csv
  • tab
  • ods
  • rdata
  • sav/spv
CSV file preparation:
  • UTF-8 encoding,
  • text delimiter: quotation marks,
  • numeric variables should not be enclosed in quotation marks, otherwise they will be treated as text during automatic analysis,
  • decimal separator: full stop,
  • field separator: comma or semicolon,
  • variable names should appear in the first row only,
  • all non-empty columns must have unique names.
Spreadsheets
  • xlsx
  • ods
Any tabular file should:
  • consist of only one sheet,
  • contain only one table in a vertical layout, that is, each column used should contain one variable, and the individual rows should contain the values of the variables for one case,
  • not contain merged cells
  • not contain blank columns or rows: the table should start from cell A1,
  • contain only valid variable names in row 1,
  • from row 2 downwards contain only the values of the variables,
  • not contain comments, explanations of units and abbreviations used, descriptions of measurement conditions, etc. – these should be included in the file description, the dataset description, and/or in a dedicated readme.txt file attached to the dataset (template)
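
The CSV and table-layout guidelines above can be checked mechanically before deposit. The following is a minimal Python sketch using only the standard library, not a tool provided by the repository. The function names check_table, to_csv_text and save_csv, the sample values, and the choice of the comma as field separator and the double quotation mark as text delimiter are assumptions made for illustration.

```python
import csv
import io


def check_table(header, rows):
    """Return a list of layout problems; an empty list means none were found."""
    problems = []
    names = [str(name).strip() for name in header]
    if any(name == "" for name in names):
        problems.append("empty column name")
    if len(set(names)) != len(names):
        problems.append("duplicate column names")
    # Row 1 holds the variable names, so data rows are numbered from 2.
    for number, row in enumerate(rows, start=2):
        if len(row) != len(names):
            problems.append(f"row {number}: wrong number of values")
        elif all(str(value).strip() == "" for value in row):
            problems.append(f"row {number}: blank row")
    return problems


def to_csv_text(header, rows):
    """Serialise one table: text quoted, numbers unquoted, comma-separated."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_NONNUMERIC,  # quote text, leave int/float values bare
        lineterminator="\n",
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def save_csv(path, header, rows):
    """Write the table as UTF-8; refuse to write if the layout checks fail."""
    problems = check_table(header, rows)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(to_csv_text(header, rows))
```

For example, to_csv_text(["site", "temperature_c"], [["Warsaw", 21.5]]) yields a header line with quoted names followed by the line "Warsaw",21.5, so the numeric value stays unquoted and uses a full stop as the decimal separator.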

Metadata and documentation

The deposited data should be supported by appropriate metadata and documentation.

From a researcher’s perspective, adding metadata (information about the data) in the UW Research Data Repository means completing a form. Once entered, metadata can be downloaded from the repository in several common formats.

The dataset should also be accompanied by documentation containing all the information needed to understand and correctly interpret the data. Part of this documentation can be a README.txt file, which organises such content; a readme file template is available.

If the deposited data or metadata are not complete, the documentation should include the appropriate explanation.
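
As an illustration of how such documentation might be assembled, the sketch below builds the text of a README.txt from the kinds of information mentioned on this page: units, abbreviations, measurement conditions, and an explanation of any missing data. It is not the repository's readme template; the function name build_readme, the section headings and the sample values are hypothetical.

```python
def build_readme(title, units, abbreviations, conditions, known_gaps=None):
    """Assemble README text; the section names are illustrative assumptions."""

    def section(heading, body):
        # Underline each heading so the file stays readable as plain text.
        return f"{heading}\n{'-' * len(heading)}\n{body}\n"

    def listing(mapping):
        # One "name: explanation" line per entry, or "none" if empty.
        return "\n".join(f"{key}: {value}" for key, value in mapping.items()) or "none"

    parts = [
        title,
        "",
        section("Units", listing(units)),
        section("Abbreviations", listing(abbreviations)),
        section("Measurement conditions", conditions),
        section("Missing or incomplete data", known_gaps or "none reported"),
    ]
    return "\n".join(parts)


# Hypothetical usage; all values are invented for the example.
readme_text = build_readme(
    title="Example dataset",
    units={"temperature_c": "degrees Celsius"},
    abbreviations={"RH": "relative humidity"},
    conditions="Measured indoors, one reading per hour.",
)
```

The resulting string can then be saved as UTF-8 plain text and attached to the dataset alongside the data files.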

DOI, a persistent identifier

In the context of research data, the most important type of identifier is the DOI (digital object identifier).

From the researcher’s perspective, obtaining a DOI for a dataset practically comes down to choosing the right repository. In the UW Research Data Repository, a DOI is assigned automatically to each dataset when its draft version is saved.
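
Once a DOI has been assigned, it is usually cited as a resolvable link through the doi.org proxy. The sketch below shows one way to normalise a DOI and build that link in Python. The pattern is a loose plausibility check and not a full validation of DOI syntax, the function names are hypothetical, and 10.1234/example is a made-up identifier rather than a real dataset.

```python
import re
from urllib.parse import quote

# Loose heuristic: "10.", a numeric registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalise_doi(value):
    """Strip common prefixes and return the bare DOI, or raise ValueError."""
    doi = value.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a plausible DOI: {value!r}")
    return doi


def doi_to_url(value):
    """Build the resolver link for a DOI via the doi.org proxy."""
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return "https://doi.org/" + quote(normalise_doi(value), safe="/")
```

With the made-up identifier, doi_to_url("doi:10.1234/example") returns https://doi.org/10.1234/example.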