research data management
preparing data for sharing

Data selection
The first step in the process of preparing research data for sharing is data selection.
If a large volume of data has been produced in a project, the question arises which data should be made openly available, given the project's limited budget. There are a few important points to consider.
What data are we required to make available?
The source of such obligations will most often be the policies of the research institution, the funding body or a journal. We may also have committed to sharing specific resources in a data management plan.
What data must not be shared?
Restrictions may result from generally applicable law or from a contract we have concluded.
What data can we not afford to share?
If adequately processing and documenting the data is beyond our budget, the data cannot be made available.
Circumstances may change. Data which currently cannot be made available may become shareable in the future. It is therefore worth preserving the data itself and its existing documentation. When deciding what to preserve, consider the scientific or historical value of the collected data, its uniqueness, whether it could be collected again, and the cost of re-creating it.
File formats
It is necessary to ensure that files with data and documentation are made available in a suitable format.
Most repositories, including the UW Research Data Repository, recommend open file formats, which can be opened and analysed with open software. Because the documentation of these formats is open, converting the files to other formats, including ones that do not yet exist, should pose no problem even after many years.
Sharing data in open formats does not preclude ongoing work in closed (commercial) formats, which are often very popular. What matters is that data being prepared for sharing is converted to open formats and made available as such.
It is also acceptable to deposit data in two formats: closed (but often popular) and open. This arrangement makes it easier to use the data, both for those who prefer the popular closed formats and have the appropriate software, and for those who may find it easier to work with open formats.
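As a rough illustration of the conversion step described above, the sketch below lists the files in a dataset folder that are not yet in an open format and so may still need converting before deposit. The function name `files_needing_conversion` and the `OPEN_EXTENSIONS` allow-list are hypothetical; the list is only an example and is not the repository's official list of recommended formats.

```python
from pathlib import Path

# Hypothetical allow-list, for illustration only. Replace it with the
# formats your repository actually recommends.
OPEN_EXTENSIONS = {".csv", ".txt", ".json", ".xml"}


def files_needing_conversion(dataset_dir, open_extensions=OPEN_EXTENSIONS):
    """Return files whose extension is not on the allow-list, sorted by path."""
    root = Path(dataset_dir)
    return sorted(
        p
        for p in root.rglob("*")  # walk the folder recursively
        if p.is_file() and p.suffix.lower() not in open_extensions
    )
```

Calling `files_needing_conversion("my_dataset")` on a local folder returns the paths that may need an open-format copy. A file that is also kept in its closed format would still appear in the list, which is expected when depositing in both formats.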
File format recommendations
Recommended file formats and guidelines for the preparation of tabular data
Type of data | Formats | Tips |
---|---|---|
Text | | If a file contains code that relies on paid libraries, convert the file to a vanilla version. If this is impossible, list the libraries used in the dataset description. A copy of the file saved as plain text (.txt) can also be added. |
Image | | |
Audio | | |
Video | | Video data is usually compressed, and this is a desirable feature. If video data must remain uncompressed, proprietary formats will normally have to be used. |
Archives | | |
CAD | | The SLDPRT and IGS formats can be converted to STEP. |
Tabular data | | CSV file preparation: |
Spreadsheets | | Any tabular file should: |
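Preparing a CSV file for deposit can be sketched with Python's standard `csv` module. This is a minimal sketch and not the repository's own checklist: UTF-8 encoding, a comma delimiter, a single header row and the same number of fields in every row are common conventions assumed here, and the function name `write_csv` is hypothetical.

```python
import csv


def write_csv(path, header, rows):
    """Write rows to a UTF-8, comma-delimited CSV file with one header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            # Assumed convention: every row has exactly one value per column.
            if len(row) != len(header):
                raise ValueError(
                    f"row has {len(row)} fields, expected {len(header)}"
                )
            writer.writerow(row)
```

For example, `write_csv("measurements.csv", ["site", "value"], [["A", 1.5], ["B", 2.0]])` writes a header line followed by two data lines. Values that contain commas or quotation marks are quoted automatically by the `csv` module.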
Metadata and documentation
The deposited data should be supported by appropriate metadata and documentation.
From a researcher’s perspective, adding metadata (information about the data) in the UW Research Data Repository means completing a form. Once entered, metadata can be downloaded from the repository in several common formats.
The dataset should also be supported by documentation containing all the information necessary to understand and properly interpret the data provided. Part of the documentation can be a README.txt file, which organises this kind of content. A README file template is available.
If the deposited data or metadata are not complete, the documentation should include the appropriate explanation.
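One possible way to produce a consistent README.txt is to generate it from a few fields. The sketch below is only an illustration: the section names (`TITLE`, `AUTHORS`, `DESCRIPTION`, `FILES`, `KNOWN GAPS`) and the function `build_readme` are hypothetical and do not reproduce the repository's own README template. The optional `known_gaps` argument is where incomplete data or metadata would be explained.

```python
def build_readme(title, authors, description, files, known_gaps=None):
    """Return the text of a simple README.txt for a dataset.

    files maps each file name to a one-line note about its contents.
    """
    lines = [
        f"TITLE: {title}",
        f"AUTHORS: {', '.join(authors)}",
        "",
        "DESCRIPTION",
        description,
        "",
        "FILES",
    ]
    lines += [f"- {name}: {note}" for name, note in files.items()]
    if known_gaps:
        # Explain any missing or incomplete data or metadata here.
        lines += ["", "KNOWN GAPS", known_gaps]
    return "\n".join(lines) + "\n"
```

The returned string can be saved with `open("README.txt", "w", encoding="utf-8")` and deposited alongside the data files.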
DOI, a persistent identifier
In the context of research data, the most important type of identifier is the DOI (digital object identifier).
From the researcher’s perspective, obtaining a DOI for a dataset practically comes down to choosing the right repository. In the case of the UW Research Data Repository, a DOI is automatically assigned to each dataset at the stage of saving the draft version of this dataset.
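Once a DOI has been assigned, it can be turned into a persistent link by appending it to the `https://doi.org/` resolver address. The sketch below shows this; the helper name `doi_to_url` is hypothetical, and the DOI in the usage note is a made-up placeholder rather than a real dataset identifier. The validity check is deliberately loose and only tests that the string starts with `10.` and contains a `/`.

```python
from urllib.parse import quote


def doi_to_url(doi):
    """Turn a DOI (with or without a 'doi:' or resolver prefix) into a resolver URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Loose sanity check only: DOIs start with "10." and have a "/" separator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return "https://doi.org/" + quote(doi, safe="/")
```

With the placeholder value, `doi_to_url("doi:10.1234/example.5678")` returns `"https://doi.org/10.1234/example.5678"`. This is the form usually used when citing a dataset.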