An overview of the safedata package

The safedata R package is designed to discover and work with data using the formatting and indexing API designed for the Stability of Altered Forest Ecosystems (SAFE) Project. The SAFE Project is one the largest ecological experiments in the world, investigating the effects of human activities on biodiversity and ecosystem function in the Malaysian rainforest.

Research conducted at the SAFE Project encompasses expertise from many disciplines and institutions, each running interlinked projects that help to develop our understanding of ecology within changing environments. Data from the many activities conducted through the SAFE Project are curated and published to a community repository hosted at Zenodo.The safedata package enables researchers to quickly and easily interface with these datasets.

The SAFE Project dataset workflow

All researchers working at the SAFE Project are required to submit their project data to the SAFE Zenodo repository. There are the following 3 stages to the publication process, with further details provided below.

The data needs to be formatted to meet a community standard and to include core metadata required by the project.
The data is validated to ensure that the formatting and metadata comply with the required standard.
Data that passes validation is published to the Zenodo data repository.
Metadata is indexed at the SAFE Project website and is accessible through a search API.

Data format

Datasets are submitted as Microsoft Excel spreadsheets, containing the following worksheets:

Summary: simple metadata (authors, access rights) about the dataset
Taxa: a description of the Taxa reported in the dataset
Locations: a description of all sampling locations in the dataset
Data worksheets: the actual data tables, described in the Summary sheet

The formatting details for each worksheet are described here: https://safedata-validator.readthedocs.io/en/latest/data_format/overview.html

Once a dataset has been formatted, researchers can submit the dataset to the SAFE Project website for validation and publication.

Dataset validation

Submitted datasets are validated using the Python program (safedata_validator)[https://github.com/ImperialCollegeLondon/safedata_validator]. This checks that the dataset format is correct and that all the required metadata is provided and consistent. When a dataset fails validation, a report is returned to the submitter to help revise the dataset, otherwise validated datasets are published to Zenodo.

Data publication

Zenodo is a scientific data repository backed by the CERN. Zenodo provides data communities that allow all of the SAFE Project datasets to be collated into a single collection ((https://zenodo.org/communities/safe/). The publication process uses the metadata provided in the submitted file to automatically create a detailed description of the dataset. Zenodo also issues DOIs for published datasets and provides versioning and access control :

Data versioning

Zenodo uses a DOI versioning system that allows sets of dataset records to be grouped. When a dataset is published on Zenodo for the first time, two DOIs are registered:

the record DOI, representing the specific version of the dataset, and
the concept DOI, representing all versions of the record.

Subsequent versions of an upload are then logged under a new DOI. This means that multiple versions of a dataset can be stored and referenced clearly. The concept DOI refers to all versions of the dataset (and the DOI link resolves to the most recent version), while the version DOI refers to a particular instance.

Dataset access

Datasets published to the SAFE Project community on Zenodo can have one of the following three access status:

Open: the dataset is open for download, or
Embargoed: the dataset will become open at a given date in the future, or
Restricted: accessing the dataset requires permission. Zenodo provides a mechanism to request permission from the SAFE Project, which will always be passed onto the specific dataset authors.

In practice, restricted data is little used and a fourth ‘Closed’ status is not accepted for SAFE datasets. Note that a particular dataset concept may have a versions with a mixture of access statuses.

Dataset discovery and indexing

The Zenodo website provides the ability to search the text of record descriptions, including a search tool specifically for the SAFE Project community. However, these searches are unstructured and do not cover all of the metadata contained within published datasets. The SAFE Project website therefore maintains a search API that allows structured queries to be performed on the following:

dates: the start and endpoints of data collection within a dataset,
fields: the text and field type of fields within indidual data tables within a dataset,
authors: the authors of datasets,
text: a free text search of dataset, worksheet and field titles and descriptions.
taxa: the taxa included in a dataset, and
spatial: matching datasets by sampling location.

In addition, the API provides a record endpoint that allows the full record metadata to be downloaded in JSON format.

SAFE data directories

The safedata package stores downloaded datasets, record metadata and key index files within a data directory, which is used as a local repository of the datasets used by an individual researcher. The structure of the directory is critical to the operation of the package and the datasets themselves are under version control: users must not change the structure or the file contents of this directory.

The directory structure is as follows: three index files are stored in the root of the directory, which will also contain a folder named with the concept id number of each dataset that has been downloaded. These concept folders will then contain at least one subfolder giving named with the record number of a downloaded dataset. These subfolders will then contain the record files downloaded from Zenodo and a JSON format file containing the dataset metadata. The directory also contains a JSON format file that that records the base URL of the dataset index website.

For example:

gazetteer.geojson
index.json
location_aliases.csv
url.json
1400561/1400562/1400562.json
1400561/1400562/Psomas_Ant_Pselaphine_SAFE_dataset.xlsx

Andy Aldersley and David Orme

2022-08-01