Papers and publications

Standardised Versioning of Datasets: a FAIR–compliant Proposal

González–Cebrián, A., Bradford, M., Chis, A.E. et al. Standardised Versioning of Datasets: a FAIR–compliant Proposal. Sci Data 11, 358 (2024). https://doi.org/10.1038/s41597-024-03153-y

Abstract

This paper presents a standardised dataset versioning framework for improved reusability, recognition and data version tracking, facilitating comparisons and informed decision-making for data usability and workflow integration. The framework adopts a software engineering-like data versioning nomenclature (“major.minor.patch”) and incorporates data schema principles to promote reproducibility and collaboration. To quantify changes in statistical properties over time, the concept of data drift metrics (d) is introduced. Three metrics (d_P, d_E,PCA, and d_E,AE) based on unsupervised Machine Learning techniques (Principal Component Analysis and Autoencoders) are evaluated for dataset creation, update, and deletion. The optimal choice is the d_E,PCA metric, combining PCA models with splines. It exhibits efficient computational time, with values below 50 for new dataset batches and values consistent with seasonal or trend variations. Major updates (i.e., values of 100) occur when scaling transformations are applied to over 30% of variables while efficiently handling information loss, yielding values close to 0. This metric achieved a favourable trade-off between interpretability, robustness against information loss, and computation time.
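As a rough illustration of how a drift value could feed a “major.minor.patch” label, the sketch below maps a drift score to a version bump. It assumes the metric is reported on a 0 to 100 scale, as the values quoted in the abstract suggest. The function name `bump_version` and the thresholds `minor_at` and `major_at` are hypothetical choices made for this example, not values taken from the paper.

```python
def bump_version(version, drift, minor_at=10.0, major_at=50.0):
    """Return the next "major.minor.patch" label for a given drift score.

    minor_at and major_at are hypothetical thresholds, not the paper's.
    """
    if not 0.0 <= drift <= 100.0:
        raise ValueError("drift is assumed to lie between 0 and 100")
    major, minor, patch = (int(part) for part in version.split("."))
    if drift >= major_at:
        # Large drift: start a new major version and reset the rest.
        return f"{major + 1}.0.0"
    if drift >= minor_at:
        # Moderate drift: bump the minor component and reset the patch.
        return f"{major}.{minor + 1}.0"
    # Negligible drift: record the change as a patch.
    return f"{major}.{minor}.{patch + 1}"


print(bump_version("1.2.3", 100.0))
print(bump_version("1.2.3", 30.0))
print(bump_version("1.2.3", 0.5))
```

Keeping the thresholds as explicit parameters makes it clear that they are a policy decision for each dataset rather than a property of the metric itself.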

SMARDY: Zero-Trust FAIR Marketplace for Research Data

Authors: Ion Dorinel Filip, Cosmin Ionite, Alba Gonzalez-Cebrian, Mihaela Balanescu, Ciprian Dobre, Adriana E. Chis, Dave Feenan, Adrian-Alexandru Buga, Ioan-Mihai Constantin, George Suciu, George Iordache, Horacio Gonzalez-Velez

Conference: 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan

DOI: https://doi.org/10.1038/s41597-019-0184-5

Abstract

Transparent evaluations of FAIRness are increasingly required by a wide range of stakeholders, from scientists to publishers, funding agencies and policy makers. We propose a scalable, automatable framework to evaluate digital resources that encompasses measurable indicators, open source tools, and participation guidelines, which come together to accommodate domain relevant community-defined FAIR assessments. The components of the framework are: (1) Maturity Indicators – community-authored specifications that delimit a specific automatically-measurable FAIR behavior; (2) Compliance Tests – small Web apps that test digital resources against individual Maturity Indicators; and (3) the Evaluator, a Web application that registers, assembles, and applies community-relevant sets of Compliance Tests against a digital resource, and provides a detailed report about what a machine “sees” when it visits that resource. We discuss the technical and social considerations of FAIR assessments, and how this translates to our community-driven infrastructure. We then illustrate how the output of the Evaluator tool can serve as a roadmap to assist data stewards to incrementally and realistically improve the FAIRness of their resources.

Automatic Versioning of Time Series Datasets: a FAIR Algorithmic Approach

Authors: A. González-Cebrián, L. A. McGuinness, M. Bradford, A. E. Chis and H. González-Vélez

Conference: 2022 IEEE 18th International Conference on e-Science (e-Science), Salt Lake City, UT, USA, 2022

DOI: 10.1109/eScience55777.2022.00034

Abstract

As one of the fundamental concepts underpinning the FAIR (Findability, Accessibility, Interoperability, and Reusability) guiding principles, data provenance entails keeping track of each version for a given dataset from its original to its latest version. However, standard terms to determine and include versioning information in the metadata of a given dataset are still ambiguous and do not explicitly define how to assess the overlap of information between items along a versioning stream. In this work, we propose a novel approach for automatic versioning of time series datasets, based on the use of parameters from two dimensionality reduction approaches, namely Principal Component Analysis and Autoencoders. That is to say, we systematically detect and measure similarities (information distances) in datasets via dimensionality reduction, encode them as different versions, and then automatically generate provenance metadata via a FAIR versioning service using the W3C DCAT 3.0 nomenclature. We illustrate this approach with two time series datasets and demonstrate how the proposed parameters effectively assess the similarity between different data versions. Our results have shown that the proposed version similarity metrics are robust (s(0,1)=1) to the alteration of up to 60% of cells, the removal of up to 60% of rows, and the log-scale transformation of variables. In contrast, row-wise transformations (e.g. converting absolute values to a percentage of a second variable) yield minimal similarity values (s(0,1)<0.75). Our code and datasets are openly available to enable reproducibility.
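The idea of comparing dataset versions through PCA parameters can be sketched as follows. This is a minimal illustration, not the similarity metric defined in the paper: it swaps in a generic subspace comparison (the mean squared cosine of the principal angles between the leading PCA loading subspaces of two versions). The names `pca_loadings` and `subspace_similarity`, the choice `k=2`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def pca_loadings(X, k):
    """Return the top-k PCA loading vectors of X as columns."""
    # Standardise columns so the result does not depend on units.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[:k].T


def subspace_similarity(X0, X1, k=2):
    """Similarity in [0, 1] between the top-k PCA subspaces of two versions."""
    P0, P1 = pca_loadings(X0, k), pca_loadings(X1, k)
    # Singular values of P0^T P1 are the cosines of the principal angles.
    cosines = np.linalg.svd(P0.T @ P1, compute_uv=False)
    return float(np.mean(np.clip(cosines, 0.0, 1.0) ** 2))


# Synthetic "version 0": six variables driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
v0 = latent @ mixing + 0.1 * rng.normal(size=(500, 6))

# "Version 1": the same dataset with roughly 60% of its rows removed.
keep = rng.random(500) > 0.6
v1 = v0[keep]

print(subspace_similarity(v0, v0))
print(subspace_similarity(v0, v1))
```

Because the comparison uses loading subspaces rather than individual rows, removing rows changes the score only as far as it changes the correlation structure of the variables, which is the kind of robustness the abstract reports for its own metrics.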

Automated Data Versioning Using Statistical Machine Learning

Poster presented by Alba Gonzalez-Cebrian

Conference: SIAM Call 2022: Conference on Mathematics of Data Science (MDS22)
