- SMARDY: Zero-Trust FAIR Marketplace for Research Data
Authors: Ion Dorinel Filip, Cosmin Ionite, Alba Gonzalez-Cebrian, Mihaela Balanescu, Ciprian Dobre, Adriana E. Chis, Dave Feenan, Adrian-Alexandru Buga, Ioan-Mihai Constantin, George Suciu, George Iordache, Horacio Gonzalez-Velez
Conference: 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan
Transparent evaluations of FAIRness are increasingly required by a wide range of stakeholders, from scientists to publishers, funding agencies, and policy makers. We propose a scalable, automatable framework to evaluate digital resources that encompasses measurable indicators, open-source tools, and participation guidelines, which come together to accommodate domain-relevant, community-defined FAIR assessments. The components of the framework are: (1) Maturity Indicators – community-authored specifications that delimit a specific automatically measurable FAIR behavior; (2) Compliance Tests – small Web apps that test digital resources against individual Maturity Indicators; and (3) the Evaluator, a Web application that registers, assembles, and applies community-relevant sets of Compliance Tests against a digital resource, and provides a detailed report about what a machine “sees” when it visits that resource. We discuss the technical and social considerations of FAIR assessments, and how these translate to our community-driven infrastructure. We then illustrate how the output of the Evaluator tool can serve as a roadmap to help data stewards incrementally and realistically improve the FAIRness of their resources.
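To make the framework's components concrete, the sketch below shows what a single Compliance Test might look like as a function that checks one hypothetical Maturity Indicator (that a resource's metadata carries a globally unique, resolvable identifier) and returns a machine-readable verdict the Evaluator could aggregate. The indicator name, field names, and report shape are illustrative assumptions, not the framework's actual API.

```python
# A minimal sketch of one Compliance Test, assuming metadata arrives as a
# dict and the indicator/report fields shown here (all hypothetical).

RESOLVABLE_PREFIXES = ("https://doi.org/", "https://hdl.handle.net/")

def compliance_test_identifier(metadata: dict) -> dict:
    """Check a hypothetical Maturity Indicator: the metadata must contain
    a globally unique identifier under a resolvable persistent scheme."""
    identifier = metadata.get("identifier")
    passed = isinstance(identifier, str) and identifier.startswith(RESOLVABLE_PREFIXES)
    return {
        "indicator": "unique-resolvable-identifier",  # illustrative name
        "passed": passed,
        "evidence": identifier,
    }

# An Evaluator-style run would apply a set of such tests to one resource
# and assemble the verdicts into a report.
report = [
    compliance_test_identifier({"identifier": "https://doi.org/10.1234/abcd"}),
    compliance_test_identifier({"title": "dataset with no identifier"}),
]
print([r["passed"] for r in report])  # → [True, False]
```

The key design point the abstract implies is that each test is small and independent, so communities can author, register, and combine them without touching the Evaluator itself.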
- Automatic Versioning of Time Series Datasets: a FAIR Algorithmic Approach
Authors: A. González-Cebrián, L. A. McGuinness, M. Bradford, A. E. Chis and H. González-Vélez
Conference: 2022 IEEE 18th International Conference on e-Science (e-Science), Salt Lake City, UT, USA, 2022
As one of the fundamental concepts underpinning the FAIR (Findability, Accessibility, Interoperability, and Reusability) guiding principles, data provenance entails keeping track of each version of a given dataset, from its original to its latest version. However, standard terms to determine and include versioning information in the metadata of a given dataset are still ambiguous and do not explicitly define how to assess the overlap of information between items along a versioning stream. In this work, we propose a novel approach for automatic versioning of time series datasets, based on the use of parameters from two dimensionality reduction approaches, namely Principal Component Analysis and Autoencoders. That is to say, we systematically detect and measure similarities (information distances) in datasets via dimensionality reduction, encode them as different versions, and then automatically generate provenance metadata via a FAIR versioning service using the W3C DCAT 3.0 nomenclature. We illustrate this approach with two time series datasets and demonstrate how the proposed parameters effectively assess the similarity between different data versions. Our results show that the proposed version similarity metrics are robust (s(0,1) = 1) to the alteration of up to 60% of cells, the removal of up to 60% of rows, and the log-scale transformation of variables. In contrast, row-wise transformations (e.g., converting absolute values to a percentage of a second variable) yield minimal similarity values (s(0,1) < 0.75). Our code and datasets are openly available to enable reproducibility.
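As a rough illustration of the PCA branch of this idea, the sketch below compares two dataset versions through their principal subspaces: it extracts the top PCA loadings of each version and scores their overlap via the cosines of the principal angles between the subspaces, yielding a similarity in [0, 1]. This is a simplified stand-in, assuming a subspace-overlap score rather than the paper's actual s(0,1) metric; the function names and the choice of k are illustrative.

```python
# A minimal sketch of PCA-based version similarity, assuming a
# subspace-overlap score (not the paper's exact metric).
import numpy as np

def pca_loadings(X: np.ndarray, k: int = 2) -> np.ndarray:
    """Top-k PCA loadings (right singular vectors) of a centred matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T  # shape (n_features, k), orthonormal columns

def subspace_similarity(X_ref: np.ndarray, X_new: np.ndarray, k: int = 2) -> float:
    """Similarity in [0, 1] between the PCA subspaces of two versions:
    mean squared cosine of the principal angles between them."""
    P, Q = pca_loadings(X_ref, k), pca_loadings(X_new, k)
    # Singular values of P^T Q are the cosines of the principal angles.
    cosines = np.linalg.svd(P.T @ Q, compute_uv=False)
    return float(np.mean(cosines ** 2))

rng = np.random.default_rng(0)
X0 = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 5))  # reference version
X1 = X0.copy()
X1[rng.random(X0.shape) < 0.10] = 0.0                     # zero ~10% of cells

print(round(subspace_similarity(X0, X0), 3))  # identical versions → 1.0
print(0.0 <= subspace_similarity(X0, X1) <= 1.0)
```

A versioning service along these lines would compare each new submission against the stored model parameters of earlier versions and, when the score falls below a threshold, mint a new version entry with DCAT provenance metadata.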
- Automated Data Versioning Using Statistical Machine Learning
Poster presented by Alba Gonzalez-Cebrian
Conference: SIAM Conference on Mathematics of Data Science (MDS22), 2022