Papers and publications

Poster accepted for SIAM conference – MDS22

The “Data versioning using statistical features” poster was accepted at the SIAM Conference – MDS22. Read the abstract for more information:

In research settings, it is common to generate slightly modified versions of the same original dataset due to different preprocessing strategies or the selection of certain subsets of observations. Hence, there is a need for automated and objective data and metadata comparison (and versioning) strategies.

We propose a dataset comparison strategy based on parameters derived from Principal Component Analysis (PCA) models, which could be used to trigger automated versioning mechanisms integrated as part of new metadata standards. A PCA approach is used to map the relevant information of the datasets to be compared to a space of reduced dimensionality. The statistical features of the resulting PCA models, namely the radius of the scores’ hyperellipsoid of confidence and the correlation between pairs of homologous loading vectors, are then used as quantitative metrics to compute distances between data versions. This approach has been assessed under two scenarios where dataset comparison would be of benefit: a) imputing missing cell values; and b) deleting records from an original dataset.
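The comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the synthetic data, the autoscaling step, and the use of a chi-square quantile to size the confidence region are all assumptions made for illustration, and the sketch is restricted to two components so that the quantile has a closed form.

```python
import numpy as np


def pca_model(X, n_components=2):
    """Fit PCA on autoscaled data; return loadings (features x k) and eigenvalues."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    loadings = vt[:n_components].T
    eigvals = s[:n_components] ** 2 / (X.shape[0] - 1)
    return loadings, eigvals


def loading_correlations(p_a, p_b):
    """Absolute correlation between homologous loading vectors (sign is arbitrary in PCA)."""
    k = p_a.shape[1]
    return np.array([abs(np.corrcoef(p_a[:, j], p_b[:, j])[0, 1]) for j in range(k)])


def ellipse_semi_axes(eigvals, confidence=0.95):
    """Semi-axes of the scores' confidence ellipse for a two-component model.

    Assumption: the region is sized with a chi-square quantile, which for
    2 degrees of freedom equals -2 * ln(1 - confidence).
    """
    if len(eigvals) != 2:
        raise ValueError("this sketch only handles two components")
    quantile = -2.0 * np.log(1.0 - confidence)
    return np.sqrt(quantile * eigvals)


# Hypothetical original dataset: six variables driven by two latent factors.
rng = np.random.default_rng(0)
n_rows = 500
latent = rng.normal(scale=2.0, size=(n_rows, 2))
mixing = np.array([[1, 1, 1, 1, 0, 0],
                   [0, 0, 0, 0, 1, 1]], dtype=float)
X = latent @ mixing + rng.normal(scale=0.5, size=(n_rows, 6))

# Modified version: half of the records deleted at random (scenario b).
keep = rng.choice(n_rows, size=n_rows // 2, replace=False)
X_sub = X[keep]

p_full, ev_full = pca_model(X)
p_sub, ev_sub = pca_model(X_sub)

corr = loading_correlations(p_full, p_sub)
axes_full = ellipse_semi_axes(ev_full)
axes_sub = ellipse_semi_axes(ev_sub)
rel_change = np.abs(axes_sub - axes_full) / axes_full

print("loading correlations:", corr)
print("relative change in semi-axes:", rel_change)
```

Loading correlations close to 1 and small relative changes in the semi-axes would suggest that the modified version still carries the same structure as the original; how such values are combined into a single distance, and at what threshold a new version is declared, is left open here.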

An ANOVA (analysis of variance) test shows that the selected parameters remain stable when comparing the original dataset with modified versions that have up to 50% of cells missing or up to 80% of rows removed. Future research should examine the sensitivity of the approach to other kinds of changes to the original dataset.

Find out more about the conference here.

SMARDY paper accepted at the 17th International Conference on Open Repositories

The “SMARDY: A Research Data Marketplace” paper was accepted at the 17th International Conference on Open Repositories, held 6–9 June 2022 in Denver, Colorado, USA.

Open Repositories (OR) is dedicated to providing a welcoming and positive experience for everyone, whether they are in a formal session or a social setting, are taking part in activities online, or are conference staff or hosts. OR participants come from all over the world and bring with them a wide variety of professional, personal and social backgrounds; whatever these may be, we treat colleagues with dignity and respect. We all represent the OR community. OR does not tolerate harassment and discrimination in any form.

Find out more about the conference here.