Machine learning datasets on demand with zero-copy clones and snapshots for reproducible experiments

Machine learning (ML) experimentation is an iterative process that combines code and datasets. Obtaining datasets for experiments is a major challenge and, in some cases, a significant undertaking, as in climate studies that require Antarctic ice core samples. Fortunately, in most cases sizable, high-quality data is already available in production databases and data warehouses. Cloud-based ML services offer algorithms, data labeling, and annotation, but fall short on the data governance, data management, and experimental data reproduction needed for ML experiments at scale. Scalable ML model development requires:
a catalog of ML experimental data,
on-demand, versioned ML data clones for each experimental run,
systematic data changes prior to a run, made from scripts in a code repo,
snapshots of the data used in an ML run, for further cloning and run reproduction,
data provenance reporting for a complete understanding of the data and potential bias.
ML datasets on demand provides the above functionality for cloud or on-premises ML development, with internal or external datasets, on ML notebooks or machines, with Python or R models, and can be integrated with ML platforms (e.g. Hugging Face, MLflow, TensorFlow). The technology provides simple-to-use but disciplined database management for reproducible experiments and team collaboration. Elements of the solution include zero-copy cloning, writable data clones, snapshots of the data used in experiments, clones of snapshots, integration with MLOps platforms, and data provenance reporting.
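
The elements listed above can be illustrated with a small in-memory sketch. It is not the product's implementation, which this summary does not describe; it is a toy model that assumes copy-on-write layering, where a clone shares its parent's frozen row layers and stores only its own changes. The class name Dataset, its methods, and the dataset names (prod, run-001, and so on) are hypothetical and chosen only for illustration. A real system would typically apply the same idea at the storage block or page level rather than to Python dictionaries.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Toy copy-on-write dataset (hypothetical, for illustration only)."""
    name: str
    layers: tuple = ()                         # frozen row layers shared with relatives, oldest first
    delta: dict = field(default_factory=dict)  # rows written by this dataset only
    lineage: tuple = ()                        # names of the ancestors, oldest first
    writable: bool = True

    def _freeze(self) -> tuple:
        # Turn pending writes into a shared, no-longer-modified layer.
        if self.delta:
            self.layers = self.layers + (self.delta,)
            self.delta = {}
        return self.layers

    def clone(self, name: str, writable: bool = True) -> Dataset:
        # Zero-copy: the clone references the same layer objects; no rows are copied.
        return Dataset(name, self._freeze(), {}, self.lineage + (self.name,), writable)

    def snapshot(self, name: str) -> Dataset:
        # A snapshot is modeled here as a read-only clone.
        return self.clone(name, writable=False)

    def put(self, key, row) -> None:
        if not self.writable:
            raise PermissionError(f"{self.name} is a read-only snapshot")
        self.delta[key] = row

    def get(self, key):
        # Newest data wins: own writes first, then shared layers from newest to oldest.
        for layer in (self.delta, *reversed(self.layers)):
            if key in layer:
                return layer[key]
        raise KeyError(key)

    def provenance(self) -> str:
        return " -> ".join(self.lineage + (self.name,))


# A source dataset standing in for production data.
prod = Dataset("prod")
prod.put(1, {"label": "cat"})
prod.put(2, {"label": "dog"})

# Writable clone for one experimental run, with a scripted change before the run.
run = prod.clone("run-001")
run.put(2, {"label": "wolf"})

# Snapshot the data used in the run, then clone the snapshot to reproduce it.
snap = run.snapshot("run-001-snap")
rerun = snap.clone("run-001-repro")

print(prod.get(2))         # the source is unaffected by the run's change
print(rerun.get(2))        # the reproduction sees the data exactly as the run used it
print(rerun.provenance())  # lineage from source to reproduction
```

In this sketch a clone costs one tuple reference regardless of dataset size, and later writes to the source are invisible to existing clones because the shared layers are never modified after being frozen.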

Copyright 2021 Enterprise Guide. All Rights Reserved.