Hugging Face
Hugging Face (🤗) is a platform that allows developers to train and deploy open-source AI models. Like GitHub, it provides a space for developers to build and deploy AI applications, including language models, transformers, text2image, and more.
One of the standout features of the platform is "🤗 Datasets", a collection of over 5,000 ML datasets that are available for use.
In this guide, we will walk through configuring Hugging Face Datasets with Storj using s3fs until a Storj-native integration pattern is defined.
Prerequisites
A Hugging Face account and familiarity with the platform (see Quick Start Guide)
Familiarity with Colab or an equivalent environment for running code (see Notebooks)
Storj S3-compatible access key and secret key (see Getting started)
A bucket created on Storj (see Create buckets)
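Set up Storj with s3fs
Hugging Face Datasets reads from and writes to Storj through s3fs, a Python filesystem interface for S3-compatible storage.
First, install the required dependencies: the datasets and s3fs Python packages.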
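Once the packages are installed (typically with `pip install -U datasets s3fs`), a quick import check can confirm that the environment is ready. This is a minimal sketch: the package list is an assumption based on the tools this guide names, and `missing_packages` is a hypothetical helper, not part of either library.

```python
import importlib.util

# Assumed dependency list: the Hugging Face Datasets library and s3fs.
REQUIRED_PACKAGES = ["datasets", "s3fs"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```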
Next, enter your Storj S3-compatible access key and secret key (see Getting started).
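The credentials are usually collected into a `storage_options` dictionary that s3fs understands. The sketch below assumes Storj's hosted gateway at `https://gateway.storjshare.io` (adjust it if you use a different gateway); the environment variable names and the two helper functions are hypothetical and exist only for illustration.

```python
import os

# Assumed endpoint: Storj's hosted S3-compatible gateway.
STORJ_ENDPOINT = "https://gateway.storjshare.io"


def build_storage_options(access_key, secret_key, endpoint_url=STORJ_ENDPOINT):
    """Build the keyword arguments s3fs needs to reach Storj."""
    return {
        "key": access_key,
        "secret": secret_key,
        "client_kwargs": {"endpoint_url": endpoint_url},
    }


# Hypothetical environment variable names; the fallbacks are placeholders.
storage_options = build_storage_options(
    os.environ.get("STORJ_ACCESS_KEY", "your-access-key"),
    os.environ.get("STORJ_SECRET_KEY", "your-secret-key"),
)


def make_filesystem(options):
    """Create an s3fs filesystem object from the storage options."""
    import s3fs  # imported here so the options can be built without s3fs

    return s3fs.S3FileSystem(**options)
```

Reading the keys from the environment keeps them out of notebooks that may be shared; pasting them directly into the dictionary also works for a quick test.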
Create a bucket (see Create buckets) for the dataset to be stored in. In this walk-through, the bucket will be called my-dataset-bucket.
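If you prefer to create the bucket from code rather than from the Storj console, s3fs can do it through its `exists` and `mkdir` methods. This is a sketch under that assumption; `ensure_bucket` is a hypothetical helper that accepts any filesystem object exposing those two methods.

```python
BUCKET_NAME = "my-dataset-bucket"


def ensure_bucket(fs, bucket_name):
    """Create the bucket if it does not exist; return True if it was created."""
    if fs.exists(bucket_name):
        return False
    fs.mkdir(bucket_name)
    return True


# Example call, assuming s3fs is installed and the keys are valid:
# import s3fs
# fs = s3fs.S3FileSystem(
#     key="your-access-key",
#     secret="your-secret-key",
#     client_kwargs={"endpoint_url": "https://gateway.storjshare.io"},
# )
# ensure_bucket(fs, BUCKET_NAME)
```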
Transfer the existing Hugging Face dataset to Storj
If your dataset is already on the Hugging Face Hub, you can use the load_dataset_builder function to download it and transfer it to Storj. It first downloads the raw dataset to your specified cache_dir, then prepares it and uploads it to Storj using the storage_options defined previously.
Here we transfer the dataset imdb to Storj.
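A minimal sketch of the transfer, assuming a recent version of the datasets library in which `download_and_prepare` accepts `storage_options`. The helper names `dataset_uri` and `transfer_dataset`, the Parquet output format, and the placeholder credentials are illustrative choices, not requirements.

```python
def dataset_uri(bucket, name):
    """Return the s3:// path where the prepared dataset will be written."""
    return f"s3://{bucket}/{name}"


def transfer_dataset(name, output_dir, storage_options, cache_dir=None):
    """Download a Hub dataset and write it to `output_dir` as Parquet files."""
    from datasets import load_dataset_builder  # requires the datasets package

    # Raw files are downloaded to cache_dir (the library default when None).
    builder = load_dataset_builder(name, cache_dir=cache_dir)
    # The prepared dataset is written to the remote path through s3fs.
    builder.download_and_prepare(
        output_dir,
        storage_options=storage_options,
        file_format="parquet",
    )
    return output_dir


# Placeholder credentials; substitute your own Storj keys.
storage_options = {
    "key": "your-access-key",
    "secret": "your-secret-key",
    "client_kwargs": {"endpoint_url": "https://gateway.storjshare.io"},
}
OUTPUT_DIR = dataset_uri("my-dataset-bucket", "imdb")

# Uncomment to run the transfer (needs network access and valid keys):
# transfer_dataset("imdb", OUTPUT_DIR, storage_options)
```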
Save the dataset to Storj
Once you've encoded a dataset, you can persist it using the save_to_disk method.
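As a sketch, saving can be wrapped in a small function that builds the `s3://` path and forwards the credentials. It assumes a recent datasets release in which `save_to_disk` accepts `storage_options` (older releases may expect a filesystem object instead); `save_to_storj` and the `imdb/train` path are hypothetical names chosen for this example.

```python
def save_to_storj(dataset, bucket, path, storage_options):
    """Persist a dataset under s3://<bucket>/<path> and return that URI."""
    uri = f"s3://{bucket}/{path}"
    dataset.save_to_disk(uri, storage_options=storage_options)
    return uri


# Example call, assuming the datasets package and valid Storj keys:
# from datasets import Dataset
# storage_options = {
#     "key": "your-access-key",
#     "secret": "your-secret-key",
#     "client_kwargs": {"endpoint_url": "https://gateway.storjshare.io"},
# }
# encoded_dataset = Dataset.from_dict({"text": ["a review"], "label": [1]})
# save_to_storj(encoded_dataset, "my-dataset-bucket", "imdb/train", storage_options)
```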
Load dataset from Storj
Use the load_from_disk method to download your datasets from Storj.
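Loading mirrors saving. The sketch below assumes the dataset was saved earlier under `s3://my-dataset-bucket/imdb/train` (an illustrative path) and that the installed datasets version accepts `storage_options` in `load_from_disk`; `load_from_storj` is a hypothetical wrapper.

```python
# Illustrative location; use the path you saved the dataset to.
DATASET_URI = "s3://my-dataset-bucket/imdb/train"


def load_from_storj(uri, storage_options):
    """Load a dataset previously written with save_to_disk from Storj."""
    from datasets import load_from_disk  # requires the datasets package

    return load_from_disk(uri, storage_options=storage_options)


# Example call (needs network access and valid Storj keys):
# storage_options = {
#     "key": "your-access-key",
#     "secret": "your-secret-key",
#     "client_kwargs": {"endpoint_url": "https://gateway.storjshare.io"},
# }
# dataset = load_from_storj(DATASET_URI, storage_options)
# print(len(dataset))
```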