For our model training, we utilize various kinds of linear and nonlinear image transformations to expand our training dataset.
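To make the fan-out concrete, here is a minimal sketch of the kind of augmentation pipeline we mean, on a toy grayscale image represented as a list of rows. The specific transforms (flip, transpose, gamma correction) are just stand-ins for our actual linear/nonlinear pipeline:

```python
# Toy augmentation fan-out: each real image yields several augmented copies.
# These transforms are illustrative, not our actual pipeline.
def augment(image):
    flipped = [row[::-1] for row in image]                # linear: horizontal flip
    transposed = [list(col) for col in zip(*image)]       # linear: transpose
    gamma = [[round(255 * (p / 255) ** 0.5) for p in row] # nonlinear: gamma correction
             for row in image]
    return [flipped, transposed, gamma]

img = [[0, 64], [128, 255]]
variants = augment(img)
print(len(variants))  # 3 augmented copies per original image
```

Even with only three transforms per image, the dataset quadruples, which is why storage and versioning of this data becomes a real question.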
The ground-truth labels for this augmented data would be derived by applying the same transformations to the ground-truth labels generated in Labelbox. We could then upload these augmentations and their transformed labels back into Labelbox to be included in curated/versioned datasets with “Models”.
But I see no features or special handling for ‘augmented’ data. Also, uploading already-labeled data will consume limited/valuable upload space within Labelbox.
We plan to use specific ‘metadata’ tags and fields to tag our augmented data. However, controlling which augmented data are included, and how many images, is a challenge with how “Models” are currently configured; see this post for more.
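As a sketch of what we have in mind, each augmented data row would carry a small set of metadata fields describing how it was produced. The field names here (`is_augmented`, `augmentation_type`, etc.) are our own convention, not a built-in Labelbox schema:

```python
# Hypothetical metadata record attached to each augmented data row.
# Field names are our own convention, not a built-in Labelbox schema.
def make_augmentation_metadata(parent_row_id, aug_type, params):
    return [
        {"name": "is_augmented", "value": "true"},
        {"name": "augmentation_type", "value": aug_type},
        {"name": "parent_data_row", "value": parent_row_id},
        {"name": "augmentation_params", "value": str(params)},
    ]

fields = make_augmentation_metadata("row-abc", "elastic_warp", {"alpha": 34, "sigma": 4})
print(len(fields))  # 4 metadata fields per augmented row
```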
It would be nice to then be able to filter augmented vs. non-augmented data when analyzing “Model Runs” inferences and metrics. We’d use the entire versioned “Model” dataset for training (including the augmented data), but would likely only care about how our model performs on the real, non-augmented data.
I was wondering if there are any specific plans or advice for using augmented data within Labelbox.
Hi Chris. This is a valid point.
It is true that currently model-run datasets are designed to contain mostly real, non-augmented data, and only support metrics and predictions for them. The rationale is that augmentation can sometimes expand the dataset by a lot (10x or 100x depending on how much you augment it), and many users will generate augmented data on the fly during training instead of storing it.
That being said, augmented data is an important part of data versioning for model training. We will consider adding metadata filter support in model runs.
At the same time, a hacky workaround to distinguish augmented/non-augmented data could be: upload custom metrics in your predictions (say 0 for augmented data, 1 for non-augmented data) and use the metrics filter on this custom metric to separate them.
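A rough sketch of that workaround, building the per-data-row metric payloads in plain Python. The payload shape and the metric name `is_real` are illustrative; check the Labelbox model-run import docs for the exact schema:

```python
# Workaround sketch: attach a scalar custom metric to every prediction,
# 0.0 for augmented rows and 1.0 for real rows, then filter on it.
# The dict shape and the metric name "is_real" are illustrative only.
def is_real_metric(data_row_id, augmented):
    return {
        "dataRow": {"id": data_row_id},
        "metricName": "is_real",
        "metricValue": 0.0 if augmented else 1.0,
    }

rows = [("row-a", False), ("row-a-flip", True), ("row-a-warp", True)]
metrics = [is_real_metric(rid, aug) for rid, aug in rows]
real_only = [m for m in metrics if m["metricValue"] == 1.0]
print(len(real_only))  # 1 real row out of 3
```

Filtering on `is_real == 1.0` in the Model Run UI would then restrict metrics views to the real data.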
Hello @kyang thanks for your reply!
Ya, I think having at least a way to version which types of augmentations were used, and how many were produced, for a given training dataset would be really useful.
It does seem wasteful to store all these augmented images within Labelbox - but having at least metadata on the augmentations that comprise your dataset would add great transparency to your dataset distribution and its final curation state.
Having a ‘metrics’ visualization for viewing your dataset composition - filterable by different selectable metadata views (whether from media attributes, custom metadata, ontology data, etc.) - might be useful for actually seeing what your dataset consists of. That would support augmentation experimentation, to see which augmentation curation best boosts your model performance on real data. We use some nonlinear augmentations that can hurt performance if not balanced properly.
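The kind of composition summary we mean can be sketched in a few lines: given per-row metadata (using our own hypothetical `augmentation_type` field), tally how the dataset breaks down into real vs. each augmentation type:

```python
from collections import Counter

# Sketch: summarize dataset composition from per-row metadata so the
# augmented/real balance is visible before training.
# The "augmentation_type" field is our own convention.
rows = [
    {"id": "img_001",      "augmentation_type": None},
    {"id": "img_001_flip", "augmentation_type": "horizontal_flip"},
    {"id": "img_001_warp", "augmentation_type": "elastic_warp"},
    {"id": "img_002",      "augmentation_type": None},
]
composition = Counter(r["augmentation_type"] or "real" for r in rows)
print(composition)  # counts per category: real, horizontal_flip, elastic_warp
```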
Also, knowing which real images in Labelbox are the ‘parent’ of each augmentation, and how many augmentations of each function/type (& its input parameters, if applicable) were produced, would be the closest thing to dataset versioning without actually storing the images. I don’t know what this feature would actually look like, but just a thought.
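One possible shape for this, purely as a sketch: a provenance manifest that records, per real ‘parent’ image, which augmentation functions (and parameters) were applied and how many variants each produced, without storing the augmented images themselves. All names here are hypothetical:

```python
from collections import defaultdict

# Hypothetical provenance manifest: parent image id -> list of
# augmentation records, instead of storing the augmented images.
manifest = defaultdict(list)

def record_augmentation(parent_id, fn_name, params, count):
    manifest[parent_id].append({"fn": fn_name, "params": params, "count": count})

record_augmentation("img_001", "horizontal_flip", {}, 1)
record_augmentation("img_001", "elastic_warp", {"alpha": 34, "sigma": 4}, 5)

total = sum(e["count"] for e in manifest["img_001"])
print(total)  # img_001 is the parent of 6 augmented images
```

A manifest like this would capture the full augmentation curation state of a dataset version at a tiny fraction of the storage cost of the images themselves.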