Not sure if this is the proper place for this, but I would like to put this out there as a feature idea/request.
Currently, the labeled data selection for “Models” is not sufficient for our dataset curation needs. Only being able to select data at the level of a whole ‘Project’ or ‘Dataset’ for a given Ontology is not granular enough.
Each of our network models revolves around its own single labeling Project - so all possible data/labels we could want for a “Model” are already within a single project. Furthermore, the data uploaded to a set of 1 or more ‘Datasets’ is only meant for a single labeling Project/Ontology as well; it is not shared across different labeling Projects. This is our attempt at following the guidelines for ‘Dataset’ creation.
At the dataset curation step for “Models”, it would be really useful to be able to filter labeled data by Metadata fields and their values. We use many Metadata tags within our image data to categorize and describe it better, with many tags & values being common across images.
Thus there is no single exclusive category that would lend itself to better Dataset organization when the data is first uploaded.
I propose an idea where we can filter labeled data rows by selecting Metadata fields (checking one or more of their values, if applicable), then specify a number of images to randomly grab & add to our “Model’s” versioned dataset. This user-selected number would be restricted to being less than or equal to the number of images that satisfy the selected Metadata fields/values.
Then, when continuing this process with more Metadata fields/values whose applicable data also satisfy the previous Metadata filtering, we could choose either to allow overlapping images to be included in this random sampling (overlap) or to exclude them (no overlap).
If overlap is selected and, for example, we specify 300 random images from the 1st Metadata filter and 200 random images from the 2nd, then we could end up with 500 or fewer images - it just depends on how many of the randomly selected images share overlapping Metadata fields between the 1st and 2nd filters. We would also ensure that newly selected images never push another filter past its requested count - e.g. any image selected to satisfy the 2nd filter that also satisfies the 1st filter must keep the 1st filter at the desired 300 and no more.
If no overlap is selected, then using the same example, no data row should appear in more than one Metadata filter selection. Thus, if we request 300 images for the 1st filter and 200 for the 2nd, we should always end up with 500 unique images. The only case where we wouldn’t is if the set of available images is smaller than what we requested - in which case the user could simply be shown that everything that could be selected was selected, short of their requested # of images.
The Metadata field/value filtering, plus the choice of overlap or not, seems like a game of understanding the unions and intersections across all applicable data satisfying those filters, then selecting from the proper sets to satisfy the desired number of data rows per filter. This could either be done incrementally as the user selects their Metadata filters - starting with the highest-priority filter - or after all filters have been selected, with or without any priority. I have not thought this through enough to understand how these approaches change what the sampling would select, nor their pros/cons.
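(For illustration only, here is a minimal Python sketch of the per-filter sampling idea, assuming data rows are plain dicts with an "id" and a "metadata" dict rather than actual Labelbox objects. It only deduplicates overlapping picks in the overlap case; it does not enforce the stricter refinement above that later picks never push an earlier filter past its requested count.)

```python
import random

def sample_per_filter(data_rows, filter_requests, allow_overlap=True, seed=None):
    """Randomly sample data rows per Metadata filter.

    data_rows:       list of dicts like {"id": ..., "metadata": {...}} (hypothetical shape)
    filter_requests: list of (predicate, count) pairs, applied in priority order
    allow_overlap:   if False, a row already picked by an earlier filter is
                     excluded from later filters' candidate pools
    """
    rng = random.Random(seed)
    selected = {}  # id -> row; a dict so overlapping picks are only counted once

    for predicate, count in filter_requests:
        candidates = [r for r in data_rows if predicate(r["metadata"])]
        if not allow_overlap:
            candidates = [r for r in candidates if r["id"] not in selected]
        picked = rng.sample(candidates, min(count, len(candidates)))
        for row in picked:
            selected[row["id"]] = row

    return list(selected.values())

# Example: 300 rows tagged weather=rain, then 200 tagged camera=front
# (these metadata field names are made up)
filters = [
    (lambda m: m.get("weather") == "rain", 300),
    (lambda m: m.get("camera") == "front", 200),
]
# curated = sample_per_filter(all_rows, filters, allow_overlap=False)
```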
With a tool like this, we would be able to select exactly what sorts of data appear in our versioned datasets for training and/or experimentation - this would also give great transparency into what our models see and into their performance metrics. Hope this helps!
Hi ceubel,
Regarding the idea of data curation for model runs: our Catalog has an interface for filtering and sampling data rows based on multiple metadata fields, annotations, and even similarity-based functions. We are planning to build connectors that allow users to curate training data in Catalog and send it to a Model run in the next few months. I think this fits the “no overlap” version of the scenario you described.
Let me know if you have any thoughts on the above plans. Thanks for providing good suggestions and context for your use cases.
That is great to hear, thank you for your reply! I think the filtering tools of Catalog will work great for this.
For 1 set of Catalog filters, would you then be able to randomly sample (or manually select) X# of data rows to add to a specific training/benchmark dataset? Then you could repeat this process for each desired set of filters, sampling and appending a specific number (or all) of the data rows to that same dataset until you are satisfied with its contents.
I don’t know if this would explicitly solve the ‘overlap / no overlap’ cases - but it would at least allow you to curate a dataset composed of a specific # of images per filter configuration. By default, there would just be overlap in this case, since the same data row could satisfy 2 or more sets of Catalog filter configurations.
If we could do this sequential data curation process for a given ‘Model’s’ dataset (whether for training, experimentation, benchmark, etc.) - 1) define a set of filters in Catalog, 2) sample or select X# of applicable data rows, 3) add them to that dataset, 4) repeat (1-3) until satisfied with the dataset curation - then the user would be able to prevent ‘overlap’, if desired. For instance, that overlap/no-overlap feature could just be another Catalog filter or toggle within step (1), i.e. “show me all data rows NOT already in the dataset I am curating”.
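(As a loose illustration of that loop, here is a hypothetical Python sketch - fetch_filtered_rows is a stand-in for whatever Catalog/SDK query ends up existing, not an actual Labelbox API.)

```python
import random

def curate_sequentially(fetch_filtered_rows, filter_configs, seed=None):
    """Build one curated dataset by repeating: filter -> sample -> append.

    fetch_filtered_rows: callable taking a filter config and returning matching
                         data row ids (stand-in for a Catalog filter query)
    filter_configs:      list of (config, sample_size, exclude_existing) tuples,
                         where exclude_existing is the "no overlap" toggle
    """
    rng = random.Random(seed)
    curated = set()  # data row ids already added to the dataset being built

    for config, sample_size, exclude_existing in filter_configs:
        # 1) define a set of filters (here just an opaque config object)
        matches = set(fetch_filtered_rows(config))
        # optional "no overlap" toggle: drop rows already in this dataset
        if exclude_existing:
            matches -= curated
        # 2) sample X# of the applicable data rows
        picked = rng.sample(sorted(matches), min(sample_size, len(matches)))
        # 3) add them to the dataset being curated
        curated.update(picked)
        # 4) the loop continues with the next filter configuration

    return curated
```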
P.S.
I would just like to add that being able to specifically curate datasets for ‘Models’ could be useful for deployed model monitoring too, perhaps?
Say we create a ‘Model’ named “Model XYZ Deployment”. Then say every day or hour we upload a new batch of data seen by our deployed model, along with its predictions. The predictions could either be sourced from the original inference, or we would run the batch of data through an instance of our known deployed model and upload them to Labelbox right after. Either way, each Model Run would correspond to a batch of real data rows and contain their predictions and applicable metrics, so that we can monitor the model’s status and the data it sees. The only way we’d be able to select this ‘real’ batch of data for a dataset would be if we could filter by some custom metadata tag or by datetime. Ideally, we would also do this automatically via the SDK, so that monitoring batches are continuous and automated. Once we have captured the valuable metrics and reviewed the predictions, we could then discard this real data to make room for the next batch streaming in.
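(A rough Python sketch of how such a batch could be selected, assuming hypothetical metadata fields "ingested_at" (ISO-8601 timestamp) and "batch" - none of this is an actual Labelbox API, just the selection logic we’d want to automate.)

```python
from datetime import datetime, timedelta, timezone

def select_monitoring_batch(data_rows, batch_tag=None, window_hours=24):
    """Pick the data rows for the next monitoring Model Run.

    data_rows: iterable of dicts with a hypothetical "metadata" dict holding an
               ISO-8601 "ingested_at" timestamp and an optional "batch" tag
    batch_tag: if given, select by the custom metadata tag instead of by time
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=window_hours)
    batch = []
    for row in data_rows:
        meta = row.get("metadata", {})
        if batch_tag is not None:
            if meta.get("batch") == batch_tag:
                batch.append(row)
        # timestamps assumed timezone-aware, e.g. "2024-05-01T12:00:00+00:00"
        elif datetime.fromisoformat(meta["ingested_at"]) >= cutoff:
            batch.append(row)
    return batch

# Run from a scheduler (e.g. hourly): select the latest window of data, attach
# the deployed model's predictions, upload both to a new Model Run via the SDK,
# then discard the raw data once metrics have been reviewed.
```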
I know ‘Models’ are supposed to have a unique, versioned dataset, and Labelbox doesn’t really have tools for monitoring, but it would be a neat feature to have - especially if you had all the monitoring data for a given deployed model in the same place, to compare and see what your model performance is over time.
Thanks again, Chris, for these suggestions. It seems like what you need is a way to do sequential data curation into a dataset: 1) curate the dataset based on the # of images per metadata tag to keep them at a balanced ratio, and 2) add a certain number of new data rows matching those criteria to this dataset in the future.
Labelbox has three high-level data structures for different purposes - Datasets (used for the initial data row upload), Projects (used for annotation or uploading model-assisted labels), and Model runs (versioned datasets for training).
Regarding the sequential curation you described: I think the Model run will be a good fit for that. As I mentioned in the previous reply, we will be working on sending filtered data rows from Catalog → a Model run (this can be done sequentially). For the no-overlap use case, it needs a filter on whether a certain Model run already contains a data row or not.
Regarding model monitoring, this is a really cool use case that we would like to support in the near future. Essentially, once we release SDK support for querying data with Catalog filters, you will be able to use the same query to retrieve qualifying new data and add it to a Model run.