Managing small batches and continuous data streams

adam · October 13, 2023, 5:56pm

Hello, I am looking for recommendations on how best to handle a workflow where images to be annotated arrive in a regular, continuous stream instead of in convenient chunks.

In our application, new images that need to be annotated are created around the clock, every day. As such, there are no convenient checkpoints or signals for when to group things into batches. Instead, we just upload new images to our dataset individually as they come into our app.

The trick is that we want these new images to enter the annotation workflow as soon as they are uploaded, but we are unsure about creating a batch for every single image. It feels inefficient and hard to manage.

On the other hand, we also don’t want to wait until the end of the hour or day to make batches, because we need to maintain a decent turnaround time on the annotation workflow and do not want an image to wait very long before being annotated.

Has anyone built a workflow for this situation, or does anyone have ideas on the best way to batch this sort of streaming data upload? What is the minimum “practical” size of a batch? What happens if I have many, many small batches (on the order of 1-10 items each)?

b.combs · April 23, 2024, 3:42pm

Curious to know more about your use case that requires “instant” annotation. This sounds like quite an impressive machine learning loop. Do you also have data being exported into one or more models this quickly?

adam · April 23, 2024, 6:04pm

It's not about the export side, simply the fact that new data comes in on a continuous basis (in relatively small numbers at any given moment) and the dataset itself has not yet reached the threshold where ML can completely take over the processing. However, we still need quick turnaround on incoming data, so it needs to be batched as soon as is practicable so that available labelers can see it and do the work. Our concern is not volume so much as the end-to-end latency of the labeling process. We want to minimize the time between data collection and finished data analysis (by the human labeler).
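For what it's worth, the latency goal described here maps onto a size-or-age micro-batching pattern: flush a batch when it reaches a maximum item count, or when its oldest item has waited longer than a set limit, whichever comes first. Below is a minimal sketch of that idea in plain Python. Everything in it is an assumption for illustration: `MicroBatcher`, `submit`, `max_items`, and `max_wait_s` are hypothetical names, and the `submit` callback is only a placeholder for whatever call actually creates a batch in the annotation project.

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Collects item IDs and flushes on a size limit or an age limit (hypothetical helper)."""

    def __init__(
        self,
        submit: Callable[[List[str]], None],  # placeholder for the real batch-creation call
        max_items: int = 10,                   # assumed size limit
        max_wait_s: float = 300.0,             # assumed latency budget in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._submit = submit
        self._max_items = max_items
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._pending: List[str] = []
        self._oldest: Optional[float] = None  # arrival time of the oldest pending item

    def add(self, item_id: str) -> None:
        """Queue one uploaded item; flush immediately if the size limit is reached."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(item_id)
        if len(self._pending) >= self._max_items:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes if the oldest pending item has waited too long."""
        if self._pending and self._oldest is not None:
            if self._clock() - self._oldest >= self._max_wait_s:
                self.flush()

    def flush(self) -> None:
        """Send whatever is pending as one batch; does nothing if the queue is empty."""
        if not self._pending:
            return
        batch, self._pending, self._oldest = self._pending, [], None
        self._submit(batch)
```

With this shape, `max_wait_s` becomes the upper bound on how long an image sits before being batched (plus the interval between `tick()` calls), and `max_items` keeps busy periods from producing oversized batches. During quiet periods a batch may still contain a single item, so this reduces the number of tiny batches rather than eliminating them.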

b.combs · April 23, 2024, 6:14pm

Can any of your labelers be trained to use Catalog to send data along to Annotate projects? I am struggling to understand the problem here: if export is not automated as well, then rushing to label a data row only to have that data row sit regardless seems redundant. There must be something I am missing about your specific use case as to why the speed to label matters so much.

adam · April 24, 2024, 5:38pm

Yes, sorry, I mistyped. Speedy export is as important as speedy import.

adam · April 24, 2024, 5:40pm

But filtering for export is easier than managing small batches, which is why we tend to focus on the batching problems.
