Managing small batches and continuous data streams

Hello, I am looking for recommendations on how best to handle a workflow where images to be annotated arrive in a regular, continuous stream instead of in convenient chunks.

In our application, new images that need annotation are created around the clock, 24x7. As such, there are no convenient checkpoints or signals for when to chunk things together into batches; instead, we upload new images to our data set individually as they come into our app.

The trick is that we want these new images to enter the annotation workflow as soon as they are uploaded, but we are unsure about creating a batch per single image. It feels inefficient and hard to manage.

On the other hand, we don’t want to wait until the end of the hour/day to make batches either, since we need to maintain a decent turnaround time on the annotation workflow and do not want an image to sit for long before being annotated.

Has anyone built a workflow for this situation, or have ideas on the best way to batch this sort of streaming data upload? What is the minimum “practical” size of a batch? What happens if I end up with very many small batches (on the order of 1-10 items each)?
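
For illustration, the obvious middle ground between the two options above (a batch per image vs. waiting until the end of the hour) is a hybrid flush trigger: flush a batch whenever it reaches N items, or whenever the oldest queued item has waited T seconds, whichever comes first. Below is a minimal Python sketch of that idea; `create_annotation_batch` is a hypothetical placeholder for whatever call actually attaches the data rows to an annotation project on your platform.

```python
import threading

class MicroBatcher:
    """Accumulate streamed items and flush them as one batch when the
    batch reaches max_size items OR the oldest item has waited
    max_age_seconds, whichever comes first."""

    def __init__(self, flush_fn, max_size=25, max_age_seconds=120):
        self.flush_fn = flush_fn      # receives the list of items to batch
        self.max_size = max_size
        self.max_age = max_age_seconds
        self._items = []
        self._timer = None
        self._lock = threading.Lock()

    def add(self, item):
        """Call once per image, from the upload handler."""
        with self._lock:
            self._items.append(item)
            if self._timer is None:
                # First item of a new batch: arm a deadline so a lone
                # item never waits more than max_age before flushing.
                self._timer = threading.Timer(self.max_age, self.flush)
                self._timer.daemon = True
                self._timer.start()
            if len(self._items) >= self.max_size:
                self._flush_locked()

    def flush(self):
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        # A stale deadline firing late can at worst flush a fresh
        # batch early, which is harmless for a latency goal.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        if self._items:
            batch, self._items = self._items, []
            self.flush_fn(batch)


# Hypothetical placeholder: swap in whatever call actually attaches
# these data rows to an annotation project on your platform.
def create_annotation_batch(items):
    print(f"creating a batch of {len(items)} data rows")

batcher = MicroBatcher(create_annotation_batch, max_size=25, max_age_seconds=120)
batcher.add("image-0001")  # invoked from the upload hook
```

With this shape, the worst-case wait before an image enters the workflow is bounded by max_age_seconds, and batch sizes stay reasonable during busy periods.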

Curious to know more about your use case that requires “instant” annotation. This sounds like quite an impressive machine learning loop. Do you also have data being exported into your model(s) this quickly?

It’s not about the export side. It is simply that new data comes in on a continuous basis (in relatively small numbers at any given moment), and the data set itself has not yet reached the threshold where ML can completely take over the processing. However, we still need quick turnaround on incoming data, so it needs to be batched as soon as practicable so that available labelers can see and do the work. Our concern is not volume so much as the end-to-end latency of the labeling process: we want to minimize the time between data collection and finished data analysis (by the human labeler).

Can any of your labelers be trained to use Catalog to send data along to annotation projects? I am struggling to understand the problem here: if export is not automated as well, then rushing to label a data row only to have that data row sit afterwards seems redundant. There must be something I am missing about your specific use case as to why the speed to label matters so much.

Yes, sorry, I mistyped. Speedy export is as important as speedy import.

But filtering for export is easier than managing small batches, which is why we tend to focus on the problems there.
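
As an example of that export-side filtering, a time-windowed poll over “labeled since the last checkpoint” keeps export latency bounded without any per-batch bookkeeping. A rough sketch; `fetch_labeled_since` and `handle_row` are hypothetical placeholders for the platform’s filtered export and for whatever consumes the finished annotations, not real API calls.

```python
import time

def poll_finished_labels(fetch_labeled_since, handle_row, interval_seconds=60):
    """Every interval, pull the rows labeled since the last checkpoint
    and hand them downstream."""
    checkpoint = time.time()
    while True:
        window_start, checkpoint = checkpoint, time.time()
        for row in fetch_labeled_since(window_start):
            # Rows labeled right at a window boundary may appear twice;
            # dedupe by row id downstream if that matters.
            handle_row(row)
        time.sleep(interval_seconds)
```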