We have some code that syncs new images to Labelbox. It lists the files in the source storage, lists the files already in Labelbox using dataset.export_data_rows(), and adds the missing ones.
We sometimes get duplicates in the Labelbox dataset, which seems to come from the API not returning the real-time state but lagging behind it.
When we create the data rows in the dataset, we set each file's original URL (in Azure Blob Storage) as the External ID field. This lets us compare the source blob storage against what is already in Labelbox.
Is the update latency of the Labelbox API known, and is there a way to make it shorter, preferably real-time? Or is there a better way to accomplish this?
I’m Ramy from Labelbox Support! I would like to help with the issue here. I see that you are having some issues with the number of data rows not being up to date with the recently appended data rows. I have a couple of suggestions for you:
- I would suggest using the two lines of code below to create and upload data. The second line will help make sure that the upload has fully completed by the time you pull the data rows.
task = dataset.create_data_rows(assets)
- In terms of using dataset.export_data_rows() to process data in the SDK, I would highly recommend using dataset.data_rows() instead. There should not be latency between uploading and the list of rows pulled showing the data rows. Was there a particular reason why you chose to use
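Putting the two suggestions together, a sync step could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper names find_missing_urls and sync_dataset are made up for this example, the asset dictionaries use the "row_data" and "external_id" keys, and the wait call is written as task.wait_till_done(), which is how the Labelbox Python SDK exposes waiting on an upload task as far as I know. Check the SDK version you are running before relying on any of this.

```python
def find_missing_urls(source_urls, existing_external_ids):
    """Return source URLs not yet in the dataset, in order, without repeats."""
    seen = set(existing_external_ids)
    missing = []
    for url in source_urls:
        if url not in seen:
            seen.add(url)  # also guards against repeats within source_urls
            missing.append(url)
    return missing


def sync_dataset(dataset, source_urls):
    """Upload only the URLs that the dataset does not already contain.

    `dataset` is assumed to behave like a Labelbox Dataset: data_rows()
    yields rows with an external_id attribute, and create_data_rows()
    returns a task object with wait_till_done().
    """
    # List current rows with data_rows() rather than export_data_rows().
    existing = [row.external_id for row in dataset.data_rows()]
    missing = find_missing_urls(source_urls, existing)
    if not missing:
        return []

    # Assumed asset shape: the blob URL is both the data and the external ID.
    assets = [{"row_data": url, "external_id": url} for url in missing]
    task = dataset.create_data_rows(assets)
    task.wait_till_done()  # block until the upload has finished
    return missing
```

Waiting on the task before the next listing is what should keep a second sync run from seeing a half-finished upload and adding the same files again.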
I can confirm that data_rows() is working. Not sure why I used export_data_rows().