"Benchmarks" or "Consensus"? And how to add gold standard labels using Python programmatically

Hi, I have a dataset with gold standard labels. Now I want to have at least two annotators annotate the data and calculate the agreement 1) among the annotators and 2) between the annotators (where they agree) and the gold standard labels, then export the data. A few questions here:

  1. First of all, should I pick the “Benchmarks” or “Consensus” project type for this? I think technically I may be able to do it with either, but I’m wondering which project type is a better choice in my scenario?

  2. If I pick the “Benchmarks” project, how can I mark the gold standard in the project? I know how to do this with the UI, but let’s say we have thousands of data rows — we cannot go and “add to benchmark” one by one in the UI. So I wonder how I can do it programmatically with the Python SDK?

  3. Let’s say I already specified the gold standard data rows in the “Benchmarks” project (they move to the Done section now), how can I assign at least two annotators to these gold standard data rows for annotation? For “Consensus,” when we set up the project, I see we have an option to choose # of annotators, but I wonder how we can do it for Benchmarks after specifying the gold standard data rows.
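For concreteness, the agreement computations I have in mind in (1) and (2) would be something like this pure-Python sketch, assuming the labels have been exported into plain dicts of `{data_row_id: class_label}` per annotator (the data shapes and names here are hypothetical, not the SDK's export format):

```python
def pairwise_agreement(labels_a, labels_b):
    """Fraction of shared data rows where the two annotators chose the same class."""
    shared = set(labels_a) & set(labels_b)
    if not shared:
        return 0.0
    return sum(labels_a[r] == labels_b[r] for r in shared) / len(shared)

def agreement_with_gold(labels_a, labels_b, gold):
    """On rows where the two annotators agree, the fraction matching the gold label."""
    agreed = {r: labels_a[r]
              for r in set(labels_a) & set(labels_b)
              if labels_a[r] == labels_b[r] and r in gold}
    if not agreed:
        return 0.0
    return sum(v == gold[r] for r, v in agreed.items()) / len(agreed)
```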

Hi @pedram ,

Both quality methods would be valid in your case:

  • Consensus → import the ground truth via the SDK; consensus will calculate IoU from there. However, you would not have a visual “winner”, though you can filter by creator_id.

  • Benchmark → do the same as for Consensus, but this time you can choose a winner label programmatically via create_benchmark() (ref: Labelbox Python API reference — Python SDK reference 3.59.0 documentation)
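A minimal sketch of the Benchmark route, assuming SDK 3.x where `Project.labels()` yields `Label` objects exposing a `created_by` relationship and a `create_benchmark()` method (the gold account email, API key, and project ID below are placeholders):

```python
def mark_gold_as_benchmarks(project, gold_email):
    """Promote every label created by the gold-standard account to a benchmark."""
    count = 0
    for label in project.labels():  # paginated collection of Label objects
        creator = label.created_by()
        if creator is not None and creator.email == gold_email:
            label.create_benchmark()
            count += 1
    return count

def run():
    import labelbox as lb  # pip install labelbox
    client = lb.Client(api_key="YOUR_API_KEY")
    project = client.get_project("YOUR_PROJECT_ID")
    print(mark_gold_as_benchmarks(project, "gold@example.com"))
```

The idea is simply: the account that imported the ground truth is the filter, so one pass over the project’s labels promotes all of its labels at once instead of clicking “add to benchmark” per row.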

Now, if you need a specific number of labels per data row (2), Consensus would be the best fit here.
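If you go the Consensus route, the coverage and number of labels can also be set from the SDK; here is a sketch assuming the `Project.update()` fields `auto_audit_percentage` and `auto_audit_number_of_labels` drive consensus in SDK 3.x (treat those field names as an assumption to verify against the reference above):

```python
def consensus_settings(coverage, labels_per_row):
    """Validate and build the kwargs for Project.update() (field names assumed)."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    if labels_per_row < 2:
        raise ValueError("consensus needs at least 2 labels per data row")
    return {"auto_audit_percentage": coverage,
            "auto_audit_number_of_labels": labels_per_row}

# usage (untested sketch):
# project.update(**consensus_settings(coverage=1.0, labels_per_row=2))
```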

Hope this helps.

Many thanks,
