Labelbox should use PDF.js not Google Document AI


My team recently looked into the new “automatic text layer” generation that happens when we upload a PDF to Labelbox for annotation. We are disappointed to see that Labelbox has decided to make a complicated and expensive solution out of text extraction from PDFs by using Google Document AI instead of just PDF.js. Using PDF.js would save Labelbox a lot of time and money, and it would avoid relying on shoddy OCR technology. What is worse is that most of our documents are going to be over the 15-page limit that Google Document AI applies.

Labelbox has gone from one of my favorite products to something we probably can’t use unless Google Document AI is scrapped for PDF.js :frowning:

You are definitely able to use your own text layer as you import data rows, if that helps! You just need to provide the text layer in our format. Are you able to generate the text layer through PDF.js?


Hey @gunderwood I super appreciate your investment in my concern/issue and for critically thinking about a solution that will work for me moving forward. I want to add some additional context to what my ideal solution is so we can keep trying to figure something out together!

We want to select a piece of text in a PDF document in a way that:

  • is aware of the PDF layout (cell wrapping, dual columns, etc.),
  • allows text to be pulled from the underlying document itself (with no OCR),
  • and is able to export that text to a format for machine learning outside of the labeling platform.

All while being as convenient as possible for the annotator - something I know Labelbox’s product team is also very keen on!

Note: for scanned or photographed PDFs, as well as images, we do plan to use OCR. We’re not unreasonable :slight_smile: We only want to use OCR to get something into a text format when it isn’t in one already - so that it can then be parsed on a computer.

We don’t want to use OCR to take a text format that is already being parsed by a computer, convert it to an image, and then spend a bunch of resources to turn it back into a different, worse text format.

I am happy to jump on a Zoom call with you if you’d like to discuss, screen share, brainstorm, etc. Thanks again for your attention!

Hey Brent,
No problem at all. I created a Jira ticket for tracking! I will also set up a call on Monday for early next week to go over this!


Hey @gunderwood - we had a breakthrough on our side using MuPDF to extract text from a native PDF using the coordinate info we receive from Labelbox outputs. This means my problem is solved - however, I think what’s best for the Labelbox product remains a solution that does not require Google Document AI. Obviously this is a personal opinion - I love Labelbox either way.
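
For anyone following along, here is a minimal sketch of the general idea, not the actual script mentioned in this thread. It assumes the PyMuPDF bindings for MuPDF (`pip install pymupdf`), a 0-based page index, and a box given as `top`/`left`/`height`/`width` in PDF points measured from the top-left corner of the page. Those field names and units are assumptions, so check them against your own export. The function names are made up for illustration.

```python
from typing import Dict, Tuple


def bbox_to_rect(bbox: Dict[str, float]) -> Tuple[float, float, float, float]:
    """Convert a {top, left, height, width} box into (x0, y0, x1, y1).

    Assumes the box is in PDF points with the origin at the top-left of
    the page, which is the coordinate system PyMuPDF uses.
    """
    x0 = bbox["left"]
    y0 = bbox["top"]
    return (x0, y0, x0 + bbox["width"], y0 + bbox["height"])


def extract_text_in_box(pdf_path: str, page_number: int, bbox: Dict[str, float]) -> str:
    """Return the native (non-OCR) text that falls inside one annotation box."""
    import fitz  # PyMuPDF; imported here so bbox_to_rect works without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]  # assumed to be a 0-based page index
        clip = fitz.Rect(*bbox_to_rect(bbox))
        # "text" mode reads the PDF's own text objects; clip limits it to the box.
        return page.get_text("text", clip=clip).strip()
```

Usage would look something like `extract_text_in_box("contract.pdf", 0, {"top": 10, "left": 20, "height": 30, "width": 40})`, where the file name and numbers are placeholders. Because the text comes straight from the PDF's text objects, this only works for native PDFs; scanned pages will return an empty string and still need OCR.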


Hey @b.combs! That is great news! Do you care to share the script for the conversion so others can see? Do you still want our meeting?

If it’s not too much trouble, I would love to meet with you. And yeah, I’ll share shortly.

Yeah, no problem at all. I’ll keep the meeting on! :slight_smile: