Labelbox should use PDF.js not Google Document AI

b.combs · March 26, 2024, 7:26pm

Hey,

My team recently looked into the new “automatic text layer” generation that happens when we upload a PDF to Labelbox for annotation. We are disappointed to see that Labelbox has made text extraction from PDFs complicated and expensive by using Google Document AI instead of just PDF.js. Switching to PDF.js would save Labelbox a ton of time and money, and would avoid relying on shoddy OCR technology. What is worse, most of our documents are going to be over the 15-page limit that Google Document AI applies.

Labelbox has gone from one of my favorite products to something we probably can’t use unless Google Document AI is scrapped for PDF.js.

gunderwood · March 26, 2024, 7:36pm

Hey,
You are definitely able to use your own text layer as you import data rows, if that helps! You just need to have the text layer in our format. Are you able to generate the text layer through PDF.js?

Thanks,
Gabe

b.combs · March 28, 2024, 5:42pm

Hey @gunderwood I super appreciate your investment in my concern/issue and for critically thinking about a solution that will work for me moving forward. I want to add some additional context to what my ideal solution is so we can keep trying to figure something out together!

We want to select a piece of text in a PDF document in a way that:

is aware of the PDF layout (cell wrapping, dual columns, etc.),
allows text to be pulled from the underlying document itself (with no OCR),
and is able to export that text to a format for machine learning outside of the labeling platform.

All while being as convenient as possible for the annotator - something I know Labelbox’s product team is also very keen on!

Note: for scanned or photographed PDFs, as well as images, we plan to use OCR; we’re not unreasonable. We only want to use OCR to get something into a text format when it isn’t in one already, so that it can then be parsed by a computer.

We don’t want to use OCR to take a text format that is already being parsed by a computer, convert it to an image, and then spend a bunch of resources to turn it back into a different, worse text format.

I am happy to jump on a zoom call with you, if you’d like to discuss, screen share, brainstorm, etc. Thanks again for your attention!

gunderwood · March 28, 2024, 8:42pm

Hey Brent,
No problem at all, I created a Jira ticket for tracking! I will also set up a call on Monday for early next week to go over this!

Thanks,
Gabe

b.combs · April 2, 2024, 3:13am

Hey @gunderwood - we had a breakthrough on our side using MuPDF to extract text from a native PDF using the coordinate info we receive from Labelbox outputs. This means my problem is solved. However, I still think what’s best for the Labelbox product is a solution that does not require Google Document AI. Obviously this is a personal opinion - I love Labelbox either way.
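
For anyone following along, here is a rough sketch of what that kind of extraction could look like with PyMuPDF. It is not the exact code we run. It assumes the exported box is a dict with `left`, `top`, `width` and `height` keys expressed in PDF points, and a zero-based page index; check those assumptions against your own export before relying on it. The function names are made up for illustration.

```python
from typing import Dict, Tuple


def bbox_to_rect(bbox: Dict[str, float]) -> Tuple[float, float, float, float]:
    """Convert a left/top/width/height box into (x0, y0, x1, y1).

    Assumption: all values are already in PDF points with the origin at
    the top-left of the page, which is also what PyMuPDF uses.
    """
    x0 = bbox["left"]
    y0 = bbox["top"]
    return (x0, y0, x0 + bbox["width"], y0 + bbox["height"])


def extract_text_in_box(pdf_path: str, page_index: int, bbox: Dict[str, float]) -> str:
    """Read the native text inside one box of one page, with no OCR involved."""
    # PyMuPDF is imported lazily so bbox_to_rect can be used without it.
    import fitz

    rect = fitz.Rect(*bbox_to_rect(bbox))
    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        # clip= restricts extraction to the given rectangle.
        return page.get_text("text", clip=rect).strip()
    finally:
        doc.close()
```

Calling `extract_text_in_box("contract.pdf", 0, {"left": 72, "top": 100, "width": 200, "height": 14})` would return whatever text the PDF itself stores in that region of the first page; the file name and numbers here are placeholders.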

gunderwood · April 2, 2024, 2:58pm

Hey @b.combs! That is great news! Do you care to share the script for the conversion so others can see? Do you still want our meeting?

b.combs · April 2, 2024, 3:05pm

If it’s not too much trouble I would love to meet with you. And yeah I’ll share shortly.

gunderwood · April 2, 2024, 3:23pm

Yea, no problem at all. I’ll keep the meeting on!

b.combs · April 3, 2024, 1:29am

https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/textbox-extraction/textbox-extract-1.py

"""
Script showing how to select only text that is contained in a given rectangle
on a page.

We use the page method 'get_text("words")' which delivers a list of all words.
Every item contains the word's rectangle (given by its coordinates, not as a
fitz.Rect in this case).
From this list we subselect words positioned in the given rectangle (or at
least intersect).
We sort this sublist by ascending y-coordinate, and then by ascending x value.
Each original line of the rectangle is then reconstructed using the itertools
'groupby' function.

Remarks
-------
1. The script puts words in the same line, if the y1 value of their bbox are
   *almost* equal. Allowing more tolerance here is imaginable, e.g. by
   taking the fitz.IRect of the word rectangles instead.

2. Reconstructed lines will contain words with exactly one space between them.

This file has been truncated.
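
Since the preview cuts off before the code, here is a minimal sketch of the approach the docstring describes. It is not the remainder of the linked script. It works on word tuples shaped like the ones PyMuPDF's `page.get_text("words")` returns, `(x0, y0, x1, y1, text, ...)`, and it treats words as being on the same line when their rounded `y1` values match, which is only one possible way to handle the "almost equal" tolerance. The helper names are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def intersects(word: Sequence, rect: Rect) -> bool:
    """True if the word's bounding box overlaps the rectangle."""
    wx0, wy0, wx1, wy1 = word[:4]
    rx0, ry0, rx1, ry1 = rect
    return wx0 < rx1 and wx1 > rx0 and wy0 < ry1 and wy1 > ry0


def text_in_rect(words: List[Sequence], rect: Rect) -> str:
    """Rebuild the lines of text whose words fall in (or touch) rect."""
    selected = [w for w in words if intersects(w, rect)]
    # Sort by rounded bottom edge (y1) first, then left edge (x0).
    selected.sort(key=lambda w: (round(w[3]), w[0]))
    lines = []
    # Words sharing a rounded y1 are joined with exactly one space.
    for _, group in groupby(selected, key=lambda w: round(w[3])):
        lines.append(" ".join(w[4] for w in group))
    return "\n".join(lines)


# With PyMuPDF installed, the words would come from something like:
#   import fitz
#   page = fitz.open("some.pdf")[0]
#   words = page.get_text("words")
sample_words = [
    (50, 100.2, 80, 110.1, "world", 0, 0, 1),
    (10, 100.0, 40, 110.0, "Hello", 0, 0, 0),
    (10, 120.0, 40, 130.0, "again", 0, 1, 0),
    (300, 100.0, 340, 110.0, "outside", 1, 0, 0),
]
print(text_in_rect(sample_words, (0, 90, 100, 140)))
```

Rounding `y1` is a crude tolerance: two words whose bottoms sit just either side of a .5 boundary would be split onto separate lines, so a real pipeline may want a wider or smarter grouping rule.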
