How to export labeled text in a pdf file

max_hellmann · May 10, 2024, 12:56pm

Hi community,
Labelbox is awesome, but I can’t find a way to do the export we need for our first test. I have annotated a PDF, and I know how to export the JSON (Annotate → Project → Data Rows → Export data v2). But I need an export file that contains both the labels (objects) and the marked data (the text I marked in the PDF file). How can I perform this export?

PT · May 10, 2024, 1:20pm

Hey @max_hellmann, welcome to the Community!

Glad you are enjoying Labelbox! Could you provide an example of what you mean by marked data?

Thanks,
PT

b.combs · May 10, 2024, 2:19pm

How large are the PDF files you are labeling? Are they greater than 15 pages?

For the project did you set the data type to “Document”?

For the labeling tools in settings/project ontology, are you using “Text Entity” labels?

max_hellmann · May 10, 2024, 3:52pm

The PDF document has 43 pages and is 365 KB in size.

I created the project with the Document type.
I select a section of text in the PDF file and then annotate it with an object (the ontology uses the Document type, and the object is a Text Entity named textelement).

b.combs · May 10, 2024, 4:25pm

So, Labelbox uses Google Document AI, which means it can only generate a text layer for you automatically if your documents are under 15 pages. This was, and still is, a huge problem for me as well, because the majority of my documents are over 15 pages and I also do not want to introduce any OCR into any part of my machine learning, including labeling operations.

You can test this out by finding a document under 15 pages, uploading it to Catalog, and then annotating it. The text you labeled will then appear in the export.
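
Once the labeled text does show up in the export, something like the following could pull the label names and the marked text out of the JSON. This is only a minimal sketch: the key names (`projects`, `labels`, `annotations`, `objects`, `location`, `groups`, `page_number`, `content`) and the sample row are assumptions for illustration, not a confirmed export schema, so compare them against your own export file and adjust.

```python
import json

# Hypothetical export row. The key names are assumptions for illustration,
# not a confirmed Labelbox schema; adjust them to match your export file.
SAMPLE_ROW = json.loads("""
{
  "data_row": {"external_id": "contract.pdf"},
  "projects": {
    "project-1": {
      "labels": [
        {
          "annotations": {
            "objects": [
              {
                "name": "textelement",
                "location": {
                  "groups": [
                    {"page_number": 3, "content": "Payment is due in 30 days"}
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
""")


def labeled_text(row):
    """Return (label name, page number, text) tuples found in one export row."""
    results = []
    for project in row.get("projects", {}).values():
        for label in project.get("labels", []):
            objects = label.get("annotations", {}).get("objects", [])
            for obj in objects:
                # Each object may cover several groups of marked text.
                groups = obj.get("location", {}).get("groups", [])
                for group in groups:
                    results.append(
                        (obj.get("name"), group.get("page_number"), group.get("content"))
                    )
    return results


if __name__ == "__main__":
    for name, page, text in labeled_text(SAMPLE_ROW):
        print(f"{name}\tpage {page}\t{text}")
```

If your export file has one JSON object per line, you would call `json.loads` on each line and pass the result to `labeled_text`. Missing keys are skipped rather than raising an error, so a row without text in the export simply yields nothing.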

Moving forward, you’ll need to generate your own text layer for the PDFs and include it with each PDF when you upload in order to work with PDFs longer than 15 pages. There are others here who are more technical and can help you with text layer generation.
