How to export labeled text from a PDF file

Hi community,
Labelbox is awesome, but I can’t find a way to do the export we need for our first test. I have annotated a PDF, and I know how to export the JSON (Annotate → Project → Data Rows → Export data v2). But I need an export file that contains both the labels (objects) and the marked data (the marked text from the PDF file). How can I perform this export?

Hey @max_hellmann, welcome to the Community!

Glad you are enjoying Labelbox! Could you provide an example of what you mean by marked data?

Thanks,
PT


How large are the PDF files you are labeling? Are they greater than 15 pages?

For the project did you set the data type to “Document”?

For the labeling tools in settings/project ontology, are you using “Text Entity” labels?

The PDF document has 43 pages and a file size of 365 KB.

I created the project with the Document data type.
I select a section of text in the PDF file and then annotate it with an object (ontology for the document type → text entity object → the object's name is textelement).

So, Labelbox uses Google Document AI, which means it can only automatically generate a text layer for you if your documents are under 15 pages. This was, and still is, a huge problem for me as well, because the majority of my documents are over 15 pages and I do not want to introduce any OCR into any part of my machine learning work, including labeling operations.

You can test this out by finding a document under 15 pages, uploading it to Catalog, and then annotating it. In that case you should see the text you labeled in the export.
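To check what the export actually contains for your annotations, something like the following might help. It is only a rough sketch: the export schema isn't shown in this thread, so the code assumes nothing more than that each annotation is a JSON object with a "name" key (here "textelement", the object name mentioned above) and that the export is either a single JSON document or newline-delimited JSON. The function names are made up for illustration and are not part of any Labelbox SDK.

```python
import json
from typing import Any, Dict, Iterator, List


def find_annotations(node: Any, name: str) -> Iterator[Dict[str, Any]]:
    """Yield every JSON object inside `node` whose "name" field equals `name`."""
    if isinstance(node, dict):
        if node.get("name") == name:
            yield node
        # Keep descending: annotations may be nested at any depth.
        for value in node.values():
            yield from find_annotations(value, name)
    elif isinstance(node, list):
        for item in node:
            yield from find_annotations(item, name)


def annotations_from_export(text: str, name: str) -> List[Dict[str, Any]]:
    """Parse an export (one JSON document or newline-delimited JSON)
    and collect all objects named `name`."""
    try:
        records = [json.loads(text)]
    except json.JSONDecodeError:
        # Fall back to one JSON record per non-empty line.
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    return [hit for record in records for hit in find_annotations(record, name)]
```

You would read the downloaded export file into a string, call `annotations_from_export(contents, "textelement")`, and print the returned objects to see whether they carry the selected text or only positions.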

Moving forward, you’ll need to generate your own text layers for the PDFs and include them with the PDFs when you upload, in order to work with PDFs longer than 15 pages. There are others here who are more technical and can help you with text layer generation.