Rich text pdfs - custom text layer

lucamlouzada · July 31, 2024, 5:26pm

Hello,
I am trying to use the platform to annotate PDF files that are already true PDFs (rich text). However, when I upload them as documents, Labelbox performs OCR, which is not ideal (there is also a 15-page limit). One alternative is to extract the text and import it as text data rows, but then I lose all the formatting. I have been trying to generate a custom text layer in Python but can’t get it to match the Labelbox format: it is always reported as invalid, even though the metadata has “valid = True”. The sample scripts provided only accept OCR outputs as inputs, not PDFs with rich text. Any suggestions?

Thanks

smutta · August 1, 2024, 5:28am

Hello there!

Currently, we only support OCR-generated forms. Your custom textLayer should conform to the schema outlined in our documentation at JSON schema Reference.
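
As a rough illustration of what building such a layer from already-extracted words might look like, here is a minimal Python sketch. It is not taken from the Labelbox sample scripts. The field names (`number`, `width`, `height`, `units`, `groups`, `tokens`, `geometry`, `content`, `id`) and the `"POINTS"` unit are assumptions about the schema and should be checked against the JSON schema reference. `build_page` and the word tuples are hypothetical; the tuples stand in for whatever a PDF text extraction library returns for a rich-text PDF.

```python
import json
import uuid


def _geometry(x0, y0, x1, y1):
    # Assumed geometry shape: top-left corner plus width and height.
    return {"left": x0, "top": y0, "width": x1 - x0, "height": y1 - y0}


def build_page(words, number, width, height):
    """Build one page of a text layer (assumed schema, see note above).

    words: iterable of (x0, y0, x1, y1, text, line_no) tuples, e.g. the
    word boxes produced by a PDF text extraction library (hypothetical input).
    """
    # Group words by line, keeping their original order.
    lines = {}
    for x0, y0, x1, y1, text, line_no in words:
        lines.setdefault(line_no, []).append((x0, y0, x1, y1, text))

    groups = []
    for line_words in lines.values():
        tokens = [
            {
                "id": str(uuid.uuid4()),
                "content": text,
                "geometry": _geometry(x0, y0, x1, y1),
            }
            for x0, y0, x1, y1, text in line_words
        ]
        # The group's box is the bounding box of all its tokens.
        gx0 = min(w[0] for w in line_words)
        gy0 = min(w[1] for w in line_words)
        gx1 = max(w[2] for w in line_words)
        gy1 = max(w[3] for w in line_words)
        groups.append(
            {
                "id": str(uuid.uuid4()),
                "content": " ".join(w[4] for w in line_words),
                "geometry": _geometry(gx0, gy0, gx1, gy1),
                "tokens": tokens,
            }
        )

    return {
        "number": number,    # assumed to be the 1-based page number
        "width": width,
        "height": height,
        "units": "POINTS",   # assumed unit name
        "groups": groups,
    }


# Example input: two words on one line, one word on the next.
words = [
    (10, 20, 40, 32, "Hello", 0),
    (45, 20, 80, 32, "world", 0),
    (10, 40, 50, 52, "again", 1),
]
text_layer = [build_page(words, number=1, width=612, height=792)]
text_layer_json = json.dumps(text_layer, indent=2)
```

If the resulting JSON is still rejected, comparing it field by field with an OCR-generated layer that the platform accepts may help locate the mismatch.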

Thanks!
