OCR textract

Hello everyone
I’m trying to label my PDF files. I got Textract to convert my PDFs into JSON, but the output is in Textract's own format, and the code that converts it to the Labelbox format is not working. I have been trying to fix it for hours with no results. Any help is appreciated.

Hey @sfanoodi,

We actually do the extraction for you for PDFs up to 15 pages (ref: Import document data).
If you have larger PDFs, let me know. Are you using the GCP method or the AWS method (ref: GitHub - Labelbox/PDF-OCR-Transform-CLI: CLI tool for transforming third party OCR formats into Labelbox's proprietary pdf text layer format)? What error or issue are you facing specifically?

hi @ptancre
Thank you for your response. I spent days trying to figure out the OCR, and I can’t get it to work so far. Yes, I’m aware of that, but my documents are typically 30 to 40 pages.
I’m using the AWS method explained on GitHub. The Textract step works and converts the document, but the result is not in the Labelbox JSON format. When I then try to convert it to the Labelbox JSON format, the code reports success, but nothing happens! Here is the code I run through cmd:

@echo off
setlocal enabledelayedexpansion
set INPUT_FOLDER=output
set OUTPUT_FOLDER=output_converted
set FORMAT=aws-textract
set CONCURRENCY=1

:: Ensure the output folder exists
if not exist "%OUTPUT_FOLDER%" mkdir "%OUTPUT_FOLDER%"

:: Run the conversion tool
textlayer-win.exe convert --inputFolder "%INPUT_FOLDER%" --format %FORMAT% --outputFolder "%OUTPUT_FOLDER%" --concurrency %CONCURRENCY% > convert_log.txt 2>&1
if errorlevel 1 (
echo Error during conversion. Check convert_log.txt for details.
pause
exit /b 1
)

:: Check if any converted JSON files exist (the folder itself always exists at this point)
if not exist "%OUTPUT_FOLDER%\*.json" (
echo Conversion completed, but no files were generated. Check convert_log.txt for details.
pause
exit /b 1
)

echo Conversion completed successfully. Converted files are located in "%OUTPUT_FOLDER%".
pause

I ran the following code again through command prompt to double check:
C:\Users\Asus\PDF-OCR-Transform-CLI> textlayer-win.exe convert --inputFolder input --format aws-textract --outputFolder output --concurrency 2
Here is the error output:

=-=-=-=-=-=-=-=-= 1/2 =-=-=-=-=-=-=-=-=
Uploading car-1736448-22.pdf to s3://testpdf7
=-=-=-=-=-=-=-=-= 2/2 =-=-=-=-=-=-=-=-=
Uploading dorobantu-et-al-2024-the-amj-management-research-canvas-a-tool-for-conducting-and-reporting-empirical-research.pdf to s3://testpdf7
Starting Textract OCR for car-1736448-22.pdf
Starting Textract OCR for dorobantu-et-al-2024-the-amj-management-research-canvas-a-tool-for-conducting-and-reporting-empirical-research.pdf
node:internal/errors:841
const err = new Error(message);
^

Error: Command failed: aws textract start-document-text-detection --document-location '{"S3Object":{"Bucket":"testpdf7","Name":"car-1736448-22.pdf"}}'

Error parsing parameter '--document-location': Expected: '=', received: ''' for input:
'{S3Object:{Bucket:testpdf7,Name:car-1736448-22.pdf}}'
^

at ChildProcess.exithandler (node:child_process:398:12)
at ChildProcess.emit (node:events:527:28)
at maybeClose (node:internal/child_process:1092:16)
at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5) {

code: 252,
killed: false,
signal: null,
cmd: aws textract start-document-text-detection --document-location '{"S3Object":{"Bucket":"testpdf7","Name":"car-1736448-22.pdf"}}'
}


I will check, bear with us.

Status update: this seems to be specific to Windows. It works on Mac, and I was able to reproduce the error on a Windows machine.

Do you mean I need to use a Mac to be able to run this?
If so, is there anything I need to change in the code?

You would need to build the CLI and replace :

I tested this on a fork; the issue is with the way Windows parses the single quotes.

I would need to test this further (the pre-built CLI in the repo has not been updated, so don’t waste your time using it).
I will update tomorrow.
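For anyone hitting the same quoting error: the log above shows the double quotes being stripped from the JSON before the AWS CLI sees it, which is what happens when a single-quoted argument goes through the Windows command shell. One way around this class of problem is to avoid the shell entirely and pass the arguments as a list. The sketch below illustrates that idea in Python. It is not the fix applied in the fork (the CLI itself is written in Node), the function names are hypothetical, and it assumes the AWS CLI is installed as an executable named `aws` on the PATH.

```python
import json
import subprocess


def build_textract_command(bucket, name):
    """Build the argument list for the Textract call seen in the log above.

    The JSON is produced by json.dumps and kept as a single list element,
    so no shell ever has to interpret its quotes.
    """
    document_location = json.dumps({"S3Object": {"Bucket": bucket, "Name": name}})
    return [
        "aws", "textract", "start-document-text-detection",
        "--document-location", document_location,
    ]


def start_text_detection(bucket, name):
    """Run the AWS CLI without a shell and return its raw standard output."""
    result = subprocess.run(
        build_textract_command(bucket, name),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output as text
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `start_text_detection("testpdf7", "car-1736448-22.pdf")` would issue the same request as the failing command, with the JSON passed through intact on both Windows and Mac.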

Appreciate your help! I also have a MacBook. Does the CLI work on Mac without any issues?

Yep, Mac would run without issues

So I used a MacBook and converted the PDF with Textract into the Labelbox schema. I attached the PDF file and the JSON output for your review. Next, I imported both the PDF and the JSON into Labelbox. However, it says the JSON layer is not a match!
It’s taking me forever to get Labelbox working for this project. Please help me out!

Here is the code I used to import pdf and json files:

from labelbox import Client

# Initialize the Labelbox client
client = Client(api_key="") 

# Define the assets with metadata
assets = [
    {
        "row_data": {
            "pdf_url": "https://testpdf7.s3.us-east-2.amazonaws.com/PDFs/car-1736448-22.pdf",  # Replace with your actual PDF URL
            "text_layer_url": "https://testpdf7.s3.us-east-2.amazonaws.com/JSONs/car-1736448-22-lb-textlayer.json"  # Replace with your actual text layer URL
        },
        "global_key": "1736448-22",  # Unique identifier for the asset
        "media_type": "PDF",
        "metadata_fields": [
            {"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "Research Paper"},  # Tag (string)
            {"schema_id": "cko8sbczn0002h2dkdaxb5kal", "value": "cko8sbscr0003h2dk04w86hof"},  # Split (enum)
            {"schema_id": "cko8sdzv70006h2dk8jg64zvb", "value": "2024-12-17T12:00:00Z"},  # Capture date/time (datetime)
            {"schema_id": "cm1upo66m00033b6pjlm6rnj8", "value": 5}  # Skip N frames (number)
        ]
    }
]

try:
    # Create a dataset
    dataset_name = "Example Dataset with Metadata"
    dataset = client.create_dataset(name=dataset_name)
    print(f"Dataset created: {dataset_name} (ID: {dataset.uid})")

    # Upload the assets with metadata
    task = dataset.create_data_rows(assets)
    print("Uploading assets...")

    task.wait_till_done()

    # Print errors if any
    if task.errors:
        print("Errors encountered during upload:")
        for error in task.errors:
            print(error)
    else:
        print("Upload successful with no errors!")

except Exception as e:
    print(f"An unexpected error occurred: {e}")

Here is a screenshot of the error:

So the PDF and the text layer check out: I was able to upload the sample you provided, and it works.

Now, it seems the PDF on your side is not available?
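A quick way to test that is to request each URL from outside your AWS account and look at the HTTP status. The sketch below uses only the Python standard library. The function name `url_status` and its `opener` parameter are hypothetical (the parameter exists only so the function can be exercised without network access), and the check assumes the files are meant to be readable through their plain HTTPS URLs, as in the import script above.

```python
import urllib.error
import urllib.request


def url_status(url, timeout=10, opener=urllib.request.urlopen):
    """Send a HEAD request and return the HTTP status code, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        return err.code
    except urllib.error.URLError:
        # No answer at all: DNS failure, refused connection, timeout, etc.
        return None
```

Running `url_status(...)` on both the `pdf_url` and the `text_layer_url` should return 200 for each. A 403 usually means the object is not publicly readable, and a 404 usually means the path or file name is wrong.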

Thanks! Can you share the code you used to upload the text layer and PDF to Labelbox? That’s probably the issue, then.

Is my script the correct way of doing it?