OCR textract

sfanoodi · December 4, 2024, 9:21pm

Hello everyone
I’m trying to label my PDF files. I got Textract to convert my PDFs to JSON, but the output is in Textract’s own format. The code that converts it to the Labelbox format is not working. I have been trying to fix it for hours with no results. Any help is appreciated.

PT · December 5, 2024, 12:26pm

Hey @sfanoodi,

We actually do the extraction for you for PDFs up to 15 pages (ref: Import document data).
If you have larger PDFs, let me know. Are you using the GCP method or the AWS method (ref: GitHub - Labelbox/PDF-OCR-Transform-CLI: CLI tool for transforming third party OCR formats into Labelbox's proprietary pdf text layer format)? What error or issue are you facing specifically?

sfanoodi · December 5, 2024, 3:02pm

Hi @PT,
Thank you for your response. I have spent days trying to figure out the OCR and can’t get it to work so far. Yes, I’m aware of the built-in extraction, but my documents are typically 30 to 40 pages.
I’m using the AWS method explained on GitHub. The Textract extraction works and converts the document, but the output is not in the Labelbox JSON format. When I try to convert it to the Labelbox JSON format, the script reports success, but nothing happens! Here is the script I run through cmd:

@echo off
setlocal enabledelayedexpansion
set INPUT_FOLDER=output
set OUTPUT_FOLDER=output_converted
set FORMAT=aws-textract
set CONCURRENCY=1

:: Ensure the output folder exists
if not exist "%OUTPUT_FOLDER%" mkdir "%OUTPUT_FOLDER%"

:: Run the conversion tool and capture stdout and stderr in a log file
textlayer-win.exe convert --inputFolder "%INPUT_FOLDER%" --format %FORMAT% --outputFolder "%OUTPUT_FOLDER%" --concurrency %CONCURRENCY% > convert_log.txt 2>&1
if errorlevel 1 (
    echo Error during conversion. Check convert_log.txt for details.
    pause
    exit /b 1
)

:: Check that converted JSON files exist; the folder itself always exists at this point
if not exist "%OUTPUT_FOLDER%\*.json" (
    echo Conversion completed, but no files were generated. Check convert_log.txt for details.
    pause
    exit /b 1
)

echo Conversion completed successfully. Converted files are located in "%OUTPUT_FOLDER%".
pause

sfanoodi · December 5, 2024, 3:08pm

I ran the following command again from the command prompt to double-check:
C:\Users\Asus\PDF-OCR-Transform-CLI> textlayer-win.exe convert --inputFolder input --format aws-textract --outputFolder output --concurrency 2
Here is the error output:

=-=-=-=-=-=-=-=-= 1/2 =-=-=-=-=-=-=-=-=
Uploading car-1736448-22.pdf to s3://testpdf7
=-=-=-=-=-=-=-=-= 2/2 =-=-=-=-=-=-=-=-=
Uploading dorobantu-et-al-2024-the-amj-management-research-canvas-a-tool-for-conducting-and-reporting-empirical-research.pdf to s3://testpdf7
Starting Textract OCR for car-1736448-22.pdf
Starting Textract OCR for dorobantu-et-al-2024-the-amj-management-research-canvas-a-tool-for-conducting-and-reporting-empirical-research.pdf
node:internal/errors:841
const err = new Error(message);
^

Error: Command failed: aws textract start-document-text-detection --document-location '{"S3Object":{"Bucket":"testpdf7","Name":"car-1736448-22.pdf"}}'

Error parsing parameter '--document-location': Expected: '=', received: ''' for input:
'{S3Object:{Bucket:testpdf7,Name:car-1736448-22.pdf}}'
^

at ChildProcess.exithandler (node:child_process:398:12)
at ChildProcess.emit (node:events:527:28)
at maybeClose (node:internal/child_process:1092:16)
at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5) {

code: 252,
killed: false,
signal: null,
cmd: aws textract start-document-text-detection --document-location '{"S3Object":{"Bucket":"testpdf7","Name":"car-1736448-22.pdf"}}'
}

PT · December 5, 2024, 5:54pm

I will check, bear with us.

PT · December 9, 2024, 6:33pm

Status update: this seems to be specific to Windows. It works on Mac, and I can reproduce the error on a Windows machine.

sfanoodi · December 9, 2024, 7:38pm

You mean I need to use a mac to be able to run this?
If so, is there anything I need to change in the code?

PT · December 9, 2024, 8:40pm

You would need to build the CLI yourself and replace the command at this line:

https://github.com/paultancre/PDF-OCR-Transform-CLI/blob/3c182a170866c90f091ebaf684e8cc59f2568f84/src/commands/convert/convert-textract.ts#L89

          `aws s3 cp ${inputFolder}/${pdfFilename} s3://${bucketName}`,
          (error) => {
            if (error) {
              throw error;
            }
          
            // The PDF was successfully uploaded to S3
            // Run Textract OCR on the pdf
            console.log(`Starting Textract OCR for ${pdfFilename}`);
            exec(
              `aws textract start-document-text-detection --document-location "{\"S3Object\":{\"Bucket\":\"${bucketName}\",\"Name\":\"${pdfFilename}\"}}" --debug`,
              async (error, stdout) => {
                if (error) {
                  throw error;
                }
          
                const jobId = JSON.parse(stdout).JobId;
          
                // Build the textract output
                const textractResult = await buildTextractOutput(jobId);

I made a test fork. The issue is with the way Windows parses the single quotes.

I need to test this further (the pre-built CLI in the repo has not been updated, so don’t waste your time using it).
I will post an update tomorrow.
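
For readers hitting the same quoting error, here is a minimal Python sketch of one way to avoid the problem: pass the arguments as a list so no shell has to interpret the quotes around the JSON. This is not the change made in the fork. The function names `build_textract_command` and `start_textract_job` are hypothetical, and the sketch assumes the AWS CLI is installed, on the PATH, and already configured with credentials.

```python
import json
import subprocess


def build_textract_command(bucket_name: str, pdf_filename: str) -> list[str]:
    """Return the argument list for starting a Textract text-detection job."""
    # json.dumps produces the same JSON the CLI expects for --document-location.
    document_location = json.dumps(
        {"S3Object": {"Bucket": bucket_name, "Name": pdf_filename}}
    )
    return [
        "aws", "textract", "start-document-text-detection",
        "--document-location", document_location,
    ]


def start_textract_job(bucket_name: str, pdf_filename: str) -> str:
    """Run the command without a shell and return the JobId from its JSON output."""
    # Passing a list (and no shell=True) means the JSON is handed over as a
    # single argument, so cmd.exe never gets to interpret the quotes.
    result = subprocess.run(
        build_textract_command(bucket_name, pdf_filename),
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)["JobId"]


# Build (but do not run) the command for the file from this thread.
command = build_textract_command("testpdf7", "car-1736448-22.pdf")
print(command[-1])
```

Calling `start_textract_job("testpdf7", "car-1736448-22.pdf")` would actually start the job, so it needs valid AWS credentials and an uploaded PDF.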

sfanoodi · December 9, 2024, 8:47pm

Appreciate your help! I also have a MacBook. Does it work on a Mac without any issues?

PT · December 9, 2024, 9:06pm

Yep, Mac would run without issues

sfanoodi · December 18, 2024, 5:05am

So I got a MacBook and converted the PDF using Textract, based on the Labelbox schema. I attached the PDF file and the JSON output for your consideration. Next, I imported both the PDF and the JSON into Labelbox. However, it says the JSON layer is not a match!
It’s taking forever for me to be able to use Labelbox for this project. Please help me out!

Here is the code I used to import pdf and json files:

from labelbox import Client

# Initialize the Labelbox client
client = Client(api_key="")

# Define the assets with metadata
assets = [
    {
        "row_data": {
            "pdf_url": "https://testpdf7.s3.us-east-2.amazonaws.com/PDFs/car-1736448-22.pdf",  # Replace with your actual PDF URL
            "text_layer_url": "https://testpdf7.s3.us-east-2.amazonaws.com/JSONs/car-1736448-22-lb-textlayer.json"  # Replace with your actual text layer URL
        },
        "global_key": "1736448-22",  # Unique identifier for the asset
        "media_type": "PDF",
        "metadata_fields": [
            {"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "Research Paper"},  # Tag (string)
            {"schema_id": "cko8sbczn0002h2dkdaxb5kal", "value": "cko8sbscr0003h2dk04w86hof"},  # Split (enum)
            {"schema_id": "cko8sdzv70006h2dk8jg64zvb", "value": "2024-12-17T12:00:00Z"},  # Capture date/time (datetime)
            {"schema_id": "cm1upo66m00033b6pjlm6rnj8", "value": 5}  # Skip N frames (number)
        ]
    }
]

try:
    # Create a dataset
    dataset_name = "Example Dataset with Metadata"
    dataset = client.create_dataset(name=dataset_name)
    print(f"Dataset created: {dataset_name} (ID: {dataset.uid})")

    # Upload the assets with metadata
    task = dataset.create_data_rows(assets)
    print("Uploading assets...")

    task.wait_till_done()

    # Print errors if any
    if task.errors:
        print("Errors encountered during upload:")
        for error in task.errors:
            print(error)
    else:
        print("Upload successful with no errors!")

except Exception as e:
    print(f"An unexpected error occurred: {e}")

sfanoodi · December 18, 2024, 5:11am

here is a screenshot of the error:

PT · December 20, 2024, 6:34pm

So the PDF and the layer check out. I was able to upload the sample you provided, and it works.

Now, it seems the PDF on your side is not available?
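
One quick way to see whether a file is reachable is to request the URL without any credentials. The sketch below is only an illustration. `check_url` is a hypothetical helper, and it assumes the PDF and text layer URLs are meant to be publicly readable (a bucket accessed through an IAM integration will correctly answer an anonymous request with HTTP 403).

```python
import urllib.error
import urllib.request


def check_url(url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> tuple[bool, str]:
    """Send an anonymous HEAD request and report whether the URL answered with HTTP 2xx."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300, f"HTTP {response.status}"
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 403 or 404.
        return False, f"HTTP {error.code}"
    except urllib.error.URLError as error:
        # DNS failure, refused connection, timeout, and similar problems.
        return False, f"unreachable: {error.reason}"
```

For example, `check_url("https://testpdf7.s3.us-east-2.amazonaws.com/PDFs/car-1736448-22.pdf")` returns a pair such as `(False, "HTTP 403")` when the object is not publicly readable. The `opener` parameter exists only so the function can be exercised without network access.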

sfanoodi · December 20, 2024, 8:37pm

Thanks! Can you share the code you used to upload the text layer and the PDF to Labelbox? That’s probably the issue then.

sfanoodi · December 20, 2024, 8:38pm

Is my script the correct way of doing it?

PT · December 23, 2024, 12:28pm

Your script looks OK. I used Azure to test in my case. Here is what I did:

import os

from labelbox import Client

# Read the API key from an environment variable
API_KEY = os.environ.get('LABELBOX')
client = Client(api_key=API_KEY)

# Add the integration (index 12 is specific to my organization's list)
organization = client.get_organization()
iam_integration = organization.get_iam_integrations()[12]
print(iam_integration.name)

# Create the dataset with the IAM integration attached
dataset = client.create_dataset(name="PDF - Manual Layer - Azure", iam_integration=iam_integration)

assets = [
  {
    "row_data": {
      "pdf_url": "https://mlse.blob.core.windows.net/lb-pt-org/pdf_to_layer_test.pdf",
      "text_layer_url": "https://mlse.blob.core.windows.net/lb-pt-org/layer_pdf.json"
    }
  }
]

# Upload the data row and print any errors
task = dataset.create_data_rows(assets)
task.wait_till_done()
print(task.errors)

sfanoodi · December 30, 2024, 5:53pm

Thank you for your response. However, it doesn’t work for me. I’m neither a programmer nor a computer scientist. I have been trying to use your platform for a long time and haven’t found a solution yet. Do you have any premium support? I would appreciate it if someone could walk me through the process and see what I’m doing wrong, since I don’t get the same results you get. I have a deadline for this project and it’s not progressing.
Thanks

PT · December 30, 2024, 6:50pm

As mentioned in my previous reply, you would need to set up CORS on your bucket so we can retrieve the PDF.

Here are the instructions: Configure CORS
Once this is done, let me know if you still get the previous error with the layer.
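
As a rough illustration of what a CORS setup for an S3 bucket can look like, the sketch below builds a CORS configuration in the JSON shape S3 accepts. The two origins are an assumption here, so take the exact values from the Configure CORS instructions. `build_cors_configuration` is a hypothetical helper name.

```python
import json


def build_cors_configuration(allowed_origins: list[str]) -> dict:
    """Build an S3 CORS configuration that lets the given origins read objects."""
    return {
        "CORSRules": [
            {
                "AllowedOrigins": allowed_origins,
                "AllowedMethods": ["GET"],  # read-only access is enough to display a PDF
                "AllowedHeaders": ["*"],
            }
        ]
    }


# Assumption: these origins must be checked against the Labelbox CORS instructions.
LABELBOX_ORIGINS = ["https://app.labelbox.com", "https://editor.labelbox.com"]

cors_json = json.dumps(build_cors_configuration(LABELBOX_ORIGINS), indent=2)
print(cors_json)
```

The printed JSON can be saved as `cors.json` and applied with `aws s3api put-bucket-cors --bucket testpdf7 --cors-configuration file://cors.json`, or pasted into the bucket's CORS settings in the S3 console.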
