NER project - text extraction - new line count

lucia.lam.ll1 · June 22, 2022, 5:22pm

Hi there,

Do you have example code for how best to extract the text from the document based on the start and end indexes? Specifically how many characters are new lines? Is that dependent on the operating system used to create the input txt file or operating system used to read in the txt file?

I had trouble extracting the text but then I noticed that newline might be counted as \r\n. Is this true?

However, even with that I am still having trouble accurately extracting the right text.

Thanks,
Lucia

rfekry · June 22, 2022, 9:45pm

Hello Lucia,

I’m Ramy from Labelbox Support. I will try to answer your questions below to the best of my ability. To better assist you, could you please clarify 1) where the text is coming from and 2) what the intention behind extracting this text is?

Answers to questions:

Specifically how many characters are new lines?
- Newline characters are generally denoted as \n, which counts as one character.
Is that dependent on the operating system used to create the input txt file or the operating system used to read in the txt file?
- The newline character may differ from one operating system to the next. What determines it is the operating system that created the text, not the one reading it. It is generally \n on Unix/macOS, \r on very old versions of Mac OS, and \r\n on Windows machines. Source: newline - What are the differences between char literals '\n' and '\r' in Java? - Stack Overflow
I had trouble extracting the text but then I noticed that newline might be counted as \r\n. Is this true?
- Yes, I believe this may be the case if you are using a Windows machine. In that case each newline counts as two characters.
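
As a rough illustration of the point about character counts, here is a minimal Python sketch. The sample string and file name are made up for the example. It writes a Windows-style file and reads it back twice: Python's default text mode translates \r\n to \n on read, while newline="" leaves the line endings untouched, so the two strings differ in length by one character per newline. That difference is enough to shift start and end indexes.

```python
import os
import tempfile

# Hypothetical sample: two lines separated by a Windows-style "\r\n".
raw = "Alice Smith\r\nworks at Acme"

# Write raw bytes so that no newline translation happens on write.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(raw.encode("utf-8"))

# Default text mode (newline=None) translates "\r\n" to "\n" on read.
with open(path, "r", encoding="utf-8") as f:
    translated = f.read()

# newline="" disables translation, so "\r\n" stays as two characters.
with open(path, "r", encoding="utf-8", newline="") as f:
    untranslated = f.read()

print(len(translated), len(untranslated))  # prints: 25 26
```

If the exported indexes were computed against the untranslated text, slicing the translated string would drift by one character for every newline before the label.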

I hope you found this helpful!
-Ramy, Labelbox Support.

lucia.lam.ll1 · June 23, 2022, 9:43am

Hi Ramy,

Thanks for the clarification.

The text was created by running OCR on PDF documents on a Windows machine using Python libraries.
Named entity objects were labeled on the text documents in Labelbox, the labels were then exported, and the objective here is to extract the labeled text.

I was able to correctly extract the labels by using the following for my text documents:
with open(test_file, 'r', encoding='utf-8', newline="\r\n") as f:
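
For anyone landing here later, a minimal sketch of how that open() call could be combined with slicing. The helper names, the sample text, and the offsets are hypothetical and not taken from a real Labelbox export. Whether the end index is inclusive is an assumption here, so it is worth checking against one label whose text you already know.

```python
import os
import tempfile


def read_with_crlf(path):
    # Same open() arguments as above; "\r\n" is returned untranslated.
    with open(path, "r", encoding="utf-8", newline="\r\n") as f:
        return f.read()


def extract_spans(text, spans, end_inclusive=True):
    # Assumption: offsets are character indexes into the untranslated text.
    # Assumption: end_inclusive should be verified against a known label.
    shift = 1 if end_inclusive else 0
    return [text[start:end + shift] for start, end in spans]


# Hypothetical document and offsets, for illustration only.
sample = "Alice Smith\r\nworks at Acme"
path = os.path.join(tempfile.mkdtemp(), "doc.txt")
with open(path, "wb") as f:
    f.write(sample.encode("utf-8"))

text = read_with_crlf(path)
labels = extract_spans(text, [(0, 10), (22, 25)])
print(labels)  # prints: ['Alice Smith', 'Acme']
```

Passing newline="" to open() should behave the same way for read(), since it also leaves line endings untranslated.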

Thanks for your help.
Cheers,
Lucia

rfekry · June 23, 2022, 7:33pm

I’m happy you were able to extract the text properly! Please let me know if you have any other questions, or reach out to Labelbox Support and we would be happy to assist you further!

-Ramy, Labelbox Support.

jerome.massot.78 · November 23, 2022, 12:49am

My 2 cents here: since Labelbox is doing the OCR to create the text layer, I think it would be easier for everyone if the text used by Labelbox during the OCR were exported as single-line text, with the entities' start and end indexes calculated accordingly…
