NER project - text extraction - new line count

Hi there,

Do you have example code for how best to extract the text from the document based on the start and end indexes? Specifically how many characters are new lines? Is that dependent on the operating system used to create the input txt file or operating system used to read in the txt file?

I had trouble extracting the text but then I noticed that newline might be counted as \r\n. Is this true?

However, even with that I am still having trouble accurately extracting the right text.

Thanks,
Lucia

Hello Lucia,

I’m Ramy from Labelbox Support. I will try to answer your questions below to the best of my ability. To better assist you, 1) could you please clarify where the text is coming from? and 2) what is the intention behind extracting this text?

Answers to questions:

  • Specifically how many characters are new lines?

    • Newline characters are generally written as \n, which counts as one character.
  • Is that dependent on the operating system used to create the input txt file or the operating system used to read in the txt file?

  • I had trouble extracting the text but then I noticed that newline might be counted as \r\n. Is this true?

    • Yes, I believe this may be the case if you are using a Windows machine.
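
To make the character counts concrete, here is a minimal Python sketch. It assumes the file is read with Python's built-in open(); the file name and contents are made up for illustration. With the default newline=None, Python translates "\r\n" to "\n" while reading on any operating system, so the count depends on the bytes stored in the file and on how the reader opens it, not on the machine doing the reading.

```python
import os
import tempfile

# Write a small file with Windows-style line endings.
# Binary mode is used so that nothing is translated on the way out.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("first line\r\nsecond line\r\n".encode("utf-8"))

# Default text mode (newline=None): "\r\n" is translated to "\n" on read.
with open(path, "r", encoding="utf-8") as f:
    translated = f.read()

# newline="": line endings are returned exactly as stored in the file.
with open(path, "r", encoding="utf-8", newline="") as f:
    raw = f.read()

print(len(translated))  # 23: each line break counts as one character
print(len(raw))         # 25: each line break counts as two characters
```

If the start and end indexes were computed against the raw text, slicing the translated text will drift by one character for every line break that precedes the label.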

I hope you found this helpful!
-Ramy, Labelbox Support.

Hi Ramy,

Thanks for the clarification.

  1. The text was created by running OCR on PDF documents on a Windows machine using Python libraries.
  2. Named entity objects were labeled on the text documents in Labelbox and the labels were then exported. The objective here is to extract the labeled text.

I was able to correctly extract the labels by using the following for my text documents:
with open(test_file, 'r', encoding='utf-8', newline="\r\n") as f:
    text = f.read()  # "\r\n" is kept as two characters, so the exported indexes line up
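
Once the text is loaded with its line endings intact, the extraction itself is a slice. The sketch below is illustrative only: the helper extract_span, the sample document, and the annotation dictionary are hypothetical, and it assumes the exported end index is inclusive, which should be checked against your own export.

```python
def extract_span(text, start, end, end_inclusive=True):
    """Return the characters of `text` between the offsets `start` and `end`."""
    # Assumption: `end` points at the last character of the label.
    # Pass end_inclusive=False if your export uses an exclusive end.
    stop = end + 1 if end_inclusive else end
    return text[start:stop]


# Hypothetical document and label, not taken from a real export.
document = "Name: Lucia\r\nCity: Lisbon\r\n"
annotation = {"start": 19, "end": 24}

# Offsets computed against the "\r\n" text give the expected label.
print(repr(extract_span(document, **annotation)))    # 'Lisbon'

# The same offsets against newline-translated text are off by one
# for every line break that comes before the label.
translated = document.replace("\r\n", "\n")
print(repr(extract_span(translated, **annotation)))  # 'isbon\n'
```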

Thanks for your help.
Cheers,
Lucia

I’m happy you were able to extract the text properly! Please let me know if you have any other questions, or reach out to Labelbox Support and we would be happy to assist you further!

-Ramy, Labelbox Support.

My 2 cents here: since Labelbox is doing the OCR to create the text layer, I think it would be easier for everyone if the text Labelbox uses during OCR were exported as single-line text, with the entity start and end calculated accordingly…
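
Until something like that exists, a similar effect can be approximated on the user side. The following is a rough sketch, not a Labelbox feature: the function flatten_with_offsets is hypothetical, it replaces every line break with a single space, and it assumes spans are (start, end) pairs with an inclusive end.

```python
def flatten_with_offsets(text, spans):
    """Replace each line break with one space and remap (start, end) spans."""
    index_map = []  # index_map[i] = position of original character i in the output
    out = []
    i = 0
    while i < len(text):
        if text.startswith("\r\n", i):
            # Both characters of "\r\n" map to the same single space.
            index_map.extend([len(out), len(out)])
            out.append(" ")
            i += 2
        elif text[i] in "\r\n":
            index_map.append(len(out))
            out.append(" ")
            i += 1
        else:
            index_map.append(len(out))
            out.append(text[i])
            i += 1
    remapped = [(index_map[start], index_map[end]) for start, end in spans]
    return "".join(out), remapped


# Hypothetical example: a label on "Paris" with an inclusive end offset.
text = "Jane Doe\r\nlives in Paris\r\n"
start = text.index("Paris")
flat, spans = flatten_with_offsets(text, [(start, start + len("Paris") - 1)])

new_start, new_end = spans[0]
print(flat[new_start:new_end + 1])  # Paris
```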