Do you have example code for how best to extract the text from the document based on the start and end indexes? Specifically how many characters are new lines? Is that dependent on the operating system used to create the input txt file or operating system used to read in the txt file?
I had trouble extracting the text but then I noticed that newline might be counted as \r\n. Is this true?
However, even with that I am still having trouble accurately extracting the right text.
I’m Ramy from Labelbox Support. I will try to answer your questions below to the best of my ability. To better assist you, could you please clarify 1) where the text is coming from, and 2) what the intention behind extracting this text is?
Answers to questions:
Specifically how many characters are new lines?
A newline is generally written as \n, which counts as one character. A Windows-style newline, \r\n, counts as two.
Is that dependent on the operating system used to create the input txt file or the operating system used to read in the txt file?
The newline character can differ from one operating system to the next. What determines it is the operating system that created the text, not the one reading it. It is generally \n on Unix and macOS, \r on very old versions of Mac OS, and \r\n on Windows. Source: newline - What are the differences between char literals '\n' and '\r' in Java? - Stack Overflow
I had trouble extracting the text but then I noticed that newline might be counted as \r\n. Is this true?
Yes, I believe this may be the case if you are using a Windows machine.
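To make the character counts above concrete, here is a minimal Python sketch. It assumes the file is read with Python's built-in open(); the sample text and the temporary file are made up for illustration. It shows that the default text mode translates \r\n to \n while reading, so the length of the string can differ from what the offsets were computed against, whereas newline="" leaves the line endings untouched.

```python
import os
import tempfile

# Write a two-line sample with Windows-style line endings as raw bytes.
raw = "first line\r\nsecond line".encode("utf-8")
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Default text mode: universal newlines translate "\r\n" to "\n".
with open(path, "r", encoding="utf-8") as f:
    translated = f.read()

# newline="" disables translation, so "\r\n" stays as two characters.
with open(path, "r", encoding="utf-8", newline="") as f:
    untranslated = f.read()

os.remove(path)

print(len(translated))    # 22
print(len(untranslated))  # 23
```

The one-character difference per line break is enough to shift every start and end index that comes after the first newline.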
I hope you found this helpful!
-Ramy, Labelbox Support.
The text was created by running OCR on PDF documents on a Windows machine using Python libraries.
Named entity objects were labeled on the text documents in Labelbox. The labels were then exported, and the objective here is to extract the labeled text.
I was able to correctly extract the labels by opening my text documents as follows: with open(test_file, 'r', encoding='utf-8', newline="\r\n") as f:
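For anyone landing here later, a rough sketch of the full extraction is below. It rests on a few assumptions that should be checked against your own export: that the start and end offsets count characters (not bytes), that the end index is inclusive, and that the offsets were computed on text with \r\n line endings. The helper names read_raw_text and extract_span, the sample text, and the offsets are hypothetical and not part of any Labelbox API. It uses newline="", which, like newline="\r\n", returns line endings untranslated.

```python
def read_raw_text(path, encoding="utf-8"):
    """Read a text file without newline translation so offsets line up."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


def extract_span(text, start, end, end_inclusive=True):
    """Return the labeled substring.

    end_inclusive=True is an assumption about the export format;
    set it to False if the end offset is exclusive.
    """
    stop = end + 1 if end_inclusive else end
    return text[start:stop]


# Made-up sample standing in for the contents of a labeled document.
text = "Invoice from ACME Corp\r\nTotal due: 42 USD"

# Hypothetical offsets for two labeled entities.
print(extract_span(text, 13, 21))  # ACME Corp
print(extract_span(text, 35, 40))  # 42 USD

# If "\r\n" had been translated to "\n", the second span shifts by one.
print(extract_span(text.replace("\r\n", "\n"), 35, 40))  # 2 USD
```

In practice, text would come from read_raw_text(test_file) rather than a literal string. If the extracted spans are off by a growing amount further down the document, newline translation is the likely cause.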
I’m happy you were able to extract the text properly! Please let me know if you have any other questions or please reach out to Labelbox Support and we would be happy to assist you further!
My 2 cents here: since Labelbox is doing the OCR to create the text layer, I think it would be easier for everyone if the text Labelbox uses during OCR were exported as single-line text, with the entity start and end offsets calculated accordingly…