Extract number from text using Python

Question

I have extracted text from an image using tesseract and now have the entire text. However, I only want to extract the company number, e.g. “123456”, which is a random 6-digit number. I want to save the file under this company number so that the files can be identified more easily.

My question: if I have text containing bytes and Unicode, what is the easiest way to extract this 6-digit number?

Answer 1

You can use regex:

import re

string = 'some random text with a 6-digit number 123456 somewhere'
res = re.findall(r'\b\d{6}\b', string)
print(res)

Output:

['123456']

Explanation:

\d{6} : match exactly 6 digits
\b : word boundary; ensures partial numbers are not matched, e.g. "123456" is not extracted from "1234567"
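
Since the question mentions bytes as well as Unicode and wants to use the number as a file name, here is a minimal sketch of how the same pattern could be wrapped. The function name find_company_number, the UTF-8 decoding, the ".txt" extension and the "unknown.txt" fallback are all assumptions for illustration, not part of the original answer.

```python
import re

def find_company_number(text):
    """Return the first standalone 6-digit number in text, or None."""
    # Assumption: bytes input is UTF-8; undecodable bytes are replaced, not raised.
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="replace")
    match = re.search(r"\b\d{6}\b", text)
    return match.group(0) if match else None

# Sample input is made up; real OCR output will differ.
sample = b"Invoice for company 123456, ref 1234567"
number = find_company_number(sample)

# Fall back to a placeholder name when no number is found.
filename = f"{number}.txt" if number else "unknown.txt"
print(filename)  # 123456.txt
```

re.search returns only the first match, which fits the case where the text contains a single company number. If several 6-digit numbers can appear, re.findall as shown above returns all of them and you would need some other rule to pick the right one.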

Answer 2

An example of the text you are trying to convert would help.

However, you could easily extract the subset of characters that match a given criterion:

s = 'abc12ÄÄ34$$'  # sample string
digits = [ch for ch in s if ch.isnumeric()]  # returns a list with only the numeric characters
digits = ''.join(digits)  # if you want a string
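
One caveat with the character filter is worth illustrating. It keeps every numeric character and joins them together, so it only yields the 6-digit number if the text contains no other digits. A small sketch with made-up sample strings:

```python
s = "abc12ÄÄ34$$"
digits = "".join(ch for ch in s if ch.isnumeric())
print(digits)  # 1234

# With other digits in the text, everything is merged into one string.
mixed = "order 42 for company 123456"
merged = "".join(ch for ch in mixed if ch.isnumeric())
print(merged)  # 42123456

# str.isnumeric() also accepts characters such as '½'; str.isdigit() does not.
print("½".isnumeric(), "½".isdigit())  # True False
```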
