
Extract number from text using Python

I have extracted text from an image using Tesseract and now have the full text. However, I only want to extract the company number (for example “123456”), which is an arbitrary 6-digit number. I want to use it as the file name when saving, so that the files can be identified more easily.

My question: if I have text containing bytes and Unicode, what is the easiest way to extract this 6-digit number?

You can use regex:

import re

string = 'some random text with a 6-digit number 123456 somewhere'
res = re.findall(r'\b\d{6}\b', string)
print(res)

Output:

['123456']

Explanation:

  • \d{6} : match exactly 6 digits
  • \b : word boundary, so that parts of longer numbers are not matched, e.g. "123456" is not extracted from "1234567"
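
Since the question mentions text that may be bytes or unicode, here is a minimal sketch that wraps the same pattern in a helper. The function name extract_company_number and the UTF-8 decoding are assumptions for illustration, not something the question specifies; adjust the encoding to whatever your OCR step actually produces.

```python
import re

def extract_company_number(text):
    """Return the first standalone 6-digit number in text, or None."""
    if isinstance(text, bytes):
        # Assumption: the bytes are UTF-8; undecodable bytes are replaced
        # rather than raising an error.
        text = text.decode("utf-8", errors="replace")
    # Same pattern as above: exactly 6 digits between word boundaries.
    match = re.search(r"\b\d{6}\b", text)
    return match.group(0) if match else None

number = extract_company_number(b"Company no. 123456 Ltd")
```

If the text may contain several 6-digit numbers, re.findall returns all of them, and you would then need some other rule to decide which one is the company number.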

An example of the text you are trying to convert would help.

However, you could easily extract the subset of characters that match a given criterion:

s = 'abc12ÄÄ34$$'  # sample string
digits = [ch for ch in s if ch.isnumeric()]  # returns a list with only the numeric characters
digits = ''.join(digits)  # if you want a string
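
Note that str.isnumeric() also accepts characters such as '½' and '²', which OCR output can contain. A small sketch of a stricter filter that keeps only the ASCII digits 0-9 (the sample string here is made up for illustration):

```python
s = 'abc12ÄÄ34$$½²'  # hypothetical sample with non-ASCII numeric characters

# isnumeric() keeps fractions and superscripts as well as plain digits.
numeric_chars = ''.join(ch for ch in s if ch.isnumeric())

# Restricting to an explicit set keeps only the ASCII digits 0-9.
ascii_digits = ''.join(ch for ch in s if ch in '0123456789')
```

Either way, this approach concatenates every digit in the text, so it only yields the company number if that is the only number present.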
