
Function to extract company register number from text string using Regex

I have a function that extracts the company register number (German: Handelsregisternummer) from a given text. Although my regex matches the correct format (please see the demo), I cannot extract the complete company register number.

I want to extract HRB 142663 B, but I get HRB 142663.

Most numbers are in the format HRB 123456 but sometimes there is the letter B attached to the end.

import re

def get_handelsregisternummer(string, keyword):

    # https://regex101.com/r/k6AGmq/10
    reg_1 = fr'\b{keyword}[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*)(?: B)?'

    match = re.compile(reg_1)
    handelsregisternummer = match.findall(string)  # list of matched words

    if handelsregisternummer:  # not empty
        return handelsregisternummer[0]
    else:  # no match found
        handelsregisternummer = ""

    return handelsregisternummer

Example text scraped from a website. Because line breaks were lost during scraping, adjacent words run together:

text_impressum = """Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"""

Apply function:

for keyword in ['HRB', 'HRA', 'HR B', 'HR A']:
    handelsregisternummer = get_handelsregisternummer(text_impressum, keyword=keyword)
    if handelsregisternummer:  # a non-empty string means a number was found
        handelsregisternummer = keyword + " " + handelsregisternummer
        break
    if not handelsregisternummer:  # empty string: nothing found for this keyword
        handelsregisternummer = 'not specified'
handelsregisternummer_dict = {'handelsregisternummer': handelsregisternummer}

Afterwards I get:

handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663'}

But I want this:

handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663 B'}

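When a pattern contains capturing groups, re.findall returns only the text captured by those groups. In your pattern the optional B sits outside the only capturing group, so it is matched but never returned. Use two capturing groups, one for the keyword and one for the number including the optional B, and just match the rest: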

reg_1 = fr'\b({keyword})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
#            |_________|                                   |___________________|
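To make the difference visible, here is a minimal sketch with simplified patterns (they are shortened for illustration and are not the full regex above). It shows how the result of findall changes with the number and position of capturing groups:

```python
import re

text = "Berlin, HRB 142663 BVAT-ID.: DE283580648"

# One capturing group: findall returns only that group's text,
# so the optional " B" outside the group is matched but dropped.
one_group = re.findall(r'\bHRB\s?(\d+)(?: B)?', text)
print(one_group)   # ['142663']

# Two capturing groups, with the optional " B" inside the second one:
# findall returns one tuple per match, holding both groups.
two_groups = re.findall(r'\b(HRB)\s?(\d+(?: B)?)', text)
print(two_groups)  # [('HRB', '142663 B')]
```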

With two capturing groups, findall returns a list of tuples, so join the groups of the first match with a space:

if handelsregisternummer: # if list is not empty anymore, then do...
    handelsregisternummer = " ".join(handelsregisternummer)
    break

See the Python demo.
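Putting both changes together, a self-contained version could look roughly like the sketch below. The function name find_register_number and its keywords parameter are made up for this example, and it uses re.search with match.groups() instead of findall, which gives the same result when only the first match is needed. re.escape is added on the assumption that the keywords should always be treated as literal text.

```python
import re

def find_register_number(string, keywords=('HRB', 'HRA', 'HR B', 'HR A')):
    """Return the first register number found, or 'not specified'.

    Hypothetical helper: the name and the keywords parameter are
    assumptions for this sketch, not part of the original code.
    """
    for keyword in keywords:
        pattern = re.compile(
            fr'\b({re.escape(keyword)})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?'
            fr'\s?(\d+(?: \d+)*(?: B)?)'
        )
        match = pattern.search(string)
        if match:
            # Group 1 is the keyword, group 2 the number plus optional " B".
            return " ".join(match.groups())
    return 'not specified'

text_impressum = "Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"
result = {'handelsregisternummer': find_register_number(text_impressum)}
print(result)  # {'handelsregisternummer': 'HRB 142663 B'}
```

One caveat: because the scraped words run together, the pattern cannot require a word boundary after the B, so a text such as "HRB 12345 Berlin" would also yield "HRB 12345 B". If that matters for your data, you would need an extra check on what follows the B.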
