
Python regex is not extracting needed data

I am new to Python and regex. I am trying to write an expression that extracts an integer or floating-point number along with its unit (KG / KILOGRAMS) from the following text.

Data:
adfa0.4 KG, ACD* $ am ----------> Ans expected is -> 0.4 KG 
$#@+0.4 KG, ACD* $ am ----------> Ans expected is -> +0.4 KG
fdafa+000.4 KG, ACD* $ am ----------> Ans expected is -> +000.4 KG
ased+00.400 KG, ACD* $ amf ----------> Ans expected is -> +00.400 KG
a1 KG, QD ----------> Ans expected is -> 1 KG
0.4 KG, ACD* $ am ----------> Ans expected is -> 0.4 KG
+0.4 KG, ACD* $ am ----------> Ans expected is -> +0.4 KG
+000.4 KG, ACD* $ am ----------> Ans expected is -> +000.4 KG
+00.400 KG, ACD* $ am ----------> Ans expected is -> +00.400 KG
1 KG, QD ----------> Ans expected is -> 1 KG
1.2 KG, UNK ----------> Ans expected is -> +1.2 KG
1/0.5 KG BID ----------> Ans expected is -> 0.5 KG
10-325KG ----------> Ans expected is -> 325 KG
150KG PER DAY ----------> Ans expected is -> 150 KG
15 KILLOGRAM ----------> Ans expected is -> 15 KG (KILLOGRAM must be changed to KG)
15KILLOGRAM ----------> Ans expected is -> 15 KG (KILLOGRAM must be changed to KG)
-15KILLOGRAM ----------> Ans expected is -> -15 KG (KILLOGRAM must be changed to KG)

I tried with findall() using [-+]?\\d*\\.\\d+|\\d+\\s\\w+ , but it is not giving the desired results.

Try this. Replace the data variable with your own strings. I tried it on some of the strings you gave and it worked.

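import re

data = '150KG PER DAY'
# data = '-0.15KILLOGRAM'

# Capture the first run of sign, digit, and dot characters.
# Note: this assumes the first number in the string is the one you want.
p = r'([-+.\d]+)'

value = re.search(p, data).group(1)

final = value + ' ' + 'KG'

print(final)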

Try \\d+\\.?\\d*\\s*KG. Note the use of \\s* so that the pattern matches whether or not there is a space between the quantity and the unit (is there always a space between them?).

You may try the following regex pattern

[+-]?\d+\.?\d*\s?[a-zA-Z]+

Some notes on the syntax:

\.? matches an optional literal '.' character

. (unescaped) matches any character except line terminators

\w matches any word character (equivalent to [a-zA-Z0-9_] for ASCII text)

| separates alternatives; wrap them in a group, as in (a|b), unless the alternation is meant to span the whole pattern
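A quick way to check notes like these is to run them. The sketch below uses made-up sample strings (they are not from the question) to show the difference between an unescaped dot, an escaped dot, an optional dot, and \w.

```python
import re

# An unescaped '.' matches any character except a newline.
any_char = re.findall(r"0.4", "0.4 0x4 004")
# An escaped '\.' matches only a literal dot.
literal_dot = re.findall(r"0\.4", "0.4 0x4 004")
# '\.?' makes the dot optional, so integers and decimals both match.
optional_dot = re.findall(r"\d+\.?\d*", "1 KG and 0.4 KG")
# '\w' also matches digits and '_', so '\w+' is broader than '[a-zA-Z]+'.
word_chars = re.findall(r"\w+", "15KG_x")

print(any_char)      # ['0.4', '0x4', '004']
print(literal_dot)   # ['0.4']
print(optional_dot)  # ['1', '0.4']
print(word_chars)    # ['15KG_x']
```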

In a regular expression with alternation, like A|B, the engine tries A first at each position; if A matches there, it never tries B. Your first problem is thus that you need to switch the order of the alternatives, so that a match with a unit is preferred over one without.
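To see the ordering effect in isolation, here is a minimal sketch with two deliberately simplified alternatives (they are not the patterns from the question).

```python
import re

text = "15 KG"

# The shorter alternative comes first, so the engine stops at "15".
number_first = re.search(r"\d+|\d+ KG", text).group()
# Putting the longer, more specific alternative first lets it win.
unit_first = re.search(r"\d+ KG|\d+", text).group()

print(number_first)  # 15
print(unit_first)    # 15 KG
```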

The next problem is that you are not including the optional sign and decimal point in the expression which matches a number with a unit after it.

You have some test cases where there is no whitespace before the unit, but your regex doesn't allow for that.

The final problem is that you want "kilogram" (even misspelled!) to be mapped to a normalized unit. Regex can't do that, but you can add some code to achieve that.
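The points about the sign, the decimal part, and the missing whitespace can be sketched in one small pattern. This is only an illustration that assumes the unit is spelled KG; the sample strings are taken from the question.

```python
import re

# Sign and fraction are optional; \s* allows zero or more spaces before the unit.
pattern = re.compile(r"[-+]?\d+(?:\.\d+)?\s*KG")

samples = ["+00.400 KG, ACD", "150KG PER DAY", "1/0.5 KG BID"]
matches = [pattern.search(s).group() for s in samples]

print(matches)  # ['+00.400 KG', '150KG', '0.5 KG']
```

In the last sample the leading "1" is skipped because no unit follows it directly, so the search moves on to "0.5 KG".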

We can refactor your regex into one which simply makes the unit expression optional, and captures the parts into named groups.

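import re

data = '15KILLOGRAM'  # or any of your input strings
r = re.compile(r'(?P<num>[-+]?\d+(?:\.\d+)?)\s?(?P<unit>\w+)?')
for match in r.finditer(data):
    d = match.groupdict()
    if d['unit'] is None:
        continue  # a number with no unit after it, e.g. the "1" in "1/0.5 KG"
    if d['unit'].lower() in ['kilogram', 'killogram']:
        d['unit'] = 'KG'
    print(d['num'] + ' ' + d['unit'])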

To make this explicit, (?P<name>...) captures the matching string into a group called name. The function match.groupdict() returns a dictionary of these named capture groups, where the key is the group's name and the value is the captured string.
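As a small illustration of what groupdict() returns, the sketch below runs a pattern with two named groups on two of the question's strings. Restricting the unit group to letters here is a choice made for the example. An optional group that did not take part in the match comes back as None.

```python
import re

pattern = re.compile(r"(?P<num>[-+]?\d+(?:\.\d+)?)\s?(?P<unit>[A-Za-z]+)?")

with_unit = pattern.search("-15KILLOGRAM").groupdict()
# The first number in this string ("1") has no unit directly after it.
without_unit = pattern.search("1/0.5 KG").groupdict()

print(with_unit)     # {'num': '-15', 'unit': 'KILLOGRAM'}
print(without_unit)  # {'num': '1', 'unit': None}
```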

You can use regex as follows:

import re

data = """
adfa0.4 KG, ACD* $ am 
$#@+0.4 KG, ACD* $ am 
fdafa+000.4 KG, ACD* $ am 
ased+00.400 KG, ACD* $ amf 
a1 KG, QD
0.4 KG, ACD* $ am 
+0.4 KG, ACD* $ am 
+000.4 KG, ACD* $ am 
+00.400 KG, ACD* $ am 
1 KG, QD 
1.2 KG, UNK 
1/0.5 KG BID  
10-325KG 
150KG PER DAY 
15 KILLOGRAM
15KILLOGRAM
-15KILLOGRAM
"""
res = []
unit_kg = "KG"
# Capture the signed number; the unit (KG or the misspelled KILLOGRAM) is
# matched but not captured, so findall() returns only the numbers.
pattern = r"([-+]?\d+(?:\.\d+)?)\s?(?:KG|KILLOGRAM)"
for num in re.findall(pattern, data):
    # Rebuild each result with one space and the normalized unit.
    res.append(num + " " + unit_kg)

print(res)

output:

['0.4 KG', '+0.4 KG', '+000.4 KG', '+00.400 KG', '1 KG', '0.4 KG', '+0.4 KG', '+000.4 KG', '+00.400 KG', '1 KG', '1.2 KG', '0.5 KG', '-325 KG', '150 KG', '15 KG', '15 KG', '-15 KG']

Note that 10-325KG yields -325 KG here, because the hyphen is read as a minus sign.
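If a hyphen that directly follows a digit should be treated as a separator rather than a minus sign (as the expected answer for 10-325KG suggests), a negative lookbehind is one possible approach. The following is a sketch under that assumption; the helper name extract_weights and the list of accepted unit spellings are hypothetical and not part of any answer above.

```python
import re

# (?<!\d) rejects a match that starts right after a digit, so the '-' in
# "10-325KG" is not read as a sign. The unit spellings are assumptions.
PATTERN = re.compile(
    r"(?<!\d)([-+]?\d+(?:\.\d+)?)\s*(?:KILLOGRAMS?|KILOGRAMS?|KG)",
    re.IGNORECASE,
)

def extract_weights(text):
    """Return every number followed by a weight unit, normalized to '<num> KG'."""
    return [num + " KG" for num in PATTERN.findall(text)]

print(extract_weights("10-325KG"))      # ['325 KG']
print(extract_weights("-15KILLOGRAM"))  # ['-15 KG']
print(extract_weights("1/0.5 KG BID"))  # ['0.5 KG']
```

A leading sign is still kept when it follows a non-digit or starts the string, which is why -15KILLOGRAM keeps its minus.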
