I have a pandas Series which can be produced by the code below:
Input:
import pandas as pd

l = ['abcd 1942 Lmauu 40% 70cl',
'something again something 1.5 L',
'some other stuff 45% 70 CL',
'not the exact data 3LTR',
'abcd 100Ltud 6%(8)500ML',
'cdef 6%(8)500 ml',
'a packet 24 x 27.5 cl ( PET )']
ser = pd.Series(l)
Problem statement and expected output:
I am trying to extract the volumes from the series and convert them into a dataframe, with the volume in one column and the unit of measure in the other. The expected output can be reproduced using the code below:
d = {0: {0: '70',
1: '1.5',
2: '70',
3: '3',
4: '500',
5: '500',
6: '27.5'},
1: {0: 'cl', 1: 'L', 2: 'CL', 3: 'LTR', 4: 'ML', 5: 'ml', 6: 'cl'}}
expected_output = pd.DataFrame(d)
0 1
0 70 cl
1 1.5 L
2 70 CL
3 3 LTR
4 500 ML
5 500 ml
6 27.5 cl
My attempt
Here is what I have tried. I have come very close to what I want, but not quite: as you can see, I don't get the last volume. I think this is because I included $ in my regex, but without it I could not parse the volume correctly. For the string 'abcd 1942 Lmauu 40% 70cl', for example, 1942 L would have been returned. I would also like the unit of measure to appear only in the second column, not in the first as in my output, but that is secondary.
print(ser.str.extract(r'((?i)([\d]+?[.])?\d+?[\s+]?(cl$|ml$|ltr$|L$)(?:$))').iloc[:,[0,-1]])
0 2
0 70cl cl
1 1.5 L L
2 70 CL CL
3 3LTR LTR
4 500ML ML
5 500 ml ml
6 NaN NaN
Please suggest what I should do here.
You may use
r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b'
Details
(?i) - case insensitive mode on
\b - a word boundary
(\d+(?:\.\d+)?) - Capturing group 1: one or more digits followed by an optional sequence of a dot and one or more digits
\s* - 0+ whitespaces
(cl|ml|ltr|L) - Capturing group 2: cl, ml, ltr or L (mind the case insensitive matching)
\b - a word boundary
Test:
>>> ser.str.extract(r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b', expand=True)
0 1
0 70 cl
1 1.5 L
2 70 CL
3 3 LTR
4 500 ML
5 500 ml
6 27.5 cl
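The breakdown above can also be checked without pandas. The following is a minimal sketch using only the standard re module. It assumes the same seven strings as in the question and passes re.IGNORECASE instead of the inline (?i), which should behave the same way here. The names PATTERN, samples and extract_volume are illustrative and not part of the original answer.

```python
import re

# Same pattern as above, with the case-insensitive flag passed separately.
PATTERN = re.compile(r'\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b', re.IGNORECASE)

samples = [
    'abcd 1942 Lmauu 40% 70cl',
    'something again something 1.5 L',
    'some other stuff 45% 70 CL',
    'not the exact data 3LTR',
    'abcd 100Ltud 6%(8)500ML',
    'cdef 6%(8)500 ml',
    'a packet 24 x 27.5 cl ( PET )',
]


def extract_volume(text):
    """Return (amount, unit) for the first volume found, or None."""
    match = PATTERN.search(text)
    return match.groups() if match else None


volumes = [extract_volume(s) for s in samples]
for text, volume in zip(samples, volumes):
    print(f'{text!r} -> {volume}')
```

The trailing \b is what rejects "1942 L" in the first string: the L is followed by another word character (m), so there is no word boundary there and the search moves on to 70cl.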
It is better to use named capturing groups, so that the result columns have meaningful names.
I also simplified your regex a bit and converted the units of measure to lower case.
So change your code to:
res = ser.str.extract(r'(?i)(?P<Amount>\d+(?:\.\d+)?)\s?(?P<Unit>[CM]?L|LTR)\b')
res.Unit = res.Unit.str.lower()
The result is:
Amount Unit
0 70 cl
1 1.5 l
2 70 cl
3 3 ltr
4 500 ml
5 500 ml
6 27.5 cl
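For completeness, here is a self-contained sketch of the same approach, assuming the series from the question. Two things differ from the snippet above and are my own choices rather than part of the answer: the case-insensitive flag is passed through the flags argument of str.extract instead of an inline (?i), and the Amount column is converted to numbers with pd.to_numeric, which is only useful if you intend to do arithmetic on the volumes.

```python
import re

import pandas as pd

ser = pd.Series([
    'abcd 1942 Lmauu 40% 70cl',
    'something again something 1.5 L',
    'some other stuff 45% 70 CL',
    'not the exact data 3LTR',
    'abcd 100Ltud 6%(8)500ML',
    'cdef 6%(8)500 ml',
    'a packet 24 x 27.5 cl ( PET )',
])

# Named groups become the column names of the resulting DataFrame.
pattern = r'(?P<Amount>\d+(?:\.\d+)?)\s?(?P<Unit>[CM]?L|LTR)\b'
res = ser.str.extract(pattern, flags=re.IGNORECASE)

# Normalise the units and (optional extra step) make the amounts numeric.
res['Unit'] = res['Unit'].str.lower()
res['Amount'] = pd.to_numeric(res['Amount'])
print(res)
```

Passing the flag separately also avoids depending on where the inline (?i) sits in the pattern. Recent Python versions reject an inline global flag that is not at the very start of the expression, which is the position it had in the pattern from the question.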
Note also that $ in (cl$|ml$|ltr$|L$) is wrong, because in at least one case you have additional text after the unit of measure.
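To make the difference concrete, here is a small sketch using the standard re module that compares three variants on two of the strings from the question: one anchored with $, one with no anchor at all, and one using word boundaries. The variable names are hypothetical and the patterns are simplified versions of the ones discussed, not the exact pattern from the question.

```python
import re

first = 'abcd 1942 Lmauu 40% 70cl'
last = 'a packet 24 x 27.5 cl ( PET )'

number = r'(\d+(?:\.\d+)?)'
units = r'(cl|ml|ltr|L)'

# Variant 1: the unit must be the last thing in the string.
anchored = re.compile(number + r'\s*' + units + r'$', re.IGNORECASE)
# Variant 2: no anchor at all.
unanchored = re.compile(number + r'\s*' + units, re.IGNORECASE)
# Variant 3: word boundaries around the whole volume.
bounded = re.compile(r'\b' + number + r'\s*' + units + r'\b', re.IGNORECASE)

# "$" fails on the last string because "( PET )" follows the unit.
anchored_last = anchored.search(last)

# Without any anchor, "1942 L" is picked up from "1942 Lmauu".
unanchored_first = unanchored.search(first).groups()

# Word boundaries handle both strings.
bounded_first = bounded.search(first).groups()
bounded_last = bounded.search(last).groups()

print(anchored_last, unanchored_first, bounded_first, bounded_last)
```

A word boundary only requires that the unit is not immediately followed by another letter, digit or underscore, so it tolerates trailing text while still rejecting the L in Lmauu.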