Extract volume information from a pandas Series - Pandas, Regex

Question

I have a pandas Series that can be produced by the code below:

Input:

import pandas as pd

l = ['abcd 1942 Lmauu 40% 70cl',
    'something again something   1.5 L',
    'some other stuff 45% 70 CL',
    'not the exact data      3LTR',
    'abcd 100Ltud 6%(8)500ML',
    'cdef  6%(8)500 ml',
    'a packet 24 x 27.5 cl (  PET )']
ser = pd.Series(l)

Problem Statement and expected output:

I am trying to extract the volumes from the series into a DataFrame, with the volume in one column and the unit of measure in the other. The expected output can be reproduced with the code below:

d = {0: {0: '70',
     1: '1.5',
     2: '70',
     3: '3',
     4: '500',
     5: '500',
     6: '27.5'},
     1: {0: 'cl', 1: 'L', 2: 'CL', 3: 'LTR', 4: 'ML', 5: 'ml', 6: 'cl'}}
expected_output = pd.DataFrame(d)

      0    1
0    70   cl
1   1.5    L
2    70   CL
3     3  LTR
4   500   ML
5   500   ml
6  27.5   cl

My attempt

Here is what I have tried. I have come very close to what I want, but not quite: as you can see, I don't get the last volume. I think this is because I included $ in my regex, but without it I could not parse the volume correctly. For example, in the string abcd 1942 Lmauu 40% 70cl, 1942 L would have been returned. I also want the unit of measure only in the second column, not in the first as shown in my output, but that is secondary.

print(ser.str.extract(r'((?i)([\d]+?[.])?\d+?[\s+]?(cl$|ml$|ltr$|L$)(?:$))').iloc[:,[0,-1]]) 

        0    2
0    70cl   cl
1   1.5 L    L
2   70 CL   CL
3    3LTR  LTR
4   500ML   ML
5  500 ml   ml
6     NaN  NaN

Please suggest what I should do here.

Answer 1

You may use

r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b'

See the regex demo.

Details

(?i) - case insensitive mode on
\b - a word boundary
(\d+(?:\.\d+)?) - Capturing group 1: one or more digits, optionally followed by a dot and one or more digits
\s* - zero or more whitespace characters
(cl|ml|ltr|L) - Capturing group 2: cl, ml, ltr or L (mind the case-insensitive matching)
\b - a word boundary

Test:

>>> ser.str.extract(r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b', expand=True)
      0    1
0  70    cl 
1  1.5   L  
2  70    CL 
3  3     LTR
4  500   ML 
5  500   ml 
6  27.5  cl
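
To see what the trailing \b contributes, here is a minimal sketch using the standard re module on the first sample string. The variable names are illustrative only; the pattern is the one from this answer, compared with a copy that drops the final word boundary.

```python
import re

s = 'abcd 1942 Lmauu 40% 70cl'

# Pattern from the answer, with the trailing word boundary.
with_boundary = re.compile(r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b')
# Same pattern without the trailing \b, for comparison only.
without_boundary = re.compile(r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)')

first = with_boundary.search(s).groups()
second = without_boundary.search(s).groups()

print(first)   # ('70', 'cl')
print(second)  # ('1942', 'L')
```

Without the boundary, the L at the start of "Lmauu" is accepted as a unit, which is the false match described in the question. With it, the L is rejected because a word character follows it.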

Answer 2

It is better to use named capturing groups, so that result columns have meaningful names.

I also simplified your regex a bit and converted the units of measure to lower case.

So change your code to:

res = ser.str.extract(r'(?i)(?P<Amount>\d+(?:\.\d+)?)\s?(?P<Unit>[CM]?L|LTR)\b')
res.Unit = res.Unit.str.lower()

The result is:

  Amount Unit
0     70   cl
1    1.5    l
2     70   cl
3      3  ltr
4    500   ml
5    500   ml
6   27.5   cl
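
As a rough illustration of how the named groups turn into labels, the sketch below applies the same pattern with the standard re module instead of pandas. The samples list and the rows variable are assumptions made for this example; groupdict() returns keys that match the group names, in the same way that str.extract takes its column names from them.

```python
import re

# Pattern from this answer, with named groups Amount and Unit.
pattern = re.compile(r'(?i)(?P<Amount>\d+(?:\.\d+)?)\s?(?P<Unit>[CM]?L|LTR)\b')

# Two strings taken from the question's series.
samples = ['not the exact data      3LTR', 'abcd 100Ltud 6%(8)500ML']

rows = []
for text in samples:
    row = pattern.search(text).groupdict()
    row['Unit'] = row['Unit'].lower()  # normalise the unit, as in the answer
    rows.append(row)

print(rows)
# [{'Amount': '3', 'Unit': 'ltr'}, {'Amount': '500', 'Unit': 'ml'}]
```

In 3LTR the first alternative [CM]?L matches the L but then fails the \b check, so the engine falls back to LTR. In 100Ltud neither alternative is followed by a word boundary, so that number is skipped and 500ML is found instead.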

Note also that $ in (cl$|ml$|ltr$|L$) is wrong, because in at least one case there is additional text after the unit of measure.
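
The effect of the $ anchor can be checked on the last string of the series. This is a small sketch with the standard re module; the anchored pattern is a simplified stand-in for the question's attempt, not the exact original.

```python
import re

last = 'a packet 24 x 27.5 cl (  PET )'

# Simplified stand-in for the question's attempt: unit must end the string.
anchored = re.compile(r'(?i)(\d+(?:\.\d+)?)\s?(cl|ml|ltr|L)$')
# Pattern from this answer: unit only has to end at a word boundary.
bounded = re.compile(r'(?i)(\d+(?:\.\d+)?)\s?([CM]?L|LTR)\b')

anchored_match = anchored.search(last)
bounded_groups = bounded.search(last).groups()

print(anchored_match)   # None
print(bounded_groups)   # ('27.5', 'cl')
```

Because the string ends with "(  PET )", the anchored version finds nothing, which corresponds to the NaN row in the question's output.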

solution1
2 ACCEPTED 2020-04-05 10:37:23

solution2
1 2020-04-05 10:54:00