How to retrieve numeric data from a string in pandas?

Df:

Id   Product
 1   Milk1256 Pack 10x3
 2   Cleaner#45 Pack 13x4s
 3   Milk 45 Small 1X30m 
 4   Cleaner #1379 Small 75s
 5   Cleaner Small 4.45M

I need to create a new column based on the Product column. If the product contains a pattern such as 10x3, the new column should hold the product of the two numbers (30); otherwise it should hold the number that carries a unit such as s or m.

Df_Output:

 Id   Product                Vol
 1   Milk1256 Pack 10x3       30
 2   Cleaner#45 Pack 13x4s    52
 3   Milk 45 Small 1X30m      30
 4   Cleaner #1379 Small 75s  75
 5   Cleaner Small 4.45M      4.45

Use str.extract with a regex to capture the last number(s), fill the missing first number with 1 (fillna) when there is only one number, and take the row-wise product:

regex = r'(?:(\d+)[Xx])?(\d+)\D*$'
df['Vol'] = (df['Product'].str.extract(regex)
             .fillna(1).astype(float)
             .prod(axis=1)
             )

Output (row 5 is shown here as "Cleaner Small 45M"; this integer-only regex does not handle decimals):

   Id                  Product  Vol
0   1       Milk1256 Pack 10x3   30
1   2    Cleaner#45 Pack 13x4s   52
2   3      Milk 45 Small 1X30m   30
3   4  Cleaner #1379 Small 75s   75
4   5        Cleaner Small 45M   45

How the regex works:

(?:(\d+)[Xx])?  # optionally capture a number followed by "x" or "X"
(\d+)           # capture last number
\D*$            # anything not digits at the end of the string
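To make the breakdown above concrete, here is a small sketch that prints what the two groups capture for each product in the question. It uses the standard-library re module rather than pandas, and the helper name captured_groups is only illustrative.

```python
import re

# Integer-only pattern from the answer above.
regex = re.compile(r'(?:(\d+)[Xx])?(\d+)\D*$')

products = [
    'Milk1256 Pack 10x3',
    'Cleaner#45 Pack 13x4s',
    'Milk 45 Small 1X30m',
    'Cleaner #1379 Small 75s',
    'Cleaner Small 4.45M',
]

def captured_groups(text):
    """Return both captured groups; the first is None when there is no 'AxB' part."""
    match = regex.search(text)
    return match.groups() if match else (None, None)

groups = [captured_groups(p) for p in products]
for product, pair in zip(products, groups):
    print(f'{product!r:28} -> {pair}')
```

A None in the first slot is what str.extract reports as NaN, which is why the pipeline fills it with 1 before multiplying. The last row also shows that this pattern reads "4.45M" as 45, which is the case the decimal variant addresses.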

Matching decimal numbers:

regex = r'(?:(\d+(?:\.\d+)?)[Xx])?(\d+(?:\.\d+)?)\D*$'

Matching the AxB or A{M,m,s} (explicit units) format:

regex = r'(\d+)[Xx](\d+)|(\d+(?=[Mms]\b))'
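As a rough illustration of how the explicit-units pattern behaves, the sketch below applies it with the standard-library re module and multiplies whichever groups matched. The helper name volume is hypothetical, and returning None for a string with no match is a choice made for this sketch only.

```python
import re
from math import prod

# Explicit-units pattern: either "AxB", or a number directly followed by M, m or s.
regex = re.compile(r'(\d+)[Xx](\d+)|(\d+(?=[Mms]\b))')

def volume(text):
    """Multiply the groups that took part in the match; None if nothing matched."""
    match = regex.search(text)
    if match is None:
        return None
    return prod(float(g) for g in match.groups() if g is not None)

products = [
    'Milk1256 Pack 10x3',
    'Cleaner#45 Pack 13x4s',
    'Milk 45 Small 1X30m',
    'Cleaner #1379 Small 75s',
    'Cleaner Small 4.45M',
]
volumes = [volume(p) for p in products]
print(volumes)
print(volume('Milk 45 Small'))  # no "AxB" and no unit, so no match
```

Two things are worth noting. This pattern has no decimal handling, so "4.45M" contributes only 45. Also, with the fillna(1) and prod(axis=1) pipeline, a row with no match at all would come out as 1 rather than as a missing value.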

Example with the decimal numbers regex:

   Id                  Product    Vol
0   1    Milk1256 Pack 10x3.33  33.30
1   2    Cleaner#45 Pack 13x4s  52.00
2   3    Milk 45 Small 1.5X30m  45.00
3   4  Cleaner #1379 Small 75s  75.00
4   5      Cleaner Small 4.45M   4.45
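The table above can be reproduced with a self-contained snippet along these lines. One deliberate difference from the earlier code: this sketch converts to float before filling missing values, on the assumption that this order is less sensitive to the dtype that str.extract returns in a given pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5],
    'Product': [
        'Milk1256 Pack 10x3.33',
        'Cleaner#45 Pack 13x4s',
        'Milk 45 Small 1.5X30m',
        'Cleaner #1379 Small 75s',
        'Cleaner Small 4.45M',
    ],
})

# Decimal-aware pattern: each number may carry an optional fractional part.
regex = r'(?:(\d+(?:\.\d+)?)[Xx])?(\d+(?:\.\d+)?)\D*$'

df['Vol'] = (df['Product'].str.extract(regex)
             .astype(float)   # captured strings -> floats, missing group -> NaN
             .fillna(1.0)     # a missing first factor counts as 1
             .prod(axis=1)    # row-wise product of the two groups
             )
print(df)
```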

Another solution, using a traditional for loop:

import string

import pandas as pd

data = [['1', 'Milk1256 Pack 4.4x3'], ['2', 'Cleaner#45 Pack 13x4s']]

def find_value(product):
    # every ASCII letter except x/X, which is kept as the multiplication sign
    letters = set(string.ascii_letters) - {'x', 'X'}
    # assumes the quantity is the third space-separated token of the string
    # (true for the two sample rows above)
    token = product.split(' ')[2]
    # drop unit letters such as s or m, keeping digits, '.' and x
    cleaned = ''.join(ch for ch in token if ch not in letters)

    a = ''
    b = ''
    seen_x = False
    # collect the two values to be multiplied
    for ch in cleaned:
        if ch in 'xX':
            seen_x = True
        elif not seen_x:
            a += ch
        else:
            b += ch
    # if there is no second number, multiply by 1
    if b == '':
        b = '1'
    return float(a) * float(b)


df = pd.DataFrame(data, columns=['id', 'product'])
df['value'] = df['product'].apply(find_value)
print(df)

Output:

  id                product  value
0  1    Milk1256 Pack 4.4x3   13.2
1  2  Cleaner#45 Pack 13x4s   52.0
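The loop above reads the third space-separated token, which fits its two sample rows but not every row in the question (in "Milk 45 Small 1X30m" the third token is "Small"). A minimal sketch of one possible variation, assuming the quantity is always the last token, is shown below. The function name find_value_last_token is hypothetical and not part of the original answer.

```python
import string

# Every ASCII letter except x/X, which acts as the multiplication sign.
LETTERS = set(string.ascii_letters) - {'x', 'X'}

def find_value_last_token(product):
    # Assumption: the quantity is the last whitespace-separated token.
    token = product.split()[-1]
    # Drop unit letters such as s, m or M; keep digits, '.' and x.
    cleaned = ''.join(ch for ch in token if ch not in LETTERS)
    value = 1.0
    # Multiply every non-empty number found around the x separator.
    for part in cleaned.lower().split('x'):
        if part:
            value *= float(part)
    return value

products = [
    'Milk1256 Pack 10x3',
    'Cleaner#45 Pack 13x4s',
    'Milk 45 Small 1X30m ',
    'Cleaner #1379 Small 75s',
    'Cleaner Small 4.45M',
]
values = [find_value_last_token(p) for p in products]
print(values)
```

If the last token contains no digits at all, this sketch returns 1.0, so real data may need an explicit check for that case.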
