
Extracting the integers from a column of strings

I have two dataframes, longdf and shortdf. longdf is the 'master' list: I need to match values from shortdf against longdf and, for the rows that match, replace values in other columns. Both longdf and shortdf need extensive data cleaning.

The goal is to reach the dataframe 'goal'. I was trying to use a for loop to 1) extract all the numbers in each cell and 2) strip the blanks/whitespace from the cell. First: why doesn't this for loop work? Second: is there a better way to do this?

import pandas as pd

a = pd.Series(['EY', 'BAIN', 'KPMG', 'EY'])
b = pd.Series(['   10wow this is terrible data8 ', '10/ USED TO BE ANOTHER NUMBER/ 2', ' OMG 106 OMG ', '    10?7'])
y = pd.Series(['BAIN', 'KPMG', 'EY', 'EY' ])
z = pd.Series([108, 102, 106, 107 ])

goal = pd.DataFrame
shortdf = pd.DataFrame({'consultant': a, 'invoice_number':b})
longdf = shortdf.copy(deep=True)
goal = pd.DataFrame({'consultant': y, 'invoice_number':z})

shortinvoice = shortdf['invoice_number']
longinvoice = longdf['invoice_number']

frames = [shortinvoice, longinvoice]
new_list=[]

for eachitemer in frames:
    eachitemer.str.extract(r'(\d+)').astype(float)  # extracting all numbers in the df cell
    eachitemer.str.strip()  # strip the blanks/whitespace in between the numbers
    new_list.append(eachitemer)

new_short_df = new_list[0]
new_long_df = new_list[1]

If I understand correctly, you want to take a Series of strings that contain integers and remove every character that isn't a digit. You don't need a for loop for this; a simple regular expression is enough.

b.replace(r'\D+', '', regex=True).astype(int)

Returns:

0    108
1    102
2    106
3    107
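
One caveat, as a hedged aside: if a cell contains no digits at all, the replacement leaves an empty string and .astype(int) raises a ValueError. The sample data here always has digits, so it doesn't come up, but for messier input something like the sketch below may help. The helper name digits_to_int and the extra 'no digits here' row are made up for illustration; pd.to_numeric with errors='coerce' and the nullable 'Int64' dtype are standard pandas.

```python
import pandas as pd

def digits_to_int(series):
    """Hypothetical helper: keep only the digits of each cell, as nullable ints."""
    # Drop every run of non-digit characters (whitespace included).
    digits = series.str.replace(r'\D+', '', regex=True)
    # Cells that had no digits are now empty strings; coerce them to NaN
    # instead of letting the conversion raise.
    numbers = pd.to_numeric(digits, errors='coerce')
    # 'Int64' (capital I) is the nullable integer dtype, so missing values survive.
    return numbers.astype('Int64')

# Two rows from the question plus one invented row without any digits.
messy = pd.Series(['   10wow this is terrible data8 ', ' OMG 106 OMG ', 'no digits here'])
cleaned = digits_to_int(messy)
```

Whether a missing invoice number should become <NA> or be treated as an error depends on your data, so take this as one option rather than the right answer.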

The pattern \D+ matches every run of characters that aren't digits, and replacing each match with an empty string removes everything except the digits. .astype(int) then converts the Series to an integer type. You can merge the result into your final dataframe as normal:

result = pd.DataFrame({
    'consultant': a,
    'invoice_number': b.replace(r'\D+', '', regex=True).astype(int)
})
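
As for why the loop in the question has no effect: pandas string methods such as .str.extract and .str.strip return a new object rather than modifying the original Series, so their results are thrown away and the unchanged Series is what gets appended to new_list. On top of that, .str.extract(r'(\d+)') captures only the first run of digits in each cell, so '    10?7' would give 10 rather than 107. Here is a minimal sketch of a loop-style version that keeps the results. It reuses the question's data and names; the helper clean_invoice is hypothetical.

```python
import pandas as pd

b = pd.Series(['   10wow this is terrible data8 ',
               '10/ USED TO BE ANOTHER NUMBER/ 2',
               ' OMG 106 OMG ',
               '    10?7'])

# str.extract returns a new object and captures only the first digit run per cell.
first_run = b.str.extract(r'(\d+)', expand=False)

def clean_invoice(series):
    """Hypothetical helper: strip all non-digits and convert to int."""
    return series.str.replace(r'\D+', '', regex=True).astype(int)

# Stand-ins for shortdf['invoice_number'] and longdf['invoice_number'].
frames = [b, b.copy()]

# Store what the method returns; the input Series are left untouched.
new_list = [clean_invoice(s) for s in frames]

new_short = new_list[0]
new_long = new_list[1]
```

If you want the cleaned values back in the dataframe, assign them to the column, e.g. shortdf['invoice_number'] = clean_invoice(shortdf['invoice_number']).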
