Pandas Dataframe not recognizing strings as identical for grouping

I have a dataset with several hundred Account numbers and their Descriptions. It has been imported from Excel into a Python dataframe. The Descriptions, in Excel, have varying numbers of leading and trailing white spaces. The Account number is an integer, Description is an object, and End Balance is a float.

I've tried stripping leading and trailing spaces and replacing runs of whitespace with a single space, but when I use groupby it still does not recognize the Descriptions as identical. If I group by Account only, I get 435 rows, which is correct. If I group by Description, I get over 1100 rows, which is not correct (that's the original number of rows). Grouping by Account and Description yields the same result as grouping by Description. This implies to me that the Descriptions are still not seen as identical.

I've also tried not stripping at all and leaving as original with no joy.

Any thoughts of how to make the Descriptions identical?

# Replaces multiple white spaces in string to a single whitespace
PE5901_df['Description'] = PE5901_df['Description'].str.replace(r'\s+', ' ', regex=True)

# Strip leading and trailing spaces from fields to avoid groupby, concat, and merge issues later.
PE5901_df['Description'] = PE5901_df['Description'].str.strip()

# Groupby Account number and Asset name - sums individual rows with identical account numbers.
PE5901_df=PE5901_df.groupby(['Account','Description'],as_index=False).sum()


Here is one way to inspect the data in the Descriptions column. This would show if the issue is whitespace, or something else.

import pandas as pd

description = [
    '111001 cash deposit', '111001 cash deposit ', '111001 cash deposit  ',
    ' 111001 cash deposit', '  111001 cash deposit', '   111001 cash deposit',
]

elements = pd.Series(description).sort_values().unique()

for element in elements:
    print(f">>{element}<<")

Print-out is:

>>   111001 cash deposit<<
>>  111001 cash deposit<<
>> 111001 cash deposit<<
>>111001 cash deposit<<
>>111001 cash deposit <<
>>111001 cash deposit  <<
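The delimiters above expose ordinary spaces, but they would not reveal a character that prints as nothing or merely looks like a space. As a minimal sketch, assuming hypothetical sample values that contain a non-breaking space (U+00A0) and a zero-width space (U+200B), Python's built-in `ascii()` escapes every non-ASCII character so such differences become visible:

```python
import pandas as pd

# Hypothetical sample values: they look alike when printed, but differ.
samples = pd.Series([
    "111001 cash deposit",        # plain ASCII space
    "111001\u00a0cash deposit",   # non-breaking space instead of a space
    "111001 cash deposit\u200b",  # trailing zero-width space
])

# ascii() escapes non-ASCII characters, so hidden differences show up
# as \xa0 or \u200b in the printed text.
escaped = samples.map(ascii).tolist()
for value in escaped:
    print(value)
```

Whether the real Descriptions contain characters like these is only a guess; the point is that an escaped view shows exactly what is stored in each string.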

One can remove leading/trailing whitespace with the .str accessor:

elements = pd.Series(description).str.strip().sort_values().unique()
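To check whether the cleanup actually made the Descriptions identical before grouping, one option is to count distinct descriptions per account. The sketch below uses a small made-up DataFrame whose column names follow the question; the data values and the `clean_text` helper are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the question's data.
df = pd.DataFrame({
    "Account": [111001, 111001, 111001, 222002],
    "Description": [" cash deposit", "cash  deposit ", "cash\u00a0deposit", "petty cash"],
    "End Balance": [100.0, 50.0, 25.0, 10.0],
})


def clean_text(column: pd.Series) -> pd.Series:
    """Collapse runs of whitespace to one space, then trim both ends."""
    return column.str.replace(r"\s+", " ", regex=True).str.strip()


df["Description"] = clean_text(df["Description"])

# If cleaning worked, every account maps to exactly one description.
descriptions_per_account = df.groupby("Account")["Description"].nunique()
print(descriptions_per_account[descriptions_per_account > 1])

# Sum only the balance column for each (Account, Description) pair.
grouped = df.groupby(["Account", "Description"], as_index=False)["End Balance"].sum()
print(grouped)
```

Any account that still appears in the first print-out has descriptions that differ by something other than whitespace, and those rows are the ones worth inspecting.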
