
Pandas Dataframe not recognizing strings as identical for grouping

I have a dataset with several hundred Account numbers and their Descriptions. It has been imported from Excel into a pandas DataFrame. The Descriptions, in Excel, have varying numbers of leading and trailing white spaces. The Account number is an integer, Description is an object, and End Balance is a float.

I've tried stripping leading and trailing spaces and replacing runs of white space with a single space, but when I use groupby it does not recognize the Descriptions as identical. If I group by Account only, I get 435 rows, which is correct. If I group by Description, I get over 1100 rows, which is not correct (that's the original number of rows). Grouping by Account and Description yields the same result as grouping by Description. This implies to me that the Descriptions are still not seen as identical.

I've also tried not stripping at all and leaving the values as they were, with no joy.

Any thoughts on how to make the Descriptions identical?

# Replace each run of whitespace in the string with a single space
PE5901_df['Description'] = PE5901_df['Description'].str.replace(r'\s+', ' ', regex=True)

# Strip leading and trailing spaces from fields to avoid groupby, concat, and merge issues later.
PE5901_df['Description'] = PE5901_df['Description'].str.strip()

# Group by Account number and Description - sums individual rows that share both values.
PE5901_df = PE5901_df.groupby(['Account', 'Description'], as_index=False).sum()


Here is one way to inspect the data in the Description column. It shows whether the issue is whitespace or something else.

import pandas as pd

description = [
    '111001 cash deposit', '111001 cash deposit ', '111001 cash deposit  ',
    ' 111001 cash deposit', '  111001 cash deposit', '   111001 cash deposit',
]

elements = pd.Series(description).sort_values().unique()

for element in elements:
    print(f">>{element}<<")

The printout is:

>>   111001 cash deposit<<
>>  111001 cash deposit<<
>> 111001 cash deposit<<
>>111001 cash deposit<<
>>111001 cash deposit <<
>>111001 cash deposit  <<
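The `>>...<<` markers make ordinary spaces visible, but some characters stay invisible even then. As a rough sketch of the "something else" case, the variant below prints each unique value with the built-in `ascii()`, which escapes every non-ASCII character. The sample strings are made up for illustration (one with a trailing zero-width space, one with a non-breaking space); nothing in the question confirms that these characters are actually present in the Excel data.

```python
import pandas as pd

# Hypothetical sample: these look the same when printed, but differ in hidden characters.
description = [
    "111001 cash deposit",
    "111001 cash deposit\u200b",  # trailing zero-width space
    "111001\u00a0cash deposit",   # non-breaking space between the words
]

elements = pd.Series(description).unique()

for element in elements:
    # ascii() escapes non-ASCII characters, so hidden ones show up as \x.. or \u.... sequences
    print(ascii(element))

# Stripping alone does not merge these values: the zero-width space is not
# treated as whitespace, and the non-breaking space sits inside the string.
stripped = pd.Series(description).str.strip()
n_unique_after_strip = stripped.nunique()
print(n_unique_after_strip)
```

If escapes such as `\u200b` show up in the real data, they would need to be replaced explicitly before grouping.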

One can remove leading/trailing whitespace with the .str accessor:

elements = pd.Series(description).str.strip().sort_values().unique()
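If the Descriptions still do not collapse after stripping, it may help to list only the accounts that end up with more than one distinct Description. The following is a minimal sketch on made-up data: the column names follow the question, but the values (including the difference in capitalization) are assumptions chosen to show what a non-whitespace mismatch could look like.

```python
import pandas as pd

# Hypothetical data; column names follow the question, values are invented.
df = pd.DataFrame({
    "Account": [111001, 111001, 111001, 222002, 222002],
    "Description": [" cash deposit", "cash deposit  ", "cash  deposit",
                    "petty cash", "Petty cash"],
    "End Balance": [10.0, 20.0, 30.0, 5.0, 7.5],
})

# Same cleaning as in the question: collapse whitespace runs, then strip.
df["Description"] = (
    df["Description"].str.replace(r"\s+", " ", regex=True).str.strip()
)

# Count distinct Descriptions per Account; more than one means they still differ.
per_account = df.groupby("Account")["Description"].nunique()
suspect_accounts = per_account[per_account > 1].index

# Show the distinct Descriptions for just those accounts.
suspects = (
    df.loc[df["Account"].isin(suspect_accounts), ["Account", "Description"]]
    .drop_duplicates()
    .sort_values(["Account", "Description"])
)
print(suspects)
```

Looking at the remaining variants side by side usually makes it clear whether the difference is case, punctuation, hidden characters, or genuinely different text.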


 