
How to create a column with values from a column of duplicated rows, separated by commas, in a DataFrame in Python Pandas?

I have a Pandas DataFrame like the one below (the data types of "ID" and "COL1" are "object"):

ID  | COL1 | COL2 | COL3
----|------|------|----
123 | ABc  | 55   | G4
123 | ABc  | 55   | G4
123 | DD   | 55   | G4
44  | RoR  | 41   | P0
44  | RoR  | 41   | P0
55  | XX   | 456  | RR
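
For anyone who wants to reproduce the table, here is a minimal sketch of how it could be built. It assumes the second row's COL1 is "ABc" (as the expected result further down implies) and that "ID" and "COL1" are stored as strings; depending on the pandas version, string columns may be reported as object or as a dedicated string dtype.

```python
import pandas as pd

# Sample data from the question. "ID" and "COL1" are strings;
# "COL2" is assumed to be an integer column (the question does not say).
df = pd.DataFrame({
    "ID":   ["123", "123", "123", "44", "44", "55"],
    "COL1": ["ABc", "ABc", "DD", "RoR", "RoR", "XX"],
    "COL2": [55, 55, 55, 41, 41, 456],
    "COL3": ["G4", "G4", "G4", "P0", "P0", "RR"],
})

print(df)
print(df.dtypes)
```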

And I need to:

  1. Create a new column "COL1_cum" containing all values from "COL1" per ID, separated by commas
  2. Drop duplicated IDs
  3. Create a new column "COL1_num" with the number of different values in "COL1" per "ID"

So as a result I need something like this:

ID  | COL1_cum | COL1_num | COL2 | COL3
----|----------|----------|-----|-----
123 | ABc, DD  | 2        | 55  | G4
44  | RoR      | 1        | 41  | P0
55  | XX       | 1        | 456 | RR

Explanation for COL1_num:

  • for ID = 123, COL1_num = 2 because for ID = 123 "COL1" has 2 different values: "ABc" and "DD"
  • for ID = 44, COL1_num = 1 because for ID = 44 "COL1" has 1 value: "RoR"
  • for ID = 55, COL1_num = 1 because for ID = 55 "COL1" has 1 value: "XX"

How can I do that in Python Pandas?

If the input data has only 2 columns, use DataFrame.drop_duplicates and then aggregate with join:

df1 = df.drop_duplicates().groupby('ID')['COL1'].agg(','.join).reset_index(name='COL1_cum')

If there are more columns, specify the ones to check for duplicates:

df1 = (df.drop_duplicates(['ID','COL1'])
         .groupby('ID')['COL1']
         .agg(','.join)
         .reset_index(name='COL1_cum'))
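
To see why the subset matters, here is a small sketch on hypothetical data (not the question's) in which the two "ABc" rows differ in COL2, so they are not duplicates across all columns. The names `all_cols` and `by_subset` are made up for this illustration.

```python
import pandas as pd

# Hypothetical data: the two "ABc" rows for ID 123 differ in COL2.
df = pd.DataFrame({
    "ID":   ["123", "123", "123", "44"],
    "COL1": ["ABc", "ABc", "DD", "RoR"],
    "COL2": [55, 56, 55, 41],
})

# Without a subset, both "ABc" rows survive, so the value is joined twice.
all_cols = (df.drop_duplicates()
              .groupby("ID")["COL1"]
              .agg(",".join)
              .reset_index(name="COL1_cum"))

# With a subset, only the first row per (ID, COL1) pair is kept.
by_subset = (df.drop_duplicates(["ID", "COL1"])
               .groupby("ID")["COL1"]
               .agg(",".join)
               .reset_index(name="COL1_cum"))

print(all_cols)
print(by_subset)
```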

EDIT:

First remove duplicates across all columns:

df1 = df.drop_duplicates()
print (df1)
    ID COL1  COL2 COL3
0  123  ABc    55   G4
2  123   DD    55   G4
3   44  RoR    41   P0
5   55   XX   456   RR

Then aggregate with join and size, and take the first value of the other columns (because they are the same within each ID group):

df2 = (df1.groupby('ID', sort=False, as_index=False)
          .agg(COL1_cum=('COL1', ','.join),
               COL1_num=('COL1', 'size'),
               COL2=('COL2', 'first'),
               COL3=('COL3', 'first')))
print(df2)
    ID COL1_cum  COL1_num  COL2 COL3
0  123   ABc,DD         2    55   G4
1   44      RoR         1    41   P0
2   55       XX         1   456   RR
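
Note that 'size' counts rows, so it only equals the number of distinct COL1 values when repeated (ID, COL1) pairs have already been removed. A small sketch on hypothetical data, with the made-up column names `n_rows` and `n_distinct`, shows the difference:

```python
import pandas as pd

# Hypothetical data where the repeated "ABc" was NOT removed beforehand.
df = pd.DataFrame({
    "ID":   ["123", "123", "123"],
    "COL1": ["ABc", "ABc", "DD"],
})

counts = (df.groupby("ID", as_index=False)
            .agg(n_rows=("COL1", "size"),         # number of rows in the group
                 n_distinct=("COL1", "nunique"))) # number of distinct values
print(counts)
```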

EDIT2: If the real data is not duplicated across all columns, a possible solution:

df2 = (df.groupby('ID', sort=False, as_index=False)
          .agg(COL1_cum=('COL1', lambda x: ','.join(dict.fromkeys(x))),
               COL1_num=('COL1', 'nunique'),
               COL2=('COL2', 'first'),
               COL3=('COL3', 'first')))
print(df2)
    ID COL1_cum  COL1_num  COL2 COL3
0  123   ABc,DD         2    55   G4
1   44      RoR         1    41   P0
2   55       XX         1   456   RR
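
For completeness, here is a self-contained sketch of this last approach on the question's sample data. It uses ", " as the separator to match the requested output, and `join_unique` is a hypothetical helper name, not part of pandas. It relies on `dict.fromkeys` keeping the first occurrence of each value in order, and it assumes COL1 has no missing values (joining would fail on NaN).

```python
import pandas as pd

# Sample data from the question (second row assumed to be "ABc").
df = pd.DataFrame({
    "ID":   ["123", "123", "123", "44", "44", "55"],
    "COL1": ["ABc", "ABc", "DD", "RoR", "RoR", "XX"],
    "COL2": [55, 55, 55, 41, 41, 456],
    "COL3": ["G4", "G4", "G4", "P0", "P0", "RR"],
})

def join_unique(values, sep=", "):
    """Join distinct values in first-seen order (hypothetical helper)."""
    return sep.join(dict.fromkeys(values))

result = (df.groupby("ID", sort=False, as_index=False)
            .agg(COL1_cum=("COL1", join_unique),
                 COL1_num=("COL1", "nunique"),
                 COL2=("COL2", "first"),
                 COL3=("COL3", "first")))
print(result)
```

Because the deduplication happens inside the aggregation, no separate drop_duplicates step is needed here.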
