
Matching and counting combinations in Python/pandas

I have a huge file with around 1 million rows and 4 columns. The columns that I want to analyse are A and C. The names in column A repeat, but each occurrence is paired with a different name in column C. I am looking for 4 specific names in column C and want the corresponding names in column A. For every name in column A, I want whichever combination of those 4 names from column C it appears with, and I also want to count how many times each combination occurs. I know it is confusing, so here is an example:

Original file: I am looking for TI, NB, CC and LR in column C and their corresponding names in column A.

    A                B           C         D
   GB1                          TI
   GB2                          NB
   GB3                          VH
   GB1                          NB
   GB2                          CC
   GB6                          TI
   GB1                          LR
   GB1                          CC
   GB8                          JK
   GB9                          TI

Results that I want:

 Name:         Name from column C:
  GB1          TI,NB,LR,CC
  GB2          NB,CC
  GB6          TI
  GB9          TI

I also want to know how many of each combination there are (around 20 possible combinations):

Combination:          Number:
TI,NB,LR,CC             1 
NB,CC                   1
TI                      2

Thank you,

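To get the combinations, filter the data frame to the names of interest, group it by A, and join the items from column C within each group. Sorting the items before joining makes the same set of names always produce the same string, which is what allows the combinations to be counted. To count how many times each combination occurs, call value_counts() on the result: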

items = ["TI", "NB", "CC", "LR"]
# df is the data frame loaded from the file, with columns A, B, C and D
# use isin to keep only the rows whose C value is one of the items of interest
# drop the sort_values call if the order within each combination matters
df1 = df[df.C.isin(items)].groupby("A").C.apply(lambda g: ','.join(g.sort_values()))
df1

#A
#GB1    CC,LR,NB,TI
#GB2          CC,NB
#GB6             TI
#GB9             TI
#Name: C, dtype: object

df1.value_counts()

#TI             2
#CC,LR,NB,TI    1
#CC,NB          1
#Name: C, dtype: int64
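
For anyone who wants to try this end to end, here is a minimal self-contained sketch that rebuilds the sample rows from the question and runs the same filter, group, join and count steps. Columns B and D are left out because they are empty in the sample, and the variable names combos and counts are my own choices, not anything from the original post.

```python
import pandas as pd

# Sample rows from the question; columns B and D are omitted (assumption:
# they play no part in the analysis).
df = pd.DataFrame({
    "A": ["GB1", "GB2", "GB3", "GB1", "GB2", "GB6", "GB1", "GB1", "GB8", "GB9"],
    "C": ["TI", "NB", "VH", "NB", "CC", "TI", "LR", "CC", "JK", "TI"],
})

items = ["TI", "NB", "CC", "LR"]

# Keep only the rows of interest, then build one sorted,
# comma-joined string per name in column A.
combos = (
    df[df["C"].isin(items)]
    .groupby("A")["C"]
    .apply(lambda g: ",".join(sorted(g)))
)

# Count how many names in A share each combination.
counts = combos.value_counts()

print(combos)
print(counts)
```

On a real file of around 1 million rows the same steps should apply unchanged once the data is loaded into df; how it is loaded depends on the file format, which the question does not state.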

