I have a huge file with around 1 million rows and 4 columns. The columns that I want to analyse are A and C. Each name in column A appears more than once, but each time it is paired with a different name in column C. I am looking for 4 specific names in column C and I want their corresponding names in column A. For each name in column A, I want whichever combination of those 4 names it is paired with, and I also want to count how many times each combination occurs. I know this is confusing, so here is an example:
Original file: I am looking for TI, NB, CC and LR in column C and their corresponding names in column A.
A B C D
GB1 TI
GB2 NB
GB3 VH
GB1 NB
GB2 CC
GB6 TI
GB1 LR
GB1 CC
GB8 JK
GB9 TI
Results that I want:
Name: Name from column C:
GB1 TI,NB,LR,CC
GB2 NB,CC
GB6 TI
GB9 TI
I also want to know how many times each combination occurs (there are around 20 possible combinations):
Combination: Number:
TI,NB,LR,CC 1
NB,CC 1
TI 2
Thank you,
To find all the combinations, you can group the data frame by A and join the items from column C within each group, sorting them first so that the same set of items always produces the same string when counting. To find out how many times each combination occurs, you can call value_counts() on the result:
items = ["TI", "NB", "CC", "LR"]
# use isin to keep only the rows whose C value is one of the items of interest
# drop the sort_values call if the order of items within a combination matters
df1 = df[df.C.isin(items)].groupby("A").C.apply(lambda g: ','.join(g.sort_values()))
df1
#A
#GB1 CC,LR,NB,TI
#GB2 CC,NB
#GB6 TI
#GB9 TI
#Name: C, dtype: object
df1.value_counts()
#TI 2
#CC,LR,NB,TI 1
#CC,NB 1
#Name: C, dtype: int64
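If pandas is not available, or the file is too large to load in one go, the same grouping and counting can be sketched with the standard library alone. This is not the method used above, only a rough equivalent: the function name combos_by_name and the hard-coded rows list (copied from the sample in the question) are made up for illustration, and in practice the (A, C) pairs would come from reading the file row by row. It collects the items in a set, so it assumes each A/C pairing occurs only once, as the question describes.

```python
from collections import Counter, defaultdict


def combos_by_name(rows, items):
    """Map each column-A name to the sorted, comma-joined column-C items it is paired with.

    rows  -- iterable of (A, C) pairs (hypothetical input format)
    items -- the column-C names of interest
    """
    wanted = set(items)
    found = defaultdict(set)
    for a, c in rows:
        if c in wanted:  # skip rows whose C value is not of interest
            found[a].add(c)
    # Sort so that the same set of items always yields the same string.
    return {a: ",".join(sorted(cs)) for a, cs in found.items()}


# Sample (A, C) pairs taken from the question.
rows = [
    ("GB1", "TI"), ("GB2", "NB"), ("GB3", "VH"), ("GB1", "NB"), ("GB2", "CC"),
    ("GB6", "TI"), ("GB1", "LR"), ("GB1", "CC"), ("GB8", "JK"), ("GB9", "TI"),
]

combos = combos_by_name(rows, ["TI", "NB", "CC", "LR"])
counts = Counter(combos.values())  # how many names share each combination

for name, combo in combos.items():
    print(name, combo)
for combo, n in counts.most_common():
    print(combo, n)
```

Because rows can be any iterable, a generator that yields pairs from csv.reader would let this run over the full file without holding all of it in memory.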