How to aggregate unique substrings in a column of strings in Python?

I have a .csv file as follows:

Alphabet  Sub alphabet  Value  Strings
A         B             1      AA, AB
A         C             1      AA, AC
A         E             2      AB, AD
A         F             3      AA, AD, AB
D         B             1      AB, AC, AD
D         C             2      AA, AD
D         E             2      AC, AD
D         F             3      AD

Alphabet,Sub alphabet,Value,Strings
A,B,1,"AA, AB"
A,C,1,"AA, AC"
A,E,2,"AB, AD"
A,F,3,"AA, AD, AB"
D,B,1,"AB, AC, AD"
D,C,2,"AA, AD"
D,E,2,"AC, AD"
D,F,3,AD

I want it to return a result like this:

Alphabet  Value  Frequency  %    Strings
A         1      2          50%  AA, AB, AC, AD
A         2      1          25%  AA, AB, AC, AD
A         3      1          25%  AA, AB, AC, AD
D         1      1          25%  AB, AC, AD, AA
D         2      2          50%  AB, AC, AD, AA
D         3      1          25%  AB, AC, AD, AA

Hopefully the expected table above is self-explanatory. The percentage is the row's frequency divided by the total frequency for that Alphabet value. Strings holds the unique substrings found across all rows for that Alphabet value.

My code:

import pandas as pd

df = pd.read_csv("data.csv")
df = df.groupby(["Alphabet", "Value"], as_index=False).agg(Frequency=("Value", "count"))
df["%"] = df["Frequency"] / df.groupby("Alphabet")["Frequency"].transform("sum") * 100
df.to_csv("result.csv", index=None)

Feel free to leave a comment if you need more information.

How can I produce such a .csv file? I would appreciate any help. Thank you in advance!

You can create the Strings column you'd like by splitting the string values on ', ', using explode to create a separate row for each value, and then keeping only the unique values with drop_duplicates:

import pandas as pd

df_inp = pd.read_csv("data.csv")
df_out = df_inp.groupby(["Alphabet", "Value"], as_index=False).agg(Frequency=("Value", "count"))
df_out["%"] = df_out["Frequency"] / df_out.groupby("Alphabet")["Frequency"].transform("sum") * 100

df_str_vals = (
    df_inp[['Alphabet', 'Strings']]
    .assign(str_vals=lambda x: x['Strings'].str.split(', '))
    .explode('str_vals')
    .drop(columns='Strings')
    .drop_duplicates()
)
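
To see what each step in that chain does, here is a minimal sketch on just the first two rows of the question's data. The intermediate variable names (sample, split, exploded, unique_vals) are made up for illustration and are not part of the answer's code.

```python
import pandas as pd

# First two rows of the question's data (Alphabet "A"), inlined for illustration.
sample = pd.DataFrame({"Alphabet": ["A", "A"], "Strings": ["AA, AB", "AA, AC"]})

# str.split turns each cell into a list of substrings.
split = sample.assign(str_vals=sample["Strings"].str.split(", "))

# explode gives every list element its own row; the Alphabet value is repeated.
exploded = split.explode("str_vals").drop(columns="Strings")

# drop_duplicates keeps the first occurrence of each (Alphabet, str_vals) pair.
unique_vals = exploded.drop_duplicates()

print(unique_vals["str_vals"].tolist())  # ['AA', 'AB', 'AC']
```

Note that drop_duplicates compares column values only, so the repeated index labels left behind by explode do not affect which rows are kept.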

Then you can use groupby to join the unique string values for each Alphabet value back together:

df_str_vals = df_str_vals.groupby(["Alphabet"], as_index=False)['str_vals'].apply(', '.join).rename(columns={'str_vals': 'Strings'})

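leading to this result, with one row per Alphabet value and its unique strings joined in order of first appearance:

Alphabet  Strings
A         AA, AB, AC, AD
D         AB, AC, AD, AA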
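
As a possible variation on the join step, the same result can be sketched with agg instead of apply. This is an alternative, not what the answer itself uses, and the small pairs frame below is a hand-written stand-in for the de-duplicated rows.

```python
import pandas as pd

# Hand-written stand-in for the unique (Alphabet, substring) pairs
# that the explode/drop_duplicates step would leave behind.
pairs = pd.DataFrame({
    "Alphabet": ["A", "A", "A", "D", "D"],
    "str_vals": ["AA", "AB", "AC", "AB", "AC"],
})

# Join the substrings of each group into a single comma-separated string.
joined = (
    pairs.groupby("Alphabet", as_index=False)["str_vals"]
    .agg(", ".join)
    .rename(columns={"str_vals": "Strings"})
)

print(joined)
```

Within each group the rows keep their original order, so the joined string lists substrings in the order they first appeared in the input.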

Finally, merge the df_str_vals DataFrame back with your earlier result to obtain the Strings column for the output DataFrame, then write it to the csv file:

df_out = df_out.merge(df_str_vals, on='Alphabet')
df_out.to_csv("result.csv", index=None)

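
Putting the steps together, the whole pipeline can be sketched end to end as follows. To keep it runnable without data.csv, the question's sample data is inlined through io.StringIO (an assumption for the demo; the real code would read the file), and the result is printed rather than written to result.csv.

```python
import io

import pandas as pd

# The question's sample data, inlined so the sketch runs without data.csv.
csv_text = '''Alphabet,Sub alphabet,Value,Strings
A,B,1,"AA, AB"
A,C,1,"AA, AC"
A,E,2,"AB, AD"
A,F,3,"AA, AD, AB"
D,B,1,"AB, AC, AD"
D,C,2,"AA, AD"
D,E,2,"AC, AD"
D,F,3,AD
'''
df_inp = pd.read_csv(io.StringIO(csv_text))

# Frequency of each Value within each Alphabet, and its share of the Alphabet total.
df_out = df_inp.groupby(["Alphabet", "Value"], as_index=False).agg(Frequency=("Value", "count"))
df_out["%"] = df_out["Frequency"] / df_out.groupby("Alphabet")["Frequency"].transform("sum") * 100

# Unique substrings per Alphabet, joined back into one string per Alphabet.
df_str_vals = (
    df_inp[["Alphabet", "Strings"]]
    .assign(str_vals=lambda x: x["Strings"].str.split(", "))
    .explode("str_vals")
    .drop(columns="Strings")
    .drop_duplicates()
    .groupby("Alphabet", as_index=False)["str_vals"]
    .agg(", ".join)
    .rename(columns={"str_vals": "Strings"})
)

# Attach the per-Alphabet Strings column to every (Alphabet, Value) row.
df_out = df_out.merge(df_str_vals, on="Alphabet")
print(df_out.to_string(index=False))
```

The % column comes out numeric (50.0, 25.0, and so on). If the literal "50%" form from the expected table is wanted, one option is to apply df_out["%"].map("{:.0f}%".format) before writing the CSV.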
