How to aggregate unique substrings in a column of strings in Python?
I have a .csv file as follows:
| Alphabet | Sub alphabet | Value | Strings |
| --- | --- | --- | --- |
| A | B | 1 | AA, AB |
| A | C | 1 | AA, AC |
| A | E | 2 | AB, AD |
| A | F | 3 | AA, AD, AB |
| D | B | 1 | AB, AC, AD |
| D | C | 2 | AA, AD |
| D | E | 2 | AC, AD |
| D | F | 3 | AD |
Alphabet,Sub alphabet,Value,Strings
A,B,1,"AA, AB"
A,C,1,"AA, AC"
A,E,2,"AB, AD"
A,F,3,"AA, AD, AB"
D,B,1,"AB, AC, AD"
D,C,2,"AA, AD"
D,E,2,"AC, AD"
D,F,3,AD
I want it to return a result like this:
| Alphabet | Value | Frequency | % | Strings |
| --- | --- | --- | --- | --- |
| A | 1 | 2 | 50% | AA, AB, AC, AD |
| A | 2 | 1 | 25% | AA, AB, AC, AD |
| A | 3 | 1 | 25% | AA, AB, AC, AD |
| D | 1 | 1 | 25% | AB, AC, AD, AA |
| D | 2 | 2 | 50% | AB, AC, AD, AA |
| D | 3 | 1 | 25% | AB, AC, AD, AA |
I believe the expected table above is self-explanatory. The percentage is the row's frequency divided by the total frequency of its Alphabet value. Strings lists the unique strings that appear in the rows of the corresponding Alphabet value.
My code:
import pandas as pd
df = pd.read_csv("data.csv")
df = df.groupby(["Alphabet", "Value"], as_index=False).agg(Frequency=("Value", "count"))
df["%"] = df["Frequency"] / df.groupby("Alphabet")["Frequency"].transform("sum") * 100
df.to_csv("result.csv", index=None)
Feel free to leave a comment if you need more information.
How can I make such a .csv file? I would appreciate any help. Thank you in advance!
You can create the `Strings` column you'd like by splitting the string values on `', '`, using `explode` to create a separate row for each value, and then keeping only the unique values with `drop_duplicates`:
import pandas as pd
df_inp = pd.read_csv("data.csv")
df_out = df_inp.groupby(["Alphabet", "Value"], as_index=False).agg(Frequency=("Value", "count"))
df_out["%"] = df_out["Frequency"] / df_out.groupby("Alphabet")["Frequency"].transform("sum") * 100
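# Split each cell into a list, give every element its own row,
# then drop repeated (Alphabet, str_vals) pairs.
df_str_vals = (
    df_inp[['Alphabet', 'Strings']]
    .assign(str_vals=lambda x: x['Strings'].str.split(', '))
    .explode('str_vals')
    .drop(columns='Strings')
    .drop_duplicates()
)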
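To see what this chain produces, here is a minimal sketch that inlines the question's data through `io.StringIO` instead of reading `data.csv`. The names `CSV_TEXT` and `unique_pairs` are only illustrative and are not part of the original answer.

```python
import io

import pandas as pd

# Inline copy of the question's data so the sketch runs without data.csv.
CSV_TEXT = '''Alphabet,Sub alphabet,Value,Strings
A,B,1,"AA, AB"
A,C,1,"AA, AC"
A,E,2,"AB, AD"
A,F,3,"AA, AD, AB"
D,B,1,"AB, AC, AD"
D,C,2,"AA, AD"
D,E,2,"AC, AD"
D,F,3,AD
'''

df_inp = pd.read_csv(io.StringIO(CSV_TEXT))

# Split each cell into a list, give every list element its own row,
# then keep the first occurrence of each (Alphabet, str_vals) pair.
df_str_vals = (
    df_inp[["Alphabet", "Strings"]]
    .assign(str_vals=lambda x: x["Strings"].str.split(", "))
    .explode("str_vals")
    .drop(columns="Strings")
    .drop_duplicates()
)

# Plain tuples make the intermediate result easy to inspect.
unique_pairs = list(df_str_vals.itertuples(index=False, name=None))
print(unique_pairs)
```

Because `drop_duplicates` keeps the first occurrence, the substrings stay in the order in which they first appear for each `Alphabet` value.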
Then you can use `groupby` to join the unique string values for each `Alphabet` value back together:
df_str_vals = df_str_vals.groupby(["Alphabet"], as_index=False)['str_vals'].apply(', '.join).rename(columns={'str_vals': 'Strings'})
This leads to one row per `Alphabet` value: `AA, AB, AC, AD` for A and `AB, AC, AD, AA` for D.
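The join step can be tried in isolation with a small hand-built frame. This is only a sketch: the frame below is a hypothetical stand-in for the exploded, de-duplicated data, and it uses `agg` with `rename` rather than the `apply` call above, which should give the same two rows.

```python
import pandas as pd

# Hypothetical long-form frame: one row per unique (Alphabet, str_vals) pair.
df_str_vals = pd.DataFrame({
    "Alphabet": ["A", "A", "A", "A", "D", "D", "D", "D"],
    "str_vals": ["AA", "AB", "AC", "AD", "AB", "AC", "AD", "AA"],
})

# Join each group's values with ", ", keeping their within-group order.
joined = (
    df_str_vals.groupby("Alphabet")["str_vals"]
    .agg(", ".join)
    .rename("Strings")
    .reset_index()
)

print(joined.to_dict("records"))
```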
Finally, merge the `df_str_vals` dataframe back with your earlier result to obtain the `Strings` column for the output dataframe, then write it to the csv file:
df_out = df_out.merge(df_str_vals, on='Alphabet')
df_out.to_csv("result.csv", index=None)
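Note that the expected table shows the percentage as text such as `50%`, while the code above writes plain numbers such as `50.0`. If the text form is wanted, one possible end-to-end sketch follows. It makes several assumptions that are not in the original answer: the data is inlined via `io.StringIO`, `size()` replaces the named `count` aggregation, the hypothetical helper `unique_joined` collects the unique substrings with `dict.fromkeys`, and percentages are rounded to whole numbers.

```python
import io

import pandas as pd

# Inline copy of the question's data so the sketch runs without data.csv.
CSV_TEXT = '''Alphabet,Sub alphabet,Value,Strings
A,B,1,"AA, AB"
A,C,1,"AA, AC"
A,E,2,"AB, AD"
A,F,3,"AA, AD, AB"
D,B,1,"AB, AC, AD"
D,C,2,"AA, AD"
D,E,2,"AC, AD"
D,F,3,AD
'''

df_inp = pd.read_csv(io.StringIO(CSV_TEXT))

# Frequency of each (Alphabet, Value) pair.
df_out = (
    df_inp.groupby(["Alphabet", "Value"]).size().reset_index(name="Frequency")
)

# Share of each pair within its Alphabet group.
share = df_out["Frequency"] / df_out.groupby("Alphabet")["Frequency"].transform("sum")
# Assumption: "%" should be text like "50%", rounded to a whole number.
df_out["%"] = (share * 100).round().astype(int).astype(str) + "%"


def unique_joined(cells):
    """Join the unique substrings of a group's cells in first-seen order."""
    parts = [part for cell in cells for part in cell.split(", ")]
    return ", ".join(dict.fromkeys(parts))


# One row per Alphabet value with its unique substrings.
df_strings = df_inp.groupby("Alphabet")["Strings"].agg(unique_joined).reset_index()

df_out = df_out.merge(df_strings, on="Alphabet")

# Pass a path such as "result.csv" to write a file instead of returning text.
result_csv = df_out.to_csv(index=False)
print(result_csv)
```

Using `dict.fromkeys` keeps the first-seen order of the substrings, which matches the order shown in the expected table.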