Pandas - how often each string in comma separated rows is present within dataframe

I am working with a DataFrame containing two columns: one contains comma-separated strings, the other contains integers. I want to go through the string column, collect each unique string from each row, and sum the integer values from the second column for each string. In other words,

A           B
a,b,c,d     0
a,b,c,d     10
a,b,d,e     89
a,b,d,e     111

In this example:

a = 220, b = 220, c = 10, d = 220, e = 210

I am selecting the columns of interest from my CSV file:

revcat = pd.DataFrame(data, columns=['Tag', 'Revenue'])

The following gives me an ndarray of the unique values in 'Tag', which I then convert to another DataFrame:

uniqtag = revcat.Tag.str.split(",").apply(pd.Series).stack().unique()
tag_stack = pd.DataFrame(uniqtag)

I am stuck here. Based on this, how do I go through the original 'Tag' column using the unique strings I found, and sum the values from the 'Revenue' column for each tag?

You could do this with Series.str.get_dummies, Series.mul and Series.sum:

df['A'].str.get_dummies(sep=',').mul(df['B'], axis=0).sum()

a    210
b    210
c     10
d    210
e    200

Explanation

df.A.str.get_dummies(sep=',')

This yields a DataFrame of 0/1 indicators, one column per distinct string, that looks like this:

   a  b  c  d  e
0  1  1  1  1  0
1  1  1  1  1  0
2  1  1  0  1  1
3  1  1  0  1  1

Then using .mul with your value column (with axis=0, so each row is multiplied by its own value) would yield:

     a    b   c    d    e
0    0    0   0    0    0
1   10   10  10   10    0
2   89   89   0   89   89
3  111  111   0  111  111

Finally, applying .sum along the index axis gives you your final output:

a    210
b    210
c     10
d    210
e    200
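
For reference, the three steps above can be run end to end on the sample data from the question. This is only a minimal sketch: the frame `df` is rebuilt from the question's table, and the intermediate names `dummies`, `weighted` and `totals` are introduced here purely for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {"A": ["a,b,c,d", "a,b,c,d", "a,b,d,e", "a,b,d,e"], "B": [0, 10, 89, 111]}
)

# Step 1: one indicator column per distinct string (1 if present in the row).
dummies = df["A"].str.get_dummies(sep=",")

# Step 2: multiply each row of indicators by that row's value in B.
weighted = dummies.mul(df["B"], axis=0)

# Step 3: sum down the rows to get one total per string.
totals = weighted.sum()
print(totals)
```

Because get_dummies produces a 0/1 indicator, a string that appears more than once in the same row is still counted only once for that row.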

Here are the steps I'd use:

  1. Split on "," and use expand=True to get back a dataframe where each letter is in its own column. (Based on your example, I'm assuming you always have the same number of splits. Is that true?)

  2. "Melt" that dataframe so that, instead of multiple columns created from each row of the original df, you have a long dataframe where each row holds a letter and its index in the original df.

  3. Convert the indices to the corresponding values in the B column.

  4. Group by the letter and sum B.

import pandas as pd

data = [
    ("a,b,c,d", 0),
    ("a,b,c,d", 10),
    ("a,b,d,e", 89),
    ("a,b,d,e", 111),
]
df = pd.DataFrame(data, columns=["A", "B"])

#   A       B
# 0 a,b,c,d 0
# 1 a,b,c,d 10
# 2 a,b,d,e 89
# 3 a,b,d,e 111

melted = df.A.str.split(",", expand=True).reset_index().melt(id_vars="index", value_name="A")
melted["B"] = df.B.loc[melted["index"]].values
melted.groupby("A").B.sum()

# A
# a    210
# b    210
# c     10
# d    210
# e    200

Note - I think the sums in the question are incorrect; a few of them seem to be off by 10.
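
If the number of comma-separated items can vary from row to row, a split-and-explode variant may be worth considering. This is a sketch of an alternative, not the method used in the answers above; it assumes pandas 0.25 or later, where `DataFrame.explode` is available, and reuses the sample data from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {"A": ["a,b,c,d", "a,b,c,d", "a,b,d,e", "a,b,d,e"], "B": [0, 10, 89, 111]}
)

totals = (
    df.assign(A=df["A"].str.split(","))  # each cell becomes a list of strings
    .explode("A")                         # one row per (string, B) pair
    .groupby("A")["B"]
    .sum()
)
print(totals)
```

Unlike the get_dummies approach, this counts a row's value once per occurrence, so a string repeated within a single row would be summed more than once.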
