Pandas - frequency of each string in comma-separated rows in a DataFrame

Question

I am working with a DataFrame containing two columns: one contains comma-separated strings, the other contains integers. I want to iterate through the string column, collect each unique string from each row, and assign the integer value from the second column to each string. In other words,

A           B
a,b,c,d     0
a,b,c,d     10
a,b,d,e     89
a,b,d,e     111

In this example:

a = 220, b = 220, c = 10, d = 220, e = 210

I am selecting the columns of interest from my csv file:

revcat = DataFrame(data, columns = ['Tag', 'Revenue'])

The following gives me an ndarray of the unique values in 'Tag', which I then convert to another DataFrame:

uniqtag = revcat.Tag.str.split(",").apply(pd.Series).stack().unique()
tag_stack = pd.DataFrame(uniqtag)

I am stuck here. Based on this, how do I iterate through the original 'Tag' column using the unique strings I found, and sum the values from the 'Revenue' column for each tag?

Answer 1

You could do this with Series.str.get_dummies, DataFrame.mul and DataFrame.sum:

df['A'].str.get_dummies(sep=',').mul(df['B'], axis=0).sum()

a    210
b    210
c     10
d    210
e    200

Explanation

df.A.str.get_dummies(sep=',')

This yields a DataFrame that looks like this:

   a  b  c  d  e
0  1  1  1  1  0
1  1  1  1  1  0
2  1  1  0  1  1
3  1  1  0  1  1

Then multiplying by your value column with .mul (axis=0, so each row is scaled by its own value) yields:

     a    b   c    d    e
0    0    0   0    0    0
1   10   10  10   10    0
2   89   89   0   89   89
3  111  111   0  111  111

Finally, applying .sum along the index axis gives the final output shown above.
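The three steps can be run end to end as a minimal sketch. It assumes the sample data from the question; the intermediate names dummies, weighted and totals are only illustrative and do not appear in the answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {"A": ["a,b,c,d", "a,b,c,d", "a,b,d,e", "a,b,d,e"], "B": [0, 10, 89, 111]}
)

# Step 1: one 0/1 indicator column per distinct string found in column A.
dummies = df["A"].str.get_dummies(sep=",")

# Step 2: scale each row of indicators by that row's B value (aligned on the index).
weighted = dummies.mul(df["B"], axis=0)

# Step 3: sum down each column to get the total per string.
totals = weighted.sum()

print(totals)
```

For this data the totals come to 210 for a, b and d, 10 for c, and 200 for e, matching the output of the one-liner.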

Answer 2

Here are the steps I'd use:

Split on "," and use expand=True to get back a DataFrame in which each letter is in its own column. (I'm assuming for now, based on your example, that every row has the same number of splits. Is that true?)
"Melt" that DataFrame so that, instead of several columns per row of the original df, you have a long DataFrame in which each row holds one letter and its index in the original df.
Map the indices to the corresponding values in the B column.
Group by the letter and sum B.

import pandas as pd

data = [
    ("a,b,c,d", 0),
    ("a,b,c,d", 10),
    ("a,b,d,e", 89),
    ("a,b,d,e", 111),
]
df = pd.DataFrame(data, columns=["A", "B"])

#   A       B
# 0 a,b,c,d 0
# 1 a,b,c,d 10
# 2 a,b,d,e 89
# 3 a,b,d,e 111

melted = df.A.str.split(",", expand=True).reset_index().melt(id_vars="index", value_name="A")
melted["B"] = df.B.loc[melted["index"]].values
melted.groupby("A").B.sum()

# A
# a    210
# b    210
# c    10
# d    210
# e    200

Note: I think the sums in the question are incorrect; a few of them seem to be off by 10.
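On the assumption raised in the first step: the sketch below suggests the same steps still work when rows have different numbers of strings. str.split with expand=True pads shorter rows with missing values, and groupby drops missing keys by default. The ragged data and the names ragged and totals are hypothetical, made up for illustration only.

```python
import pandas as pd

# Hypothetical data with a different number of strings per row.
ragged = pd.DataFrame(
    [("a,b", 1), ("a", 2), ("b,c,d", 3)],
    columns=["A", "B"],
)

# Split into one column per position; shorter rows are padded with missing values.
melted = (
    ragged.A.str.split(",", expand=True)
    .reset_index()
    .melt(id_vars="index", value_name="A")
)

# Look up the B value of the original row each string came from.
melted["B"] = ragged.B.loc[melted["index"]].values

# Missing keys (the padding) are dropped by groupby's default dropna=True.
totals = melted.groupby("A").B.sum()

print(totals)
```

Here a appears in the rows valued 1 and 2, b in those valued 1 and 3, and c and d only in the row valued 3, so the totals are 3, 4, 3 and 3.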

Answer 1: score 3, accepted, 2019-04-09 15:46:28
Answer 2: score 1, 2019-04-09 15:37:41
