How do I group the values in a specific column that contains multiple values per row?

Question

I have a huge Netflix dataset, and I am trying to find which actors appeared in the most movies/TV shows, specifically in America. First, I created a list of unique actors from the dataset. Then I wrote a nested for loop that goes through each name in list3 (the unique actors) and checks every row of df3 (the filtered dataset, 2000+ rows) to see whether the cast column contains the current actor's name. I believe using iterrows takes too long.

myDict1 = {}


for name in list3:
    if name not in myDict1:
        myDict1[name] = 0
    for index, row in df3.iterrows():
        if name in row["cast"]:
            myDict1[name] += 1
            
myDict1

Title	cast
Movie1	Robert De Niro, Al Pacino, Tarantino
Movie2	Tom Hanks, Robert De Niro, Tom Cruise
Movie3	Tom Cruise, Zendaya, Seth Rogen

I want my output to look like this:

Name	Count
Robert De Niro	2
Tom Cruise	2

Answer 1

Use:

out = df['cast'].str.split(', ').explode().value_counts()
out = pd.DataFrame({'Name': out.index, 'Count': out.values})


>>> out
             Name  Count
0      Tom Cruise      2
1  Robert De Niro      2
2         Zendaya      1
3      Seth Rogen      1
4       Tarantino      1
5       Al Pacino      1
6       Tom Hanks      1
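
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same chain, broken into steps. The DataFrame is rebuilt from the sample table in the question. The variable names `names` and `counts`, and the use of `rename_axis` plus `reset_index` to label the columns, are my own choices and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Title": ["Movie1", "Movie2", "Movie3"],
    "cast": [
        "Robert De Niro, Al Pacino, Tarantino",
        "Tom Hanks, Robert De Niro, Tom Cruise",
        "Tom Cruise, Zendaya, Seth Rogen",
    ],
})

names = df["cast"].str.split(", ")  # each cell becomes a list of names
names = names.explode()             # one row per (title, actor) pair
counts = names.value_counts()       # appearances per actor, highest first

# Label the index and the values to get the two-column result.
out = counts.rename_axis("Name").reset_index(name="Count")
print(out)
```

The relative order of actors with the same count is not something I would rely on, so look rows up by name when checking the result.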

Answer 2

l = ['Robert De Niro', 'Tom Cruise']  # actors to count

# Split on ', ' (comma plus space) so no name keeps a leading space, then give each actor its own row
df = df.assign(cast=df['cast'].str.split(', ')).explode('cast')
# Keep rows matching any name in l, count rows per actor, and name the two columns
df[df['cast'].str.contains("|".join(l))].groupby('cast').size().rename_axis('Name').reset_index(name='Count')



              Name  Count
0  Robert De Niro      2
1      Tom Cruise      2
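
One thing to watch with `str.contains` (and with `name in row["cast"]` in the question): both match substrings, so a short name can also match a longer one. The sketch below shows the difference against an exact match with `isin`. The data is hypothetical; "Tom Cruise Jr." is invented purely to trigger the overlap, and `wanted` is just an illustrative name.

```python
import pandas as pd

# Hypothetical data: "Tom Cruise Jr." is made up to show the substring issue.
df = pd.DataFrame({"cast": ["Tom Cruise, Zendaya", "Tom Cruise Jr., Tom Hanks"]})
wanted = ["Tom Cruise"]

# One row per actor, as in the answers above.
actors = df["cast"].str.split(", ").explode()

# Regex/substring match: also catches the longer name.
substring_hits = actors[actors.str.contains("|".join(wanted))]
# Exact match: only rows equal to one of the wanted names.
exact_hits = actors[actors.isin(wanted)]

print(substring_hits.tolist())  # ['Tom Cruise', 'Tom Cruise Jr.']
print(exact_hits.tolist())      # ['Tom Cruise']
```

If the goal is to count specific people, `isin` is probably the safer filter once the cast column has been split and exploded.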

Answer 3

You could use collections.Counter to get the counts of the actors, after splitting the strings:

from collections import Counter

pd.DataFrame(Counter(df.cast.str.split(", ").sum()).items(), 
             columns = ['Name', 'Count'])
 
             Name  Count
0  Robert De Niro      2
1       Al Pacino      1
2       Tarantino      1
3       Tom Hanks      1
4      Tom Cruise      2
5         Zendaya      1
6      Seth Rogen      1

If speed matters and you have a lot of data, you could do the whole computation in plain Python and rebuild the DataFrame afterwards:

from itertools import chain
pd.DataFrame(Counter(chain.from_iterable(ent.split(", ") 
                                         for ent in df.cast)).items(), 
             columns = ['Name', 'Count'])
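
As a runnable variant of the plain-Python approach, the sketch below rebuilds the sample data and uses `Counter.most_common()` so the result comes back sorted by count. The `dropna()` call is my own addition, on the assumption that a real dataset may have rows with no cast listed (a missing value has no `.split` method); the sample table itself has none.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "cast": [
        "Robert De Niro, Al Pacino, Tarantino",
        "Tom Hanks, Robert De Niro, Tom Cruise",
        "Tom Cruise, Zendaya, Seth Rogen",
    ],
})

# Assumption: drop missing cast values before splitting in plain Python.
cast_strings = df["cast"].dropna()

# Flatten all the per-row name lists into one stream and count them.
counter = Counter(chain.from_iterable(s.split(", ") for s in cast_strings))

# most_common() returns (name, count) pairs, highest count first.
out = pd.DataFrame(counter.most_common(), columns=["Name", "Count"])
print(out.head(2))
```

Whether this actually beats the pandas version depends on the data, so it is worth timing both on the real dataset.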

3 answers

Answer 1: score 3, accepted, 2021-04-15 01:16:35
Answer 2: score 1, 2021-04-15 01:29:49
Answer 3: score 1, 2021-04-15 02:14:01
