I am trying to group by values in a specific column that holds multiple values

I have a large Netflix dataset and I want to find which actors appeared in the most movies/TV shows, specifically in America. First, I created a list of unique actors from the dataset (list3). Then I wrote a nested for loop that goes through each name in list3 and checks every row of df3 (the filtered dataset, 2000+ rows) to see whether the cast column contains that actor's name. I believe using iterrows takes too long.

myDict1 = {}


for name in list3:
    if name not in myDict1:
        myDict1[name] = 0
    for index, row in df3.iterrows():
        if name in row["cast"]:
            myDict1[name] += 1
            
myDict1

Title   cast
Movie1  Robert De Niro, Al Pacino, Tarantino
Movie2  Tom Hanks, Robert De Niro, Tom Cruise
Movie3  Tom Cruise, Zendaya, Seth Rogen

I want my output to be like this:

Name            Count
Robert De Niro  2
Tom Cruise      2

Use:

out = df['cast'].str.split(', ').explode().value_counts()
out = pd.DataFrame({'Name': out.index, 'Count': out.values})


>>> out
             Name  Count
0      Tom Cruise      2
1  Robert De Niro      2
2         Zendaya      1
3      Seth Rogen      1
4       Tarantino      1
5       Al Pacino      1
6       Tom Hanks      1
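
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same split/explode/value_counts idea. It rests on a few assumptions that are not in the question: the `country` column name and its "United States" value are hypothetical stand-ins for however df3 was filtered, and the row with a missing cast is added only to show that `dropna()` keeps the chain from failing on empty cells.

```python
import pandas as pd

# Toy data mirroring the question's table.
# The "country" column and the fourth row are assumptions for illustration.
df = pd.DataFrame({
    "Title": ["Movie1", "Movie2", "Movie3", "Movie4"],
    "cast": [
        "Robert De Niro, Al Pacino, Tarantino",
        "Tom Hanks, Robert De Niro, Tom Cruise",
        "Tom Cruise, Zendaya, Seth Rogen",
        None,  # a title with no cast information
    ],
    "country": ["United States", "United States", "United States", "India"],
})

# Keep only the American titles (hypothetical filter).
us = df[df["country"] == "United States"]

counts = (
    us["cast"]
    .dropna()                    # skip rows without a cast
    .str.split(",")              # one list of names per title
    .explode()                   # one row per (title, actor)
    .str.strip()                 # remove blanks left around the names
    .value_counts()              # appearances per actor, most frequent first
    .rename_axis("Name")
    .reset_index(name="Count")
)
print(counts)
```

Splitting on "," and stripping afterwards is a little more forgiving than splitting on ", " when the spacing in the data is inconsistent.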

l = ['Robert De Niro', 'Tom Cruise']  # actors of interest

# Split cast on ', ' (comma plus space, so no name keeps a leading blank) and explode to one actor per row
df = df.assign(cast=df['cast'].str.split(', ')).explode('cast')
# Keep the listed actors, count the rows per actor, and name the columns
df[df['cast'].str.contains("|".join(l))].groupby('cast').size().reset_index(name='Count').rename(columns={'cast': 'Name'})



              Name  Count
0  Robert De Niro      2
1      Tom Cruise      2
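
One caveat with `str.contains`: it treats the joined names as a regular expression and matches substrings, so a short name in the list can also match longer ones. If exact matches are what you want, `isin` is a possible alternative. The sketch below assumes the sample table from the question; the extra entry "Tom" in the list is hypothetical and is there only to show the difference.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Title": ["Movie1", "Movie2", "Movie3"],
    "cast": [
        "Robert De Niro, Al Pacino, Tarantino",
        "Tom Hanks, Robert De Niro, Tom Cruise",
        "Tom Cruise, Zendaya, Seth Rogen",
    ],
})

# "Tom" is a hypothetical entry used to show substring vs exact matching.
wanted = ["Robert De Niro", "Tom Cruise", "Tom"]

# One row per (title, actor).
exploded = df.assign(cast=df["cast"].str.split(", ")).explode("cast")

# Substring/regex match: "Tom" also picks up "Tom Hanks".
loose = exploded[exploded["cast"].str.contains("|".join(wanted))]

# Exact match: only names that are literally in the list.
exact = exploded[exploded["cast"].isin(wanted)]

out = (
    exact.groupby("cast")
    .size()
    .reset_index(name="Count")
    .rename(columns={"cast": "Name"})
)
print(out)
```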

You could use collections.Counter to get the counts of the actors, after splitting the strings:

from collections import Counter

pd.DataFrame(Counter(df.cast.str.split(", ").sum()).items(), 
             columns = ['Name', 'Count'])
 
             Name  Count
0  Robert De Niro      2
1       Al Pacino      1
2       Tarantino      1
3       Tom Hanks      1
4      Tom Cruise      2
5         Zendaya      1
6      Seth Rogen      1

If you care about speed and you have lots of data, you could move the entire processing into plain Python and rebuild the dataframe:

from itertools import chain
pd.DataFrame(Counter(chain.from_iterable(ent.split(", ") 
                                         for ent in df.cast)).items(), 
             columns = ['Name', 'Count'])
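
Since the goal is the actors with the most appearances, the Counter can also hand back the top entries directly through `most_common`. The following is a minimal sketch on the sample data from the question; the cutoff of 2 and the variable names are arbitrary choices for illustration.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample cast column from the question.
cast = pd.Series([
    "Robert De Niro, Al Pacino, Tarantino",
    "Tom Hanks, Robert De Niro, Tom Cruise",
    "Tom Cruise, Zendaya, Seth Rogen",
], name="cast")

# Count every name across all rows in plain Python; dropna() skips missing casts.
counter = Counter(chain.from_iterable(s.split(", ") for s in cast.dropna()))

# most_common(n) returns the n highest counts, ties in first-seen order.
top = pd.DataFrame(counter.most_common(2), columns=["Name", "Count"])
print(top)
```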
