How do I group by values in a column that holds multiple values per row?
I have a large Netflix dataset, and I am trying to find which actors appeared in the most movies/TV shows, specifically in America. First, I created a list of unique actors from the dataset. Then I wrote a nested for loop that goes through each name in list3 (the unique actors) and checks every row of df3 (the filtered dataset, 2000+ rows) to see whether the cast column contains that actor's name. I believe using iterrows takes too long:
myDict1 = {}
for name in list3:
    if name not in myDict1:
        myDict1[name] = 0
    for index, row in df3.iterrows():
        if name in row["cast"]:
            myDict1[name] += 1
myDict1
Title | cast
---|---
Movie1 | Robert De Niro, Al Pacino, Tarantino
Movie2 | Tom Hanks, Robert De Niro, Tom Cruise
Movie3 | Tom Cruise, Zendaya, Seth Rogen
I want my output to look like this:
Name | Count
---|---
Robert De Niro | 2
Tom Cruise | 2
Use:
out = df['cast'].str.split(', ').explode().value_counts()
out = pd.DataFrame({'Name': out.index, 'Count': out.values})
>>> out
Name Count
0 Tom Cruise 2
1 Robert De Niro 2
2 Zendaya 1
3 Seth Rogen 1
4 Tarantino 1
5 Al Pacino 1
6 Tom Hanks 1
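As a self-contained sketch of the same split/explode/value_counts idea, the snippet below rebuilds the sample table from the question and counts each name. The `dropna` and `str.strip` steps are assumptions added here to guard against rows with no cast and against inconsistent spacing after commas; they are not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "Title": ["Movie1", "Movie2", "Movie3"],
    "cast": [
        "Robert De Niro, Al Pacino, Tarantino",
        "Tom Hanks, Robert De Niro, Tom Cruise",
        "Tom Cruise, Zendaya, Seth Rogen",
    ],
})

counts = (
    df["cast"]
    .dropna()                    # assumption: some rows may have no cast listed
    .str.split(",")              # one list of names per title
    .explode()                   # one name per row
    .str.strip()                 # remove stray spaces around each name
    .value_counts()              # occurrences per name, most frequent first
    .rename_axis("Name")         # name the index so it becomes the "Name" column
    .reset_index(name="Count")   # turn the Series into a two-column DataFrame
)
print(counts)
```

Using `rename_axis` followed by `reset_index(name=...)` sets both column names explicitly, so the result does not depend on how a given pandas version names the output of `value_counts`.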
l = ['Robert De Niro', 'Tom Cruise']  # actors of interest
# split cast on ', ' (so names carry no leading space) and explode to one actor per row
df = df.assign(cast=df['cast'].str.split(', ')).explode('cast')
# keep rows matching the listed actors, group by cast, take the group size, and rename the columns
df[df['cast'].str.contains("|".join(l))].groupby('cast').size().reset_index().rename(columns={'cast': 'Name', 0: 'Count'})
Name Count
0 Robert De Niro 2
1 Tom Cruise 2
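One detail worth noting: `str.contains` with a `"|"`-joined pattern does regular-expression substring matching, so a short name could also match longer names that contain it. The sketch below is a possible variant, not the answerer's code, that uses `isin` for exact matches instead. The `wanted` list is a hypothetical name introduced for illustration, and the sample frame is rebuilt from the question's table.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "Title": ["Movie1", "Movie2", "Movie3"],
    "cast": [
        "Robert De Niro, Al Pacino, Tarantino",
        "Tom Hanks, Robert De Niro, Tom Cruise",
        "Tom Cruise, Zendaya, Seth Rogen",
    ],
})

wanted = ["Robert De Niro", "Tom Cruise"]  # hypothetical list of names to keep

# One row per (title, actor) pair.
exploded = df.assign(cast=df["cast"].str.split(", ")).explode("cast")

out = (
    exploded[exploded["cast"].isin(wanted)]   # exact-name match, no regex involved
    .groupby("cast")
    .size()                                   # number of titles per actor
    .reset_index(name="Count")
    .rename(columns={"cast": "Name"})
)
print(out)
```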
You could use collections.Counter to get the counts of the actors after splitting the strings:
from collections import Counter
pd.DataFrame(Counter(df.cast.str.split(", ").sum()).items(),
columns = ['Name', 'Count'])
Name Count
0 Robert De Niro 2
1 Al Pacino 1
2 Tarantino 1
3 Tom Hanks 1
4 Tom Cruise 2
5 Zendaya 1
6 Seth Rogen 1
If you care about speed and have a lot of data, you could do all of the processing in plain Python and then rebuild the DataFrame:
from itertools import chain
pd.DataFrame(Counter(chain.from_iterable(ent.split(", ")
                                         for ent in df.cast)).items(),
             columns=['Name', 'Count'])
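The plain-Python route can also be sketched without pandas at all, which makes the counting step easy to test in isolation. The helper name `count_cast_members` and the handling of missing or empty entries are assumptions made for this illustration; the input list mirrors the sample `cast` column from the question.

```python
from collections import Counter
from itertools import chain


def count_cast_members(cast_strings):
    """Count how often each name appears across comma-separated cast strings."""
    names = chain.from_iterable(
        (name.strip() for name in entry.split(","))
        for entry in cast_strings
        if isinstance(entry, str)  # assumption: skip missing values (None, NaN)
    )
    # Ignore empty fragments, e.g. from a trailing comma.
    return Counter(name for name in names if name)


# Values taken from the question's sample table, plus one missing entry.
cast_column = [
    "Robert De Niro, Al Pacino, Tarantino",
    "Tom Hanks, Robert De Niro, Tom Cruise",
    "Tom Cruise, Zendaya, Seth Rogen",
    None,
]

counts = count_cast_members(cast_column)
top_two = counts.most_common(2)  # the two most frequent names with their counts
print(top_two)
```

The resulting `Counter` can be passed to `pd.DataFrame(counts.items(), columns=['Name', 'Count'])` in the same way as in the answer above.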