Efficient way to group pandas dataframe rows by a list of tags in a column

Given a dataframe like:

import pandas as pd

df = pd.DataFrame(
        {
            'Movie':
            [
                'Star Trek',
                'Harry Potter',
                'Bohemian Rhapsody',
                'The Imitation Game',
                'The Avengers'
            ],
            'Genre':
            [
                'sci-fi; fiction',
                'fantasy; fiction; magic',
                'biography; drama; music',
                'biography; drama; thriller',
                'action; adventure; sci-fi'
            ]
        }
)

I'd like to group by the individual tags in the 'Genre' column and collect the movies as lists like:

                                                 0
magic                               [Harry Potter]
sci-fi                   [Star Trek, The Avengers]
fiction                  [Star Trek, Harry Potter]
drama      [Bohemian Rhapsody, The Imitation Game]
fantasy                             [Harry Potter]
music                          [Bohemian Rhapsody]
thriller                      [The Imitation Game]
action                              [The Avengers]
biography  [Bohemian Rhapsody, The Imitation Game]
adventure                           [The Avengers]

My current code works, but I'd like to know whether there is a more efficient way to do this, for example:

  • not needing to convert between list, dataframe and dictionary,
  • not needing to use a for loop (perhaps something like groupby).

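# Split each 'Genre' string into a list of tags
genre = df['Genre'].apply(lambda x: str(x).split("; ")).tolist()
movie = df['Movie'].tolist()

# Map each tag to the list of movies carrying it
data = dict()
for m, genres in zip(movie, genre):
    for g in genres:
        try:
            g_ = data[g]
        except KeyError:
            data[g] = [m]
        else:
            g_.append(m)

# Wrap each list so that from_dict stores it in a single cell
for key, value in data.items():
    data[key] = [data[key]]

output = pd.DataFrame.from_dict(data, orient='index')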
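
For comparison, the loop above could be tidied with collections.defaultdict, which removes both the try/except and the second wrapping pass. This is only a sketch of the same loop-based idea, not a vectorized solution; the name movies_by_genre and the column label 'Movie' on the result are illustrative choices, not part of the original code.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    'Movie': ['Star Trek', 'Harry Potter', 'Bohemian Rhapsody',
              'The Imitation Game', 'The Avengers'],
    'Genre': ['sci-fi; fiction', 'fantasy; fiction; magic',
              'biography; drama; music', 'biography; drama; thriller',
              'action; adventure; sci-fi'],
})

# Missing keys start out as an empty list, so no try/except is needed
movies_by_genre = defaultdict(list)
for movie, genres in zip(df['Movie'], df['Genre'].str.split('; ')):
    for genre in genres:
        movies_by_genre[genre].append(movie)

# A Series built from a dict of lists keeps each list in a single cell
output = pd.Series(dict(movies_by_genre)).to_frame(name='Movie')
print(output)
```

The rows come out in the order in which each tag was first seen, since a dict preserves insertion order.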

It's easier if we first split each genre string into a list of tags, then explode that column so that every tag gets its own row:

df['Genre'] = df.Genre.str.split('; ')
df.explode('Genre').groupby('Genre')['Movie'].apply(list)

Output:

action                                [The Avengers]
adventure                             [The Avengers]
biography    [Bohemian Rhapsody, The Imitation Game]
drama        [Bohemian Rhapsody, The Imitation Game]
fantasy                               [Harry Potter]
fiction                    [Star Trek, Harry Potter]
magic                                 [Harry Potter]
music                            [Bohemian Rhapsody]
sci-fi                     [Star Trek, The Avengers]
thriller                        [The Imitation Game]
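
As a self-contained sketch of the same split / explode / groupby idea, the version below builds the split column with assign so that the original df['Genre'] strings are left untouched, and converts the result to a one-column dataframe at the end. The variable names grouped and output are illustrative, and agg(list) is used here in place of apply(list); it assumes a pandas version that provides DataFrame.explode.

```python
import pandas as pd

df = pd.DataFrame({
    'Movie': ['Star Trek', 'Harry Potter', 'Bohemian Rhapsody',
              'The Imitation Game', 'The Avengers'],
    'Genre': ['sci-fi; fiction', 'fantasy; fiction; magic',
              'biography; drama; music', 'biography; drama; thriller',
              'action; adventure; sci-fi'],
})

grouped = (
    df.assign(Genre=df['Genre'].str.split('; '))  # list of tags per row, on a copy
      .explode('Genre')                           # one row per (movie, tag) pair
      .groupby('Genre')['Movie']
      .agg(list)                                  # collect the movies for each tag
)

# Optional: turn the Series into a one-column dataframe
output = grouped.to_frame()
print(output)
```

Because groupby sorts its keys by default, the tags come out in alphabetical order, unlike the dictionary-based loop, which keeps first-seen order.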
