Counting elements in a list of tuples with added weight per item

I have a list of tuples:

for i, item in enumerate(tags_and_weights, start=1):
    tags = item[0]     # list of tags for one song
    weight = item[1]   # number of times that song occurs
    print(i, item)

which prints:

1 (['alternative country', 'alternative pop', 'alternative rock', 'art rock', 'brill building pop', 'country rock', 'dance rock', 'experimental', 'folk', 'folk rock', 'garage rock', 'gbvfi', 'indie rock', 'jangle pop', 'lo-fi', 'melancholia', 'noise pop', 'post-punk', 'power pop', 'protopunk', 'psychedelic rock', 'pub rock', 'rock', 'roots rock', 'slow core'], 3)
2 (['funk', 'soul'], 4)
3 (['folk-pop', 'new americana'], 2)
4 ([], 4)
5 (['alternative pop', 'boston rock', 'lilith', 'melancholia'], 2)
6 (['acoustic pop', 'chamber pop', 'folk-pop', 'indie folk', 'indie pop', 'modern rock', 'neo mellow', 'new americana', 'stomp and holler'], 7)
7 (['slow core'], 3)
8 (['alternative rock', 'art rock', 'britpop', 'dance rock', 'electronic', 'madchester', 'new romantic', 'new wave', 'new wave pop', 'permanent wave', 'post-punk', 'rock', 'synthpop', 'uk post-punk'], 4)
9 (['funk', 'neo soul', 'soul'], 6)
10 (['blues-rock', 'classic rock', 'psychedelic rock', 'rock'], 2)

item[0] holds the tags of one song (a song can have many tags associated with it).

item[1] is the number of times that song occurs.

However, I need the total count by tag, not by song.

I know I can isolate the flattened tags in a list, like so:

def flatten(nested):
    # Parameter renamed from `list` so it no longer shadows the built-in.
    for sublist in nested:
        for item in sublist:
            yield item

only_tags = [i[0] for i in tags_and_weights]
tags = list(flatten(only_tags))

and then, with pandas, quickly count them:

import pandas as pd
pd.Series(tags).value_counts()

but then I lose track of each tag's weight, and the total tag counts are misrepresented.
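To make the discrepancy concrete, here is a minimal sketch on a three-row subset of the data above. The names `sample`, `unweighted`, and `wanted_funk` are only for illustration, and `collections.Counter` stands in for `value_counts` so the snippet needs no third-party imports.

```python
from collections import Counter

# Three rows taken from the list above, in the same (tags, weight) shape.
sample = [(['funk', 'soul'], 4), ([], 4), (['funk', 'neo soul', 'soul'], 6)]

# Flattening and counting: every tag counts once per song it appears on.
flat = [tag for tag_list, _ in sample for tag in tag_list]
unweighted = Counter(flat)

# What I actually want: every tag counts `weight` times per song.
wanted_funk = sum(weight for tag_list, weight in sample if 'funk' in tag_list)

print(unweighted['funk'])  # 2
print(wanted_funk)         # 10
```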

Considering I'll be doing these calculations with much bigger lists, what is the most efficient way to count all tags while keeping track of each song's weight, so that every tag occurrence is multiplied by that weight to give the final count by tag?

You can try:

l = [(['alternative country', 'alternative pop', 'alternative rock', 'art rock', 'brill building pop', 'country rock', 'dance rock', 'experimental', 'folk', 'folk rock', 'garage rock', 'gbvfi', 'indie rock', 'jangle pop', 'lo-fi', 'melancholia', 'noise pop', 'post-punk', 'power pop', 'protopunk', 'psychedelic rock', 'pub rock', 'rock', 'roots rock', 'slow core'], 3)
,(['funk', 'soul'], 4)
,(['folk-pop', 'new americana'], 2)
,([], 4)
,(['alternative pop', 'boston rock', 'lilith', 'melancholia'], 2)
,(['acoustic pop', 'chamber pop', 'folk-pop', 'indie folk', 'indie pop', 'modern rock', 'neo mellow', 'new americana', 'stomp and holler'], 7)
,(['slow core'], 3)
,(['alternative rock', 'art rock', 'britpop', 'dance rock', 'electronic', 'madchester', 'new romantic', 'new wave', 'new wave pop', 'permanent wave', 'post-punk', 'rock', 'synthpop', 'uk post-punk'], 4)
,(['funk', 'neo soul', 'soul'], 6)
,(['blues-rock', 'classic rock', 'psychedelic rock', 'rock'], 2)]

tags, counts = zip(*l)

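# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent.
(pd.concat([pd.Series(counts[i], index=tags[i]) for i in range(len(tags))])
   .groupby(level=0).sum()
   .sort_values(ascending=False))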

This uses a list comprehension with pd.concat, followed by a sum over the tag index, after unzipping your list of tuples into two sequences.

Output:

funk                   10
soul                   10
rock                    9
folk-pop                9
new americana           9
acoustic pop            7
indie folk              7
post-punk               7
dance rock              7
art rock                7
alternative rock        7
chamber pop             7
stomp and holler        7
neo mellow              7
modern rock             7
indie pop               7
slow core               6
neo soul                6
alternative pop         5
melancholia             5
psychedelic rock        5
britpop                 4
permanent wave          4
uk post-punk            4
synthpop                4
new wave pop            4
new wave                4
new romantic            4
madchester              4
electronic              4
brill building pop      3
gbvfi                   3
country rock            3
experimental            3
folk                    3
folk rock               3
garage rock             3
alternative country     3
indie rock              3
jangle pop              3
lo-fi                   3
noise pop               3
power pop               3
protopunk               3
pub rock                3
roots rock              3
blues-rock              2
boston rock             2
lilith                  2
classic rock            2
dtype: int64
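For comparison, the same weighted totals can be computed without pandas. The following is a minimal sketch using `collections.Counter` from the standard library; the function name `weighted_tag_counts` and the small `sample` list are assumptions made for illustration, not part of the original answer, and no claim is made here about which approach is faster on large inputs.

```python
from collections import Counter


def weighted_tag_counts(tags_and_weights):
    """Add each song's weight to every tag that song carries."""
    totals = Counter()
    for tag_list, weight in tags_and_weights:
        for tag in tag_list:
            totals[tag] += weight
    return totals


# Hypothetical sample: a few rows in the same shape as the question's data.
sample = [
    (['funk', 'soul'], 4),
    (['folk-pop', 'new americana'], 2),
    ([], 4),
    (['slow core'], 3),
    (['funk', 'neo soul', 'soul'], 6),
]

totals = weighted_tag_counts(sample)

# most_common() lists tags from the highest total to the lowest.
for tag, total in totals.most_common(3):
    print(tag, total)
```

Songs with an empty tag list simply contribute nothing, so they need no special handling.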

Assuming that you created a DataFrame:

d = [(['alternative country', ... # Your data
df = pd.DataFrame(data=d, columns=['tags', 'weight'])

one possible solution, using pure pandas without any list comprehensions, is as follows:

df.tags.apply(pd.Series).stack().reset_index(level=1, drop=True)\
    .rename('tag').to_frame().join(df.weight).groupby('tag').sum()\
    .sort_values(['weight', 'tag'], ascending=[False, True])

For learning purposes, you can run the consecutive steps as separate operations and look at each intermediate result.
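As a rough sketch of what that could look like, the chain is split below into named steps on a small three-row sample. The intermediate names (`wide`, `stacked`, `tag_col`, `joined`, `totals`, `result`) and the sample data are assumptions for illustration only.

```python
import pandas as pd

# Small hypothetical sample; column names follow the DataFrame above.
d = [(['funk', 'soul'], 4), ([], 4), (['funk', 'neo soul', 'soul'], 6)]
df = pd.DataFrame(data=d, columns=['tags', 'weight'])

# Step 1: one column per tag position; shorter lists are padded with NaN.
wide = df.tags.apply(pd.Series)

# Step 2: move the columns into the index, giving one tag per (row, position).
stacked = wide.stack()

# Step 3: drop the position level so the index is the original row number.
tag_col = stacked.reset_index(level=1, drop=True).rename('tag')

# Step 4: attach each row's weight by joining on the row number.
joined = tag_col.to_frame().join(df.weight)

# Step 5: sum the weights per tag (groupby skips missing tags by default).
totals = joined.groupby('tag').sum()

# Step 6: highest weight first, ties broken alphabetically by tag.
result = totals.sort_values(['weight', 'tag'], ascending=[False, True])
print(result)
```

Printing `wide`, `stacked`, and `joined` one at a time shows how the list column is gradually turned into one row per tag.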

One possible advantage is that tags with the same weight are sorted alphabetically within their group.
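If your pandas version is 0.25 or newer, `DataFrame.explode` may offer a shorter route to the same totals. This is a sketch of an alternative, not the chain shown above; the sample data and the name `totals` are assumptions for illustration.

```python
import pandas as pd

# Small hypothetical sample in the same shape as the DataFrame above.
d = [(['funk', 'soul'], 4), ([], 4), (['funk', 'neo soul', 'soul'], 6)]
df = pd.DataFrame(data=d, columns=['tags', 'weight'])

totals = (
    df.explode('tags')            # one row per (tag, weight); empty lists become NaN
      .dropna(subset=['tags'])    # discard songs that had no tags
      .groupby('tags')['weight']
      .sum()
      .sort_values(ascending=False)
)
print(totals)
```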
