
Identifying the most common words from a specific column, but only for the top 10 songs (another column)

I am having some trouble with this code. I am supposed to retrieve the 20 most common words from the top 10 songs of each year (1965-2015). There is a rank column, so I think I can identify the top 10 with rank <= 10, but I am lost on how to even begin. This is what I have so far; it does not yet restrict the count to the top 10 ranked songs. The 20 most common words should come from the lyrics column (column index 4).

import collections
import csv
import re

words = re.findall(r'\w+', open('billboard_songs.csv').read().lower())
reader = csv.reader(words, delimiter=',')
csvrow = [row[4] for row in reader]
most_common = collections.Counter(words[4]).most_common(20)
print(most_common)

The first lines of my file are as follows:

"Rank","Song","Artist","Year","Lyrics","Source"   
1,"wooly bully","sam the sham and the pharaohs",1965,"sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs .....,3   

When the rank reaches 100, it starts again at 1 for the following year, and so on.

You can use csv.DictReader to parse the file and build a usable Python list of dictionaries from it. Then you can use list comprehensions and itertools.groupby() to extract the song information you need. Finally, you can use collections.Counter to find the most common words in the songs.

#!/usr/bin/env python

import collections
import csv
import itertools


def analyze_songs(songs):
    # Grouping songs by year (groupby can only be used with a sorted list)
    sorted_songs = sorted(songs, key=lambda s: s["year"])
    for year, songs_iter in itertools.groupby(sorted_songs, key=lambda s: s["year"]):
        # Extract lyrics of top 10 songs
        top_10_songs_lyrics = [
            song["lyrics"] for song in songs_iter if song["rank"] <= 10
        ]

        # Join all lyrics together from all songs, and then split them into
        # a big list of words.
        top_10_songs_words = (" ".join(top_10_songs_lyrics)).split()

        # Using Counter to find the top 20 words
        most_common_words = collections.Counter(top_10_songs_words).most_common(20)

        print(f"Year {year}, most common words: {most_common_words}")


with open("billboard_songs.csv") as songs_file:
    reader = csv.DictReader(songs_file)

    # Transform the entries to a simpler format with appropriate types
    songs = [
        {
            "rank": int(row["Rank"]),
            "song": row["Song"],
            "artist": row["Artist"],
            "year": int(row["Year"]),
            "lyrics": row["Lyrics"],
            "source": row["Source"],
        }
        for row in reader
    ]

analyze_songs(songs)
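
As a variation on the code above, the sort-then-groupby step could be replaced by one Counter per year, filled in a single pass. This is only a sketch under assumptions: the helper name `top_words_by_year`, its `max_rank` and `n_words` parameters, and the sample rows are made up for illustration, and it expects the same dictionary format (`rank`, `year`, `lyrics`) built above.

```python
import collections


def top_words_by_year(songs, max_rank=10, n_words=20):
    """Return {year: [(word, count), ...]} for songs ranked <= max_rank.

    Hypothetical helper; assumes each song is a dict with integer
    "rank" and "year" values and a string "lyrics" value.
    """
    counters = collections.defaultdict(collections.Counter)
    for song in songs:
        if song["rank"] <= max_rank:
            # The sample lyrics look lowercase and punctuation-free,
            # so a plain split() is assumed to be enough here.
            counters[song["year"]].update(song["lyrics"].split())
    return {
        year: counter.most_common(n_words)
        for year, counter in sorted(counters.items())
    }


# Made-up rows for illustration only
sample_songs = [
    {"rank": 1, "year": 1965, "lyrics": "wooly bully wooly bully"},
    {"rank": 11, "year": 1965, "lyrics": "ignored ignored ignored"},
    {"rank": 2, "year": 1966, "lyrics": "la la la hey"},
]

result = top_words_by_year(sample_songs)
for year, words in result.items():
    print(f"Year {year}, most common words: {words}")
```

Because the counters are keyed by year, the input does not need to be sorted first; the years are only sorted when the result is built.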

In this answer, I assumed the following format for billboard_songs.csv:

"Rank","Song","Artist","Year","Lyrics","Source"
1,"wooly bully","sam the sham and the pharaohs",1965,"sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs","Source Evian"

I'm assuming the dataset covers 1965 to 2015, as explained in the question. If not, the list of songs should first be filtered accordingly.
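
If such filtering is needed, it might look like the sketch below. The function name `filter_by_year`, its parameters, and the sample rows are assumptions for illustration; the songs are expected to be dictionaries with an integer `year` value, as in the code above.

```python
def filter_by_year(songs, first_year=1965, last_year=2015):
    """Keep only songs whose year lies in [first_year, last_year].

    Hypothetical helper; both bounds are inclusive.
    """
    return [s for s in songs if first_year <= s["year"] <= last_year]


# Made-up rows for illustration only
sample_songs = [
    {"rank": 1, "year": 1964, "lyrics": "too early"},
    {"rank": 1, "year": 1965, "lyrics": "wooly bully"},
    {"rank": 1, "year": 2016, "lyrics": "too late"},
]

filtered = filter_by_year(sample_songs)
print([s["year"] for s in filtered])
```

With the script above, the call would become analyze_songs(filter_by_year(songs)).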

