[英]Identifying the most common words from a specific columns but only top 10 songs (another column)
I am having some trouble with this code.我在使用这段代码时遇到了一些麻烦。 I am supposed to retrieve the 20 most common words from the top 10 songs of each year (1965-2015) there is a rank so I feel like I can identify the top 10 with rank <= 10. But I am just so lost on how to even begin.我应该从每年(1965-2015)的前 10 首歌曲中检索 20 个最常见的单词有一个排名所以我觉得我可以识别排名 <= 10 的前 10 名。但我只是迷失了如何开始。 This is what I have so far.这是我到目前为止所拥有的。 I have not included the top 10 ranked songs yet.我还没有收录排名前 10 位的歌曲。 Also, the 20 most common words are coming from the lyrics column (which is 4)此外,最常用的 20 个词来自歌词栏(即 4)
import collections
import csv
import re
words = re.findall(r'\w+', open('billboard_songs.csv').read().lower())
reader = csv.reader(words, delimiter=',')
csvrow = [row[4] for row in reader]
most_common = collections.Counter(words[4]).most_common(20)
print(most_common)
the first lines from my file are as follow:我文件的第一行如下:
"Rank","Song","Artist","Year","Lyrics","Source"
1,"wooly bully","sam the sham and the pharaohs",1965,"sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs .....,3
when it gets to 100 (rank) it starts again at 1 for the following year and etc.当它达到 100(排名)时,它会在下一年再次从 1 开始,依此类推。
You can use csv.DictReader
to parse the file and get a usable Python list of dictionaries out of it.您可以使用csv.DictReader
来解析文件并从中获取可用的 Python 字典列表。 Then, you can use for-comprehensions and itertools.groupby()
to extract the song information you need.然后,您可以使用 for-comprehensions 和itertools.groupby()
来提取您需要的歌曲信息。 Finally, you can use collections.Counter
to find the most common words in the songs.最后,您可以使用collections.Counter
来查找歌曲中最常用的单词。
#!/usr/bin/env python
import collections
import csv
import itertools
def analyze_songs(songs):
# Grouping songs by year (groupby can only be used with a sorted list)
sorted_songs = sorted(songs, key=lambda s: s["year"])
for year, songs_iter in itertools.groupby(sorted_songs, key=lambda s: s["year"]):
# Extract lyrics of top 10 songs
top_10_songs_lyrics = [
song["lyrics"] for song in songs_iter if song["rank"] <= 10
]
# Join all lyrics together from all songs, and then split them into
# a big list of words.
top_10_songs_words = (" ".join(top_10_songs_lyrics)).split()
# Using Counter to find the top 20 words
most_common_words = collections.Counter(top_10_songs_words).most_common(20)
print(f"Year {year}, most common words: {most_common_words}")
with open("billboard_songs.csv") as songs_file:
reader = csv.DictReader(songs_file)
# Transform the entries to a simpler format with appropriate types
songs = [
{
"rank": int(row["Rank"]),
"song": row["Song"],
"artist": row["Artist"],
"year": int(row["Year"]),
"lyrics": row["Lyrics"],
"source": row["Source"],
}
for row in reader
]
analyze_songs(songs)
In this answer, I assumed the following format for billboard_songs.csv
:在这个答案中,我假设billboard_songs.csv
的格式如下:
"Rank","Song","Artist","Year","Lyrics","Source"
1,"wooly bully","sam the sham and the pharaohs",1965,"sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs","Source Evian"
I'm assuming the dataset is from 1965 to 2015 as explained in the question.如问题中所述,我假设数据集是从 1965 年到 2015 年。 If not, the list of songs should first be filtered accordingly.如果不是,则应首先相应地过滤歌曲列表。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.