
How do I find the frequency of words in a CSV file?

Total beginner here. I have a CSV file of around 20k tweets that I scraped (about a specific topic). I want to get a list of the most popular hashtags and the most tagged accounts. When I looked for a solution online, most answers had code in which the words whose frequency is wanted are already known (i.e. find the frequency of the word "apple" in a string). However, I don't know what the most popular hashtags or tagged accounts are in this CSV, and I want to grab a list of the top few.

Basically what I want to do:

  1. For each tweet (row in the CSV), find whether "#" (or "@") appears.
  2. If a # or @ appears, grab the word/phrase following the "#" or "@".
  3. Count the frequency of the words/phrases that follow the #s or @s.
  4. Sort the list of #s or @s by how frequently they appear in the tweets' content.

For example, if the tweets were all about baseball, maybe most of the tweets would have #mlb or tag @Yankees.

Most popular hashtags:

  1. baseball
  2. mlb
  3. sports

Most tagged accounts:

  1. @mlb
  2. @baseball
  3. @yankees

One option I can think of is to use a dictionary. I would read in my CSV and then split it on spaces so I get individual words. Then I would check the first character of each word for # or @ and, depending on which one it is, use the respective dictionary. If it's a #, check whether the word already exists in the # dictionary; if it does, increment its value by 1, otherwise add it with a value of 1 (the same goes for the @ dictionary).
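The dictionary idea described above could be sketched roughly as follows. This is only an illustration: the function name `count_tagged_words`, the sample tweets, and the decision to lowercase every word are assumptions, not part of the question.

```python
def count_tagged_words(tweets):
    """Count words starting with '#' or '@' using two plain dictionaries."""
    hashtags = {}
    mentions = {}
    for tweet in tweets:
        for word in tweet.split():
            # Assumption: treat "#MLB" and "#mlb" as the same hashtag.
            word = word.lower()
            if len(word) < 2:
                continue  # skip a lone "#" or "@"
            if word[0] == "#":
                hashtags[word] = hashtags.get(word, 0) + 1
            elif word[0] == "@":
                mentions[word] = mentions.get(word, 0) + 1
    return hashtags, mentions


# Hypothetical sample tweets, made up for this sketch.
sample = [
    "Great game tonight #mlb @Yankees",
    "Opening day! #MLB #baseball",
    "@yankees win again #baseball #mlb",
]

hashtags, mentions = count_tagged_words(sample)

# Sort by count, highest first, to get the "top few".
top_hashtags = sorted(hashtags.items(), key=lambda kv: kv[1], reverse=True)
top_mentions = sorted(mentions.items(), key=lambda kv: kv[1], reverse=True)
print(top_hashtags)
print(top_mentions)
```

Note that splitting on whitespace keeps trailing punctuation attached (a word such as "#mlb!" would be counted separately from "#mlb") and treats "#life#thug" as a single word, so a regular expression is usually the more robust way to extract the tags.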

Here's sample code that outputs a JSON file with the counts of each hashtag and tag found in the tweets, split into hashtag counts and tag counts.

import pandas as pd
import re
import json


df = pd.read_csv("data.csv")
tweets = df["Tweet"]

counts = {"hashtags": [], "tags": []}
for tweet in tweets:
    # Extract every "#word" in the tweet.
    hashtags = re.findall(r"(#\w+)", tweet)
    for hashtag in hashtags:
        # Compare with == rather than "in": a substring test would treat
        # "#life" as already present once "#liferocks" is in the list.
        if any(hashtag == item["text"] for item in counts["hashtags"]):
            for item in counts["hashtags"]:
                if item["text"] == hashtag:
                    item["count"] += 1
        else:
            counts["hashtags"].append({"text": hashtag, "count": 1})

    # Extract every "@account" in the tweet and count it the same way.
    tags = re.findall(r"(@\w+)", tweet)
    for tag in tags:
        if any(tag == item["text"] for item in counts["tags"]):
            for item in counts["tags"]:
                if item["text"] == tag:
                    item["count"] += 1
        else:
            counts["tags"].append({"text": tag, "count": 1})


counts["hashtags"] = sorted(counts["hashtags"], key=lambda x: x["count"], reverse=True)
counts["tags"] = sorted(counts["tags"], key=lambda x: x["count"], reverse=True)


with open("counts.json", "w") as f:
    json.dump(counts, f, indent=4, sort_keys=True)

Sample data used:

Tweet
This is tweet 1 #life #thug @tweeter asdasd
This is tweet 2 #life #thug @tweeter qweqwe
This is tweet 3 #life#thug @tweeter asdasd @tweetking
This is tweet 4 #liferocks #life
This is tweet 5 #liferocks #earth
This is tweet 6 #liferocks #storm
This is tweet 7 #liferocks #fire
This is tweet 8 #nothing
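As a possible shorter variant of the same counting, `collections.Counter` from the standard library can replace the manual list-of-dicts bookkeeping. The sketch below runs on the sample data above held in a list; the variable names are illustrative only.

```python
import re
from collections import Counter

# The sample tweets from above, held in memory instead of read from data.csv.
tweets = [
    "This is tweet 1 #life #thug @tweeter asdasd",
    "This is tweet 2 #life #thug @tweeter qweqwe",
    "This is tweet 3 #life#thug @tweeter asdasd @tweetking",
    "This is tweet 4 #liferocks #life",
    "This is tweet 5 #liferocks #earth",
    "This is tweet 6 #liferocks #storm",
    "This is tweet 7 #liferocks #fire",
    "This is tweet 8 #nothing",
]

hashtag_counts = Counter()
mention_counts = Counter()
for tweet in tweets:
    # Counter.update adds one to the tally for every item in the list.
    hashtag_counts.update(re.findall(r"#\w+", tweet))
    mention_counts.update(re.findall(r"@\w+", tweet))

# most_common(n) returns the n highest (text, count) pairs.
print(hashtag_counts.most_common(3))
print(mention_counts.most_common(3))
```

A `Counter` is a dictionary subclass, so it can still be passed to `json.dump` if a JSON file is wanted.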

No need to use a dictionary or go through each row to do a tally or count. Pandas can do the counts and sums by column.

There are two ways you could do it, given the data set Tsingis used:

import pandas as pd

df = pd.DataFrame([
'This is tweet 1 #life #thug @tweeter asdasd',
'This is tweet 2 #life #thug @tweeter qweqwe',
'This is tweet 3 #life#thug @tweeter asdasd @tweetking',
'This is tweet 4 #liferocks #life',
'This is tweet 5 #liferocks #earth',
'This is tweet 6 #liferocks #storm',
'This is tweet 7 #liferocks #fire',
'This is tweet 8 #nothing'], columns=['Tweet'])

Now if we do:

print(df['Tweet'].str.count("#liferocks"))
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    0

It counts how many times the term shows up in each row, so it's just a matter of taking the .sum():

print(df['Tweet'].str.count("#liferocks").sum())
4

However, there is a slight problem with this if we do:

print(df['Tweet'].str.count("#life"))
0    1
1    1
2    1
3    2
4    1
5    1
6    1
7    0

Notice row index 3: it is counted twice, because '#life' is found 2 times there, once inside '#liferocks' and once as '#life' itself.

So we'll slightly modify the search/count to find full hashtags only: start at the hashtag and the specific word, and keep looking until you reach a space or another #. Then count those.
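To see the effect of the stricter match in isolation, here is a small sketch using only the standard `re` module. The helper names `count_plain` and `count_exact` are made up for this illustration, and the lookahead pattern is one possible way to express "followed by a space, another #, or the end of the tweet".

```python
import re

# Two rows from the sample data where the difference shows up.
row_2 = "This is tweet 3 #life#thug @tweeter asdasd @tweetking"
row_3 = "This is tweet 4 #liferocks #life"


def count_plain(word, text):
    # Plain substring-style match: "#life" also matches inside "#liferocks".
    return len(re.findall(re.escape(word), text))


def count_exact(word, text):
    # Only match when the next character is whitespace, '#', or the end of text.
    pattern = re.escape(word) + r"(?=[\s#]|$)"
    return len(re.findall(pattern, text))


print(count_plain("#life", row_3), count_exact("#life", row_3))
print(count_plain("#life", row_2), count_exact("#life", row_2))
```

The lookahead `(?=...)` checks the following character without consuming it, so a hashtag glued to the previous one, as in "#life#thug", is still counted.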

Code:

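import re

import pandas as pd

df = pd.read_csv("data.csv")

def count_exact_words(df, column_name, word):
    # Match the word only when it is followed by whitespace, another '#',
    # a period, or the end of the tweet. The lookahead removes the need to
    # append '.' to every tweet, which would modify the DataFrame on each call.
    pattern = re.escape(word) + r'(?=[\s#.]|$)'
    return word, int(df[column_name].str.count(pattern).sum())

print(count_exact_words(df, 'Tweet', '#liferocks'))
print(count_exact_words(df, 'Tweet', '#life'))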

Output:

('#liferocks', 4)
('#life', 4)
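The function above still needs the hashtag to be known in advance. When it is not, one possible pandas-only sketch is to extract every hashtag with `str.findall`, flatten the per-row lists with `explode`, and let `value_counts` do the tally. The DataFrame is built in memory here for illustration; with a real file, `pd.read_csv("data.csv")` would take its place.

```python
import pandas as pd

df = pd.DataFrame({"Tweet": [
    "This is tweet 1 #life #thug @tweeter asdasd",
    "This is tweet 2 #life #thug @tweeter qweqwe",
    "This is tweet 3 #life#thug @tweeter asdasd @tweetking",
    "This is tweet 4 #liferocks #life",
    "This is tweet 5 #liferocks #earth",
    "This is tweet 6 #liferocks #storm",
    "This is tweet 7 #liferocks #fire",
    "This is tweet 8 #nothing",
]})


def top_matches(series, pattern):
    """Return a Series of match -> count, sorted by count (highest first)."""
    return (
        series.str.findall(pattern)  # list of matches for each row
        .explode()                   # one match per row
        .dropna()                    # rows without any match become NaN
        .value_counts()
    )


hashtag_counts = top_matches(df["Tweet"], r"#\w+")
mention_counts = top_matches(df["Tweet"], r"@\w+")

print(hashtag_counts.head(3))
print(mention_counts.head(3))
```

Because `\w+` stops at the first non-word character, '#life' and '#liferocks' are kept apart without any extra handling.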
