Extract key words by topic

I have a structured dataset with columns 'text' and 'topic'. Someone has already run word embedding/topic modeling, so each row in 'text' is assigned a topic number (1-200). I would like to create a new data frame with the topic number and the top 5-10 keywords that represent that topic.

I've done this before, but I usually start from scratch and run an LDA model, then use the objects the LDA creates to find keywords per topic. That said, here I'm starting from a mid-point that my supervisor gave me, and it's throwing me off.

The data structure looks like this:

import pandas as pd
df = pd.DataFrame({'text': ['foo bar baz', 'blah bling', 'foo'],
                   'topic': [1, 2, 1]})

So would the plan be to create a bag of words, group by 'topic', and count the words? Or is there a keyword-extraction function and a group-by-column option in gensim or nltk that I don't know about?
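For the bag-of-words route, a minimal sketch needs nothing beyond pandas and the standard-library `Counter` (no gensim or nltk); the dict comprehension below iterates the grouped Series directly:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'text': ['foo bar baz', 'blah bling', 'foo'],
                   'topic': [1, 2, 1]})

# One Counter of token frequencies per topic.
counts = {topic: Counter(' '.join(texts).split())
          for topic, texts in df.groupby('topic')['text']}

print(counts[1])  # Counter({'foo': 2, 'bar': 1, 'baz': 1})
```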

I have created a dictionary whose keys are the topic numbers and whose values are the concatenated text strings for each topic.

d = dict()
for index, ser in df.iterrows():
    # Append each row's text to the running string for its topic
    topic = ser['topic']
    if topic not in d:
        d[topic] = ""
    d[topic] += ser['text'] + " "

print(d)
#Output
{1: 'foo bar baz foo ', 2: 'blah bling '}
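The same dictionary can be built without an explicit loop; one sketch using `groupby`:

```python
import pandas as pd

df = pd.DataFrame({'text': ['foo bar baz', 'blah bling', 'foo'],
                   'topic': [1, 2, 1]})

# Join all 'text' values per topic into one space-separated string.
d = df.groupby('topic')['text'].apply(' '.join).to_dict()
print(d)  # {1: 'foo bar baz foo', 2: 'blah bling'}
```

Note the only difference from the loop version is that there is no trailing space after each string.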

Then I used the `Counter` class from the collections module to get the word frequencies for each topic.

from collections import Counter
for key in d.keys():
    print(Counter(d[key].split()))

#Output
Counter({'foo': 2, 'baz': 1, 'bar': 1})
Counter({'blah': 1, 'bling': 1})

I think this works:

# Join all texts per topic into one document per topic
test = pd.DataFrame(df.groupby("topic")['text'].apply(' '.join))

# Rake and Metric live in the rake-nltk package, not in nltk itself
from rake_nltk import Rake, Metric

r = Rake(ranking_metric=Metric.DEGREE_TO_FREQUENCY_RATIO, language='english',
         min_length=1, max_length=4)

# Extract ranked keyphrases for one topic (e.g. topic 180)
r.extract_keywords_from_text(test.text[180])
r.get_ranked_phrases()

I just need to figure out how to loop through each topic and append the results to a data frame.
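One way to close that gap is the loop below. It uses the frequency counts from earlier as a stand-in for the RAKE scores so it runs self-contained; the `extract_keywords_from_text` / `get_ranked_phrases` calls would slot into the same loop body. `top_n` is an assumed parameter for the 5-10 keyword target:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'text': ['foo bar baz', 'blah bling', 'foo'],
                   'topic': [1, 2, 1]})

top_n = 5  # raise to 10 for more keywords per topic
rows = []
for topic, texts in df.groupby('topic')['text']:
    # Top-N most frequent words; swap in Rake's ranked phrases here if preferred.
    keywords = [w for w, _ in Counter(' '.join(texts).split()).most_common(top_n)]
    rows.append({'topic': topic, 'keywords': ', '.join(keywords)})

result = pd.DataFrame(rows)
print(result)
```

Building a list of row dicts and constructing the DataFrame once at the end avoids the cost of appending to a DataFrame inside the loop.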
