
python word grouping based on words before and after

I am trying to create groups of words. First I count all the words, then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10: each group consists of all the words that appear before and after the top word.

I have survey results stored in a pandas DataFrame structured like this:

Question_ID | Customer_ID | Answer
  1           234         Data is very important to use because ... 
  2           234         We value data since we need it ... 
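For reference, a minimal stand-in for this DataFrame (just the two rows shown, so the snippets below can be run as-is):

import pandas as pd

df = pd.DataFrame({
    'Question_ID': [1, 2],
    'Customer_ID': [234, 234],
    'Answer': ['Data is very important to use because ...',
               'We value data since we need it ...'],
})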

I also saved the answers column as a string.

I am using the following code to find 3 words before and after a word (I actually had to create a string out of the answers column):

import re

answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab first three words
    result = [' '.join(term) for term in substrs]  # combine the terms back into substrings
    print(result)
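For the two example rows above, this prints:

['is very important']
['We value', 'since we need']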

I have been manually creating groups of words - but is there a way of doing it in Python?

So based on the example shown above, the group with word counts would look like this:

group "data": 
              data : 2
              important: 1
              value: 1
              need:1

then when it goes through the whole file, there would be another group:

group "analytics:
              analyze: 5
              report: 7
              list: 10
              visualize: 16

The idea would be to get rid of "we", "to", "is" as well - but I can do it manually, if that's not possible.

Then I want to establish the 10 most used words (by word count) and create 10 groups with the words that appear in front of and behind those top 10 words.

We can use regex for this. We'll be using this regular expression

((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})

which you can test for yourself, to extract the three words before and after each occurrence of data.
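For example, running the pattern over the second example answer:

import re

pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
print(re.findall(pat, "We value data since we need it"))
# [('We value ', ' since we need')]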

First, let's remove all the words we don't like from the strings.

import re

# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    # \b anchors keep bad words from being deleted out of longer words
    pat = r'\b(?:{})\b'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
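A quick sanity check of the helper:

print(repr(remove_words("We value data since we need it", ['we', 'is', 'to'])))
# ' value data since  need it' -- the stray spaces are harmless,
# since the extraction pattern below skips over whitespace anyway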

Then we want to get the words that surround data in each line:

data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, s, flags=re.IGNORECASE)  # s is a single answer string

gives us a list of tuples of strings. We want to get a list of those strings after they are split.

from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))

That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string, and then pull all the resulting words into one big list.
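For instance, with the single-tuple res from the example sentence above:

from itertools import chain

res = [('We value ', ' since we need')]
words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
print(words)
# ['We', 'value', 'since', 'we', 'need']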

Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird-looking.

import re
from itertools import chain
from collections import Counter    

def remove_words(sentence, bad_words):
    # \b anchors, as above, avoid clipping matches inside longer words
    pat = r'\b(?:{})\b'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
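Run against just the two example answers from the question, c comes out roughly as:

Counter({'very': 1, 'important': 1, 'use': 1,
         'value': 1, 'since': 1, 'need': 1, 'it': 1})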

The nice thing about the regex we're using is that all of the complicated parts don't care about what word we're matching. With a slight change, we can make a format string

base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'

such that

base_pat.format('data') == data_pat

So, given some list of key_words that we want to collect information about:

import re
from itertools import chain
from collections import Counter    

def remove_words(sentence, bad_words):
    # \b anchors, as above, avoid clipping matches inside longer words
    pat = r'\b(?:{})\b'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)


bad_words = ['we', 'is', 'to']

sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

key_words = ['data', 'analytics']
d = {}

base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c

Now we have a dictionary d that maps keywords, like data and analytics, to Counter objects that map the words not on our blacklist to their counts in the vicinity of the associated keyword. Something like

d= {'data'      : Counter({ 'important' : 2,
                            'very'      : 3}),
    'analytics' : Counter({ 'boring'    : 5,
                            'sleep'     : 3})
   }

As to how we get the top 10 words, that's basically the thing Counter is best at.

key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
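And once d is populated, the top words around each keyword fall out the same way:

for keyword, counter in d.items():
    print(keyword, counter.most_common(10))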
