
python word grouping based on words before and after

I am trying to create groups of words. First I count all the words. Then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10. Each group consists of all the words that appear before and after the top word.

I have survey results stored in a pandas DataFrame structured like this:

Question_ID | Customer_ID | Answer
  1           234         Data is very important to use because ... 
  2           234         We value data since we need it ... 

I also saved the answers column as a string.

I am using the following code to find the 3 words before and after a word (I actually had to create a string out of the answers column):

import re

answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab first three terms
    result = [' '.join(term) for term in substrs]  # combine the terms back into substrings
    print(result)
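For the two example rows above, this prints something like:

['is very important']
['We value', 'since we need']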

I have been manually creating groups of words, but is there a way of doing it in Python?

So, based on the example shown above, the group with word counts would look like this:

group "data": 
              data : 2
              important: 1
              value: 1
              need:1

then when it goes through the whole file, there would be another group:

group "analytics:
              analyze: 5
              report: 7
              list: 10
              visualize: 16

The idea would be to get rid of "we", "to", and "is" as well, but I can do it manually if that's not possible.

Then I want to establish the 10 most used words (by word count) and create 10 groups with the words that appear in front of and behind those top 10 words.

We can use regex for this. We'll be using this regular expression:

((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})

to extract up to three words before and after each occurrence of data.
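For example, as a quick sanity check on the second survey answer:

import re

data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
# each match is a (words-before, words-after) tuple
print(re.findall(data_pat, 'We value data since we need it'))
# [('We value ', ' since we need')]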

First, let's remove all the words we don't like from the strings.

import re

# If you're processing a lot of sentences, it's probably wise to precompute
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    # the \b anchors make sure we only remove whole words, not matches
    # inside longer words (e.g. the 'is' and 'to' hiding in 'history')
    pat = r'\b(?:{})\b'.format('|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
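A quick usage example (with the word-boundary fix above):

bad_words = ['we', 'is', 'to']
print(remove_words('We value data since we need it', bad_words))
# ' value data since  need it' -- the leftover whitespace is harmless,
# since the search pattern allows arbitrary whitespace between words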

Then we want to get the words that surround data in each line:

data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, s, flags=re.IGNORECASE)  # s is one answer string

This gives us a list of tuples of strings. We want to get a list of those strings after they are split.

from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))

That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string, and then pull all the resulting words out of the lists they end up in into one big list.
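If the nested chain calls are hard to follow, an equivalent comprehension does the same flattening:

list_of_words = [word
                 for pair in res            # each match is a (before, after) tuple
                 for segment in pair        # each half is a string of words
                 for word in segment.split()]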

Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird looking.

import re
from itertools import chain
from collections import Counter    

def remove_words(sentence, bad_words):
    pat = r'\b(?:{})\b'.format('|'.join(bad_words))  # \b: only remove whole words
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)

The nice thing about the regex we're using is that all of the complicated parts don't care about what word we're using. With a slight change, we can make a format string

base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'

such that

base_pat.format('data') == data_pat
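One caveat worth noting: if a keyword could ever contain regex metacharacters, escape it before formatting it in (not necessary for plain words like data):

key_pat = base_pat.format(re.escape(keyword))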

So, given some list key_words of words we want to collect information about:

import re
from itertools import chain
from collections import Counter    

def remove_words(sentence, bad_words):
    pat = r'\b(?:{})\b'.format('|'.join(bad_words))  # \b: only remove whole words
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)


bad_words = ['we', 'is', 'to']

sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

key_words = ['data', 'analytics']
d = {}

base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c

Now we have a dictionary d that maps keywords, like data and analytics, to Counters that map words not on our blacklist to their counts in the vicinity of the associated keyword. Something like:

d = {'data'     : Counter({'important': 2,
                           'very'     : 3}),
     'analytics': Counter({'boring'   : 5,
                           'sleep'    : 3})
    }
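Each group's top words then fall straight out of its Counter; with the hypothetical counts above:

for keyword, counter in d.items():
    print(keyword, counter.most_common(2))
# data [('very', 3), ('important', 2)]
# analytics [('boring', 5), ('sleep', 3)]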

As to how we get the top 10 words, that's basically the thing Counter is best at.

key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
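Spelled out, that one-liner counts every word across all the cleaned sentences, takes the 10 most frequent (word, count) pairs, and transposes them so the first tuple holds just the words:

word_counts = Counter(w for sentence in sentence_list for w in sentence.split())
top_pairs = word_counts.most_common(10)  # [(word, count), ...], most frequent first
key_words, counts = zip(*top_pairs)      # key_words is a tuple of the 10 words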
