Counting all co-occurrences of a large list of nouns and verbs/adjectives within reviews

I have a dataframe that contains a large number of reviews, a large list of nouns (1000) and another large list of verbs/adjectives (1000).

Example dataframe and lists:

import pandas as pd

data = {'reviews':['Very professional operation. Room is very clean and comfortable',
                    'Daniel is the most amazing host! His place is extremely clean, and he provides everything you could possibly want (comfy bed, guidebooks & maps, mini-fridge, towels, even toiletries). He is extremely friendly and helpful.',
                    'The room is very quiet, and well decorated, very clean.',
                    'He provides the room with towels, tea, coffee and a wardrobe.',
                    'Daniel is a great host. Always recomendable.',
                    'My friend and I were very satisfied with our stay in his apartment.']}

df = pd.DataFrame(data)

nouns = ['place','Amsterdam','apartment','location','host','stay','city','room','everything','time','house',
         'area','home','’','center','restaurants','centre','Great','tram','très','minutes','walk','space','neighborhood',
         'à','station','bed','experience','hosts','Thank','bien']

verbs_adj = ['was','is','great','nice','had','clean','were','recommend','stay','are','good','perfect','comfortable',
             'have','easy','be','quiet','helpful','get','beautiful',"'s",'has','est','located','un','amazing','wonderful',]

I want to create a dictionary of dictionaries to store all the co-occurrences of nouns and verbs/adjectives in each review. For example, the review

'Very professional operation. Room is very clean and comfortable.'

should give

{'room': {'is': 1, 'clean': 1, 'comfortable': 1}}

I am using the following code:

from collections import Counter
from copy import deepcopy
from pprint import pprint

def count_co_occurences(reviews):
    # Iterate on each review and count
    occurences_per_review = {
        f"review_{i+1}": {
            noun: dict(Counter(review.lower().split(" ")))
            for noun in nouns
            if noun in review.lower()
        }
        for i, review in enumerate(reviews)
    }
    # Remove verb_adj not found in main list
    opr = deepcopy(occurences_per_review)
    for review, occurences in opr.items():
        for noun, counts in occurences.items():
            for verb_adj in counts.keys():
                if verb_adj not in verbs_adj:
                    del occurences_per_review[review][noun][verb_adj]
                    
    return occurences_per_review

pprint(count_co_occurences(data["reviews"]))

This works when the lists and the number of reviews are small, but my notebook crashes when the function is used on large lists or a large number of reviews. How can I modify the code to handle this?

I think you may need a couple of libraries to make your life easier. In this example I'm using nltk and collections, in addition to pandas of course:

import pandas as pd
import nltk
from collections import Counter

data = {'reviews':['Very professional operation. Room is very clean and comfortable',
                    'Daniel is the most amazing host! His place is extremely clean, and he provides everything you could possibly want (comfy bed, guidebooks & maps, mini-fridge, towels, even toiletries). He is extremely friendly and helpful.',
                    'The room is very quiet, and well decorated, very clean.',
                    'He provides the room with towels, tea, coffee and a wardrobe.',
                    'Daniel is a great host. Always recomendable.',
                    'My friend and I were very satisfied with our stay in his apartment.']}

df = pd.DataFrame(data)

nouns = ['place','Amsterdam','apartment','location','host','stay','city','room','everything','time','house',
         'area','home','’','center','restaurants','centre','Great','tram','très','minutes','walk','space','neighborhood',
         'à','station','bed','experience','hosts','Thank','bien']

verbs_adj = ['was','is','great','nice','had','clean','were','recommend','stay','are','good','perfect','comfortable',
             'have','easy','be','quiet','helpful','get','beautiful',"'s",'has','est','located','un','amazing','wonderful',]

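# Sets give constant-time membership tests, which matters with ~1000 words per list.
noun_set = set(nouns)
verb_adj_set = set(verbs_adj)

def buildict(x):
    occurdict = {}
    # word_tokenize needs NLTK's tokenizer data to be downloaded first
    # (nltk.download('punkt') or 'punkt_tab', depending on the NLTK version).
    tokens = nltk.word_tokenize(x)
    tokenslower = [token.lower() for token in tokens]
    allnouns = [word for word in tokenslower if word in noun_set]
    allverbs_adj = Counter(word for word in tokenslower if word in verb_adj_set)
    # Every noun found in the review gets that review's verb/adjective counts.
    for noun in allnouns:
        occurdict[noun] = dict(allverbs_adj)
    return occurdict

df['words'] = df['reviews'].apply(buildict)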

Output:

0   Very professional operation. Room is very clea...   {'room': {'is': 1, 'clean': 1, 'comfortable': 1}}
1   Daniel is the most amazing host! His place is ...   {'host': {'is': 3, 'amazing': 1, 'clean': 1, '...
2   The room is very quiet, and well decorated, ve...   {'room': {'is': 1, 'quiet': 1, 'clean': 1}}
3   He provides the room with towels, tea, coffee ...   {'room': {}}
4   Daniel is a great host. Always recomendable.    {'host': {'is': 1, 'great': 1}}
5   My friend and I were very satisfied with our s...   {'stay': {'were': 1, 'stay': 1}, 'apartment': ...
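
If what you need in the end is a single dictionary of dictionaries with totals across all reviews, rather than one dictionary per row, the same idea can be folded into one pass. The following is only a rough sketch under a few assumptions: the function name total_co_occurrences is made up for illustration, a simple regex stands in for nltk.word_tokenize (so tokens such as "'s" are not matched), and the word lists are assumed to be lowercase already. A noun that appears several times in one review is counted once for that review, as in the code above.

```python
import re
from collections import Counter, defaultdict

def total_co_occurrences(reviews, nouns, verbs_adj):
    """Sum noun -> verb/adjective counts over all reviews (hypothetical helper)."""
    noun_set = set(nouns)          # sets make each lookup constant time
    verb_adj_set = set(verbs_adj)
    totals = defaultdict(Counter)
    for review in reviews:
        # Assumption: a plain regex tokenizer instead of nltk.word_tokenize.
        tokens = re.findall(r"\w+", review.lower())
        found_nouns = {t for t in tokens if t in noun_set}
        verb_counts = Counter(t for t in tokens if t in verb_adj_set)
        # Add this review's verb/adjective counts to every noun it mentions.
        for noun in found_nouns:
            totals[noun].update(verb_counts)
    return {noun: dict(counts) for noun, counts in totals.items()}

sample_reviews = ['Room is very clean and comfortable',
                  'The room is very quiet, and very clean.',
                  'Daniel is a great host.']
result = total_co_occurrences(sample_reviews,
                              ['room', 'host'],
                              ['is', 'clean', 'comfortable', 'quiet', 'great'])
print(result)
```

Only words from the two lists are ever stored, so the result holds at most one counter per noun with at most one entry per verb/adjective, however many reviews are processed.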
