
Counting word frequency of each individual user

I am trying to generate word-frequency statistics for each user, based on the text of their reviews, like this:

user 1: word frequencies

user 2: word frequencies, and so on.

How can I do that?

Here I am trying to access each user's review, but it gives me an error.

Please suggest an approach and pseudocode.

import json
from pprint import pprint

file = open('/Users/mack/Downloads/WKA/task/reviews.json','r')
content = file.read()
file = json.loads(content)

for eid, txt in file["id"]["text"]:
    print(eid, txt)

The JSON file is large and looks like this:

[
    {
       "id": 1,
       "text": "Bought this over a month ago and everything came like advertise. I got the purple cover and it looks wonderful. The outlet works just fine and charges my kindle without a problem. I also bought it on sale so it was $20 cheaper. Best. Deal. Ever. Love my kindle paperwhite (love being able to read in the dark too!) Also makes reading at work much easier than a traditional book. Thanks Amazon."
    },
    {
       "id": 2,
       "text": "Why three stars? Skip the next two paragraphs. Purchased the bundle on Black Friday - great price. The device works as advertised and I'm enjoying it. However, the lighting (even on max) is underwhelming. The features are handy and easy to use (i.e. dictionary, highlighting, bookmark, etc.) The case is attractive and sturdy enough, but the magnetic closure is rather weak. I suspect the case would open easily if the device were dropped.. In retrospect, I probably would have been dollars ahead to purchase a less expensive case separately rather than bundling. The reason for the three (3) stars? The promoted $15 credit towards purchase of ebook(s). After two unsuccessful attempts to redeem the credit and visiting with an Amazon rep, it appears the credit only works for Amazon digital/published books and is NOT applicable to third party publisher/sellers such as HarperCollins, Random House, Simon and Schuster, Penguin, Tyndale, Scholastic, Thomas Nelson, etc. etc. After respectfully telling the rep that this promotion seems very misleading and asking where I could find a list of authors and/or books for which the credit is applicable, he could offer no such list or database. He suggested finding an author on the Amazon ebook list, clicking on a title, putting the book into the order box and then noting the publisher in the order box. If it didn't say Amazon, I would know the credit could not be applied. I have since located several of my favorite writers and pulled up many of their ebooks. As I expected, NONE were available for purchase with the credit. ALL were published by major publishing houses. NONE were published by Amazon digital. I cannot imagine any prolific author of note not being affiliated with major publishing houses - which leaves the enticing ebook credit pretty much useless to me. The language in the Terms and Conditions seems vague at best regarding this restriction. This lack of clarity gives the consumer little, if any, pause regarding the use of the credit. After trying to use it, I felt like I had been scammed. I would NOT recommend purchasing the bundle - even on special pricing days like Black Friday. I feel like I simply gave $15 to Amazon and got virtually nothing in return. If I had it to do over again, I definitely would purchase the Paperwhite. I also would buy the Amazon charger and probably a less expensive case. (Even though I suspect a 5watt iPhone charger would work perfectly, I would still purchase the Amazon charger. In the event the device became problematic, the charger would be on the invoice thereby suggesting the device had been properly charged and disallowing refusal to repair or replace due to improper charging.) The device has been wonderful to use, the case is okay, haven't had to use the charger yet (impressive), but the $15 ebook credit seems virtually worthless."
    }
]

Input: each id and its corresponding text, as in the JSON above.

Output: each id and the count of every word appearing in its text.

Say the parsed data is:

file = \
[
    {
       "id": 1,
       "text": "Bought this over a month ago and everything came like advertise. I got the purple cover and it looks wonderful. The outlet works just fine and charges my kindle without a problem. I also bought it on sale so it was $20 cheaper. Best. Deal. Ever. Love my kindle paperwhite (love being able to read in the dark too!) Also makes reading at work much easier than a traditional book. Thanks Amazon.",
    },
    {
       "id": 2,
       "text": "Why three stars? Skip the next two paragraphs. Purchased the bundle on Black Friday - great price. The device works as advertised and I'm enjoying it. However, the lighting (even on max) is underwhelming. The features are handy and easy to use (i.e. dictionary, highlighting, bookmark, etc.) The case is attractive and sturdy enough, but the magnetic closure is rather weak. I suspect the case would open easily if the device were dropped.. In retrospect, I probably would have been dollars ahead to purchase a less expensive case separately rather than bundling. The reason for the three (3) stars? The promoted $15 credit towards purchase of ebook(s). After two unsuccessful attempts to redeem the credit and visiting with an Amazon rep, it appears the credit only works for Amazon digital/published books and is NOT applicable to third party publisher/sellers such as HarperCollins, Random House, Simon and Schuster, Penguin, Tyndale, Scholastic, Thomas Nelson, etc. etc. After respectfully telling the rep that this promotion seems very misleading and asking where I could find a list of authors and/or books for which the credit is applicable, he could offer no such list or database. He suggested finding an author on the Amazon ebook list, clicking on a title, putting the book into the order box and then noting the publisher in the order box. If it didn't say Amazon, I would know the credit could not be applied. I have since located several of my favorite writers and pulled up many of their ebooks. As I expected, NONE were available for purchase with the credit. ALL were published by major publishing houses. NONE were published by Amazon digital. I cannot imagine any prolific author of note not being affiliated with major publishing houses - which leaves the enticing ebook credit pretty much useless to me. The language in the Terms and Conditions seems vague at best regarding this restriction. This lack of clarity gives the consumer little, if any, pause regarding the use of the credit. After trying to use it, I felt like I had been scammed. I would NOT recommend purchasing the bundle - even on special pricing days like Black Friday. I feel like I simply gave $15 to Amazon and got virtually nothing in return. If I had it to do over again, I definitely would purchase the Paperwhite. I also would buy the Amazon charger and probably a less expensive case. (Even though I suspect a 5watt iPhone charger would work perfectly, I would still purchase the Amazon charger. In the event the device became problematic, the charger would be on the invoice thereby suggesting the device had been properly charged and disallowing refusal to repair or replace due to improper charging.) The device has been wonderful to use, the case is okay, haven't had to use the charger yet (impressive), but the $15 ebook credit seems virtually worthless.",
    }
]
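
The literal above stands in for the parsed file. When reading from disk, note that json.loads on this data returns a list of dicts, so the question's file["id"]["text"] indexes a list with a string and raises a TypeError. A minimal sketch of loading and iterating the records follows; load_reviews and iter_reviews are hypothetical helper names, not part of the original post, and the file is assumed to be valid JSON (strict JSON allows neither trailing commas nor raw line breaks inside strings).

```python
import json


def load_reviews(path):
    """Return the parsed list of {"id": ..., "text": ...} records stored at path."""
    # The context manager closes the file even if parsing fails.
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def iter_reviews(reviews):
    """Yield (id, text) pairs from the parsed list of records."""
    for record in reviews:
        # Each record is a dict, so string keys are valid here.
        yield record["id"], record["text"]


# Small in-memory stand-in for the real file contents (illustrative data only).
raw = '[{"id": 1, "text": "Best. Deal. Ever."}, {"id": 2, "text": "Why three stars?"}]'
for eid, txt in iter_reviews(json.loads(raw)):
    print(eid, txt)
# 1 Best. Deal. Ever.
# 2 Why three stars?
```

With the real data, something like file = load_reviews('/Users/mack/Downloads/WKA/task/reviews.json') should produce the same list the approaches below iterate over.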

Dictionary

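count = {}
for user in file:
    # One inner dict of word -> occurrences per review id
    count[user['id']] = {}
    # split() breaks on whitespace, so punctuation stays attached to words
    for word in user['text'].split():
        count[user['id']][word] = count[user['id']].get(word, 0) + 1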

Output:

{1: {'work': 1, 'so': 1, 'like': 1, 'came': 1, 'and': 3, 'problem.': 1, 'over': 1, 'dark': 1, 'the': 2, 'just': 1, 'than': 1, 'Deal.': 1, 'being': 1, 'purple': 1, 'wonderful.': 1, 'reading': 1, 'my': 2, 'Also': 1, 'makes': 1, 'on': 1, 'Love': 1, '(love': 1, 'fine': 1, 'Ever.': 1, 'paperwhite': 1, 'Thanks': 1, 'to': 1, '$20': 1, 'bought': 1, 'book.': 1, 'at': 1, 'traditional': 1, 'read': 1, 'looks': 1, 'in': 1, 'cover': 1, 'kindle': 2, 'cheaper.': 1, 'too!)': 1, 'Best.': 1, 'works': 1, 'Amazon.': 1, 'The': 1, 'it': 3, 'easier': 1, 'this': 1, 'got': 1, 'sale': 1, 'outlet': 1, 'without': 1, 'also': 1, 'advertise.': 1, 'Bought': 1, 'much': 1, 'able': 1, 'everything': 1, 'I': 2, 'ago': 1, 'was': 1, 'a': 3, 'charges': 1, 'month': 1}, 2: {'repair': 1, 'many': 1, 'applied.': 1, 'noting': 1, 'respectfully': 1, 'expected,': 1, 'days': 1, 'several': 1, 'then': 1, 'best': 1, 'very': 1, 'being': 1, 'telling': 1, 'weak.': 1, 'clicking': 1, 'okay,': 1, 'any,': 1, 'got': 1, 'improper': 1, 'to': 12, 'trying': 1, 'use,': 1, 'if': 2, 'became': 1, 'closure': 1, 'is': 6, 'sturdy': 1, 'buy': 1, 'Nelson,': 1, 'features': 1, 'lighting': 1, 'After': 3, '(3)': 1, 'finding': 1, 'putting': 1, 'of': 7, 'unsuccessful': 1, 'say': 1, 'simply': 1, 'which': 2, 'device': 5, 'only': 1, 'attractive': 1, 'max)': 1, 'offer': 1, 'nothing': 1, 'lack': 1, 'Random': 1, 'pulled': 1, 'Paperwhite.': 1, 'this': 2, 'felt': 1, 'visiting': 1, 'appears': 1, 'publisher/sellers': 1, 'two': 2, 'ebooks.': 1, 'are': 1, 'major': 2, 'Tyndale,': 1, 'pretty': 1, 'clarity': 1, 'dollars': 1, 'Penguin,': 1, 'even': 1, 'enticing': 1, '(impressive),': 1, 'price.': 1, 'and': 13, 'over': 1, 'seems': 3, "didn't": 1, 'also': 1, 'order': 2, 'little,': 1, 'Amazon,': 1, 'reason': 1, 'have': 2, 'suggested': 1, 'digital.': 1, '(even': 1, 'redeem': 1, 'no': 1, 'pricing': 1, 'Simon': 1, 'pause': 1, 'cannot': 1, 'on': 6, 'publisher': 1, 'HarperCollins,': 1, 'yet': 1, 'Purchased': 1, 'consumer': 1, 'note': 1, 'attempts': 1, 'imagine': 1, 
'box': 1, 'suspect': 2, 'case.': 1, 'an': 2, 'author': 2, 'Skip': 1, 'much': 1, 'published': 2, 'charging.)': 1, 'be': 2, 'affiliated': 1, 'list,': 1, 'expensive': 2, 'digital/published': 1, 'leaves': 1, 'purchasing': 1, 'Why': 1, 'return.': 1, 'Conditions': 1, '5watt': 1, 'vague': 1, 'title,': 1, 'This': 1, 'If': 2, 'know': 1, 'do': 1, 'favorite': 1, 'invoice': 1, 'than': 1, 'Terms': 1, 'House,': 1, 'handy': 1, 'since': 1, 'In': 2, 'up': 1, 'charged': 1, 'definitely': 1, 'purchase': 5, 'like': 3, 'replace': 1, 'rep': 1, 'wonderful': 1, 'the': 35, 'enough,': 1, 'Friday': 1, 'find': 1, 'problematic,': 1, 'been': 4, 'applicable': 1, 'probably': 2, 'bundle': 2, 'open': 1, 'credit': 7, 'However,': 1, 'could': 3, 'paragraphs.': 1, 'As': 1, 'still': 1, 'but': 2, 'restriction.': 1, 'ahead': 1, 'NONE': 2, 'gave': 1, 'charger.': 1, 'language': 1, 'advertised': 1, 'database.': 1, 'again,': 1, 'bundling.': 1, 'dropped..': 1, 'work': 1, 'houses.': 1, 'and/or': 1, 'credit.': 2, 'authors': 1, 'great': 1, 'third': 1, 'he': 1, 'by': 2, 'has': 1, 'promotion': 1, 'dictionary,': 1, 'at': 1, 'works': 2, 'book': 1, 'though': 1, 'it': 3, 'useless': 1, 'it.': 1, 'writers': 1, 'refusal': 1, 'NOT': 2, 'as': 2, 'Schuster,': 1, 'less': 2, 'would': 9, 'I': 17, 'a': 5, 'their': 1, '(i.e.': 1, 'box.': 1, 'enjoying': 1, 'Amazon': 7, '$15': 3, 'separately': 1, 'it,': 1, 'promoted': 1, 'publishing': 2, 'with': 3, "haven't": 1, 'easy': 1, 'magnetic': 1, 'retrospect,': 1, 'ebook(s).': 1, 'Black': 2, 'special': 1, 'list': 2, 'scammed.': 1, 'charger': 4, 'rather': 2, 'located': 1, 'misleading': 1, 'asking': 1, '(Even': 1, 'feel': 1, 'Scholastic,': 1, 'such': 2, 'ebook': 3, 'into': 1, 'recommend': 1, 'Friday.': 1, 'towards': 1, 'Thomas': 1, 'easily': 1, 'gives': 1, 'properly': 1, 'case': 4, 'me.': 1, 'three': 2, 'etc.': 2, 'rep,': 1, 'next': 1, 'bookmark,': 1, 'etc.)': 1, 'my': 1, 'not': 2, 'were': 4, 'in': 3, 'suggesting': 1, 'disallowing': 1, 'iPhone': 1, 'party': 1, 'any': 1, 'where': 1, 
'perfectly,': 1, 'regarding': 2, 'applicable,': 1, 'underwhelming.': 1, '-': 3, 'virtually': 2, 'worthless.': 1, 'or': 2, 'had': 4, 'use': 4, 'highlighting,': 1, 'event': 1, 'He': 1, 'houses': 1, 'that': 1, 'for': 4, "I'm": 1, 'The': 7, 'available': 1, 'prolific': 1, 'stars?': 2, 'ALL': 1, 'thereby': 1, 'due': 1, 'books': 2}}

310 µs ± 272 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
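
The output above counts 'The' and 'the', or 'credit' and 'credit.', as different words, because split() only breaks on whitespace. If case and punctuation should be ignored, one possible variation is to normalize the tokens first. This is a sketch under assumptions: the function name word_frequencies and the token pattern are illustrative choices, not part of the original answer, and the pattern drops symbols such as "$" (so "$20" is counted as "20").

```python
import re

# Assumed token rule: runs of lowercase letters, digits and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def word_frequencies(reviews):
    """Map each review id to a {word: count} dict built from normalized tokens."""
    count = {}
    for user in reviews:
        freq = {}
        # Lowercase first so 'The' and 'the' share one key.
        for word in WORD_RE.findall(user["text"].lower()):
            freq[word] = freq.get(word, 0) + 1
        count[user["id"]] = freq
    return count


sample = [{"id": 1, "text": "The outlet works. The case works, too!"}]
print(word_frequencies(sample))
# {1: {'the': 2, 'outlet': 1, 'works': 2, 'case': 1, 'too': 1}}
```

Whether this normalization is wanted depends on the statistics being collected; the timing quoted above applies to the plain split() version only.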


collections.Counter

from collections import Counter

count = {}
for user in file:
    # A Counter returns 0 for missing keys, so += 1 needs no default
    count[user['id']] = Counter()
    for word in user['text'].split():
        count[user['id']][word] += 1

Output:

{1: Counter({'and': 3, 'it': 3, 'a': 3, 'my': 2, 'kindle': 2, 'I': 2, 'the': 2, 'charges': 1, 'dark': 1, 'reading': 1, 'purple': 1, 'being': 1, 'works': 1, 'outlet': 1, 'read': 1, 'too!)': 1, 'like': 1, 'wonderful.': 1, 'also': 1, 'The': 1, 'much': 1, 'sale': 1, 'paperwhite': 1, 'cover': 1, 'Thanks': 1, 'Best.': 1, 'came': 1, 'Deal.': 1, 'so': 1, 'Ever.': 1, 'ago': 1, 'advertise.': 1, '$20': 1, 'Amazon.': 1, 'bought': 1, 'problem.': 1, 'cheaper.': 1, 'got': 1, 'month': 1, 'work': 1, 'makes': 1, 'just': 1, 'than': 1, 'everything': 1, 'Also': 1, 'this': 1, 'fine': 1, 'able': 1, 'to': 1, 'without': 1, 'was': 1, 'in': 1, 'book.': 1, 'at': 1, 'Bought': 1, 'Love': 1, 'on': 1, 'over': 1, 'looks': 1, '(love': 1, 'traditional': 1, 'easier': 1}), 2: Counter({'the': 35, 'I': 17, 'and': 13, 'to': 12, 'would': 9, 'Amazon': 7, 'credit': 7, 'The': 7, 'of': 7, 'on': 6, 'is': 6, 'a': 5, 'device': 5, 'purchase': 5, 'use': 4, 'been': 4, 'charger': 4, 'case': 4, 'were': 4, 'for': 4, 'had': 4, 'like': 3, 'in': 3, 'it': 3, '-': 3, '$15': 3, 'ebook': 3, 'could': 3, 'seems': 3, 'with': 3, 'After': 3, 'published': 2, 'works': 2, 'two': 2, 'by': 2, 'books': 2, 'In': 2, 'rather': 2, 'or': 2, 'such': 2, 'not': 2, 'probably': 2, 'less': 2, 'be': 2, 'major': 2, 'author': 2, 'NOT': 2, 'which': 2, 'publishing': 2, 'etc.': 2, 'expensive': 2, 'NONE': 2, 'if': 2, 'bundle': 2, 'as': 2, 'have': 2, 'credit.': 2, 'virtually': 2, 'list': 2, 'three': 2, 'Black': 2, 'this': 2, 'an': 2, 'regarding': 2, 'stars?': 2, 'order': 2, 'If': 2, 'suspect': 2, 'but': 2, 'properly': 1, 'charging.)': 1, 'dollars': 1, 'underwhelming.': 1, 'located': 1, 'dropped..': 1, 'suggesting': 1, 'return.': 1, 'much': 1, 'Conditions': 1, 'charger.': 1, 'Scholastic,': 1, 'list,': 1, 'attempts': 1, 'note': 1, 'pause': 1, 'applicable,': 1, 'repair': 1, 'replace': 1, 'and/or': 1, 'box.': 1, 'He': 1, 'invoice': 1, 'clarity': 1, 'Thomas': 1, 'title,': 1, "I'm": 1, 'it,': 1, 'enticing': 1, 'separately': 1, 'event': 1, 'pulled': 1, 
'though': 1, 'Tyndale,': 1, 'several': 1, 'use,': 1, 'has': 1, 'noting': 1, 'promotion': 1, 'pretty': 1, 'suggested': 1, 'vague': 1, 'lack': 1, 'bundling.': 1, "haven't": 1, 'houses': 1, 'retrospect,': 1, 'clicking': 1, 'easy': 1, 'Amazon,': 1, 'Schuster,': 1, 'favorite': 1, 'reason': 1, 'many': 1, '(even': 1, 'applicable': 1, 'special': 1, 'iPhone': 1, 'prolific': 1, 'definitely': 1, 'my': 1, 'up': 1, 'wonderful': 1, 'are': 1, 'attractive': 1, 'case.': 1, 'it.': 1, 'redeem': 1, 'know': 1, 'digital/published': 1, 'great': 1, 'no': 1, 'any,': 1, 'As': 1, 'promoted': 1, 'respectfully': 1, 'rep': 1, 'telling': 1, 'ebooks.': 1, "didn't": 1, 'handy': 1, 'However,': 1, 'publisher/sellers': 1, 'disallowing': 1, 'price.': 1, 'perfectly,': 1, 'very': 1, 'worthless.': 1, 'into': 1, 'restriction.': 1, 'magnetic': 1, 'buy': 1, 'next': 1, 'HarperCollins,': 1, 'unsuccessful': 1, 'their': 1, 'find': 1, 'pricing': 1, 'Why': 1, 'language': 1, 'asking': 1, '(Even': 1, 'any': 1, 'imagine': 1, 'trying': 1, 'offer': 1, 'ebook(s).': 1, 'towards': 1, 'Random': 1, 'thereby': 1, 'Paperwhite.': 1, 'Simon': 1, 'third': 1, 'rep,': 1, 'Skip': 1, 'consumer': 1, 'finding': 1, 'affiliated': 1, 'cannot': 1, 'House,': 1, 'houses.': 1, 'say': 1, 'gave': 1, 'enjoying': 1, 'due': 1, 'etc.)': 1, '(impressive),': 1, 'publisher': 1, 'ALL': 1, 'became': 1, 'scammed.': 1, 'gives': 1, 'appears': 1, 'recommend': 1, 'improper': 1, 'problematic,': 1, 'Friday': 1, 'sturdy': 1, 'again,': 1, 'open': 1, 'expected,': 1, 'got': 1, 'dictionary,': 1, 'max)': 1, 'lighting': 1, 'Nelson,': 1, 'feel': 1, 'applied.': 1, 'yet': 1, 'party': 1, 'book': 1, 'enough,': 1, 'available': 1, 'purchasing': 1, 'okay,': 1, 'days': 1, 'bookmark,': 1, 'misleading': 1, 'where': 1, 'putting': 1, 'box': 1, '5watt': 1, 'Friday.': 1, 'felt': 1, 'ahead': 1, 'even': 1, 'authors': 1, 'leaves': 1, 'advertised': 1, 'easily': 1, 'visiting': 1, 'refusal': 1, 'me.': 1, 'Terms': 1, 'only': 1, 'digital.': 1, 'also': 1, 'he': 1, 'useless': 1, 'This': 1, 
'still': 1, 'then': 1, 'highlighting,': 1, 'do': 1, 'features': 1, 'Purchased': 1, 'closure': 1, 'database.': 1, 'Penguin,': 1, 'work': 1, 'best': 1, 'than': 1, 'paragraphs.': 1, 'since': 1, 'being': 1, 'that': 1, 'over': 1, 'charged': 1, 'nothing': 1, 'writers': 1, '(i.e.': 1, 'weak.': 1, 'at': 1, '(3)': 1, 'simply': 1, 'little,': 1})}

536 µs ± 858 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
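
A Counter can also be built directly from an iterable of words, which removes the inner loop. The sketch below shows that more compact form together with most_common for reading off the top words; count_words is a hypothetical helper name and the sample data is made up for illustration. It has not been timed here, so the figures above do not apply to it.

```python
from collections import Counter


def count_words(reviews):
    """Map each review id to a Counter of its whitespace-separated words."""
    # Counter(iterable) tallies every element in one call.
    return {user["id"]: Counter(user["text"].split()) for user in reviews}


sample = [
    {"id": 1, "text": "love my kindle love it"},
    {"id": 2, "text": "the case and the charger"},
]
counts = count_words(sample)
print(counts[1].most_common(1))  # [('love', 2)]
print(counts[2]["the"])          # 2
print(counts[2]["missing"])      # 0, a Counter returns 0 for absent keys
```

The result has the same shape as the Counter output above: a dict keyed by id whose values are Counter objects.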
