
Counting word frequency of each individual user

I am trying to generate word-frequency statistics for each user, based on the reviews they have written, like this:

user 1: word frequencies

user 2: word frequencies and so on...

How can I do that?

Here I am trying to access each user's review, but it gives me an error.

Please suggest an approach and pseudocode.

import json
from pprint import pprint

file = open('/Users/mack/Downloads/WKA/task/reviews.json','r')
content = file.read()
file = json.loads(content)

for eid, txt in file["id"]["text"]:
    print(eid, txt)

The file is a large JSON array like this:

[
    {
       "id": 1,
       "text": "Bought this over a month ago and everything came like advertise. I got the purple cover and it looks wonderful. The outlet works just fine and charges my kindle without a problem. I also bought it on sale so it was $20 cheaper. Best. Deal. Ever. Love my kindle paperwhite (love being able to read in the dark too!) Also makes reading at work much easier than a traditional book. Thanks Amazon.",
    },
    {
       "id": 2,
       "text": "Why three stars? Skip the next two paragraphs. Purchased the bundle on Black Friday - great price. The device works as advertised and I'm enjoying it. However, the lighting (even on max) is underwhelming. The features are handy and easy to use (i.e. dictionary, highlighting, bookmark, etc.) The case is attractive and sturdy enough, but the magnetic closure is rather weak. I suspect the case would open easily if the device were dropped.. In retrospect, I probably would have been dollars ahead to purchase a less expensive case separately rather than bundling. The reason for the three (3) stars? The promoted $15 credit towards purchase of ebook(s). After two unsuccessful attempts to redeem the credit and visiting with an Amazon rep, it appears the credit only works for Amazon digital/published books and is NOT applicable to third party publisher/sellers such as HarperCollins, Random House, Simon and Schuster, Penguin, Tyndale, Scholastic, Thomas Nelson, etc. etc. After respectfully telling the rep that this promotion seems very misleading and asking where I could find a list of authors and/or books for which the credit is applicable, he could offer no such list or database. He suggested finding an author on the Amazon ebook list, clicking on a title, putting the book into the order box and then noting the publisher in the order box. If it didn't say Amazon, I would know the credit could not be applied. I have since located several of my favorite writers and pulled up many of their ebooks. As I expected, NONE were available for purchase with the credit. ALL were published by major publishing houses. NONE were published by Amazon digital. I cannot imagine any prolific author of note not being affiliated with major publishing houses - which leaves the enticing ebook credit pretty much useless to me. The language in the Terms and Conditions seems vague at best regarding this restriction. 
This lack of clarity gives the consumer little, if any, pause regarding the use of the credit. After trying to use it, I felt like I had been scammed. I would NOT recommend purchasing the bundle - even on special pricing days like Black Friday. I feel like I simply gave $15 to Amazon and got virtually nothing in return. If I had it to do over again, I definitely would purchase the Paperwhite. I also would buy the Amazon charger and probably a less expensive case. (Even though I suspect a 5watt iPhone charger would work perfectly, I would still purchase the Amazon charger. In the event the device became problematic, the charger would be on the invoice thereby suggesting the device had been properly charged and disallowing refusal to repair or replace due to improper charging.) The device has been wonderful to use, the case is okay, haven't had to use the charger yet (impressive), but the $15 ebook credit seems virtually worthless.",
    }
]

Input: each id and its associated text, as in the JSON above

Output: each id and the count of every word appearing in its text

Say the loaded data is:

file = \
[
    {
       "id": 1,
       "text": "Bought this over a month ago and everything came like advertise. I got the purple cover and it looks wonderful. The outlet works just fine and charges my kindle without a problem. I also bought it on sale so it was $20 cheaper. Best. Deal. Ever. Love my kindle paperwhite (love being able to read in the dark too!) Also makes reading at work much easier than a traditional book. Thanks Amazon.",
    },
    {
       "id": 2,
       "text": "Why three stars? Skip the next two paragraphs. Purchased the bundle on Black Friday - great price. The device works as advertised and I'm enjoying it. However, the lighting (even on max) is underwhelming. The features are handy and easy to use (i.e. dictionary, highlighting, bookmark, etc.) The case is attractive and sturdy enough, but the magnetic closure is rather weak. I suspect the case would open easily if the device were dropped.. In retrospect, I probably would have been dollars ahead to purchase a less expensive case separately rather than bundling. The reason for the three (3) stars? The promoted $15 credit towards purchase of ebook(s). After two unsuccessful attempts to redeem the credit and visiting with an Amazon rep, it appears the credit only works for Amazon digital/published books and is NOT applicable to third party publisher/sellers such as HarperCollins, Random House, Simon and Schuster, Penguin, Tyndale, Scholastic, Thomas Nelson, etc. etc. After respectfully telling the rep that this promotion seems very misleading and asking where I could find a list of authors and/or books for which the credit is applicable, he could offer no such list or database. He suggested finding an author on the Amazon ebook list, clicking on a title, putting the book into the order box and then noting the publisher in the order box. If it didn't say Amazon, I would know the credit could not be applied. I have since located several of my favorite writers and pulled up many of their ebooks. As I expected, NONE were available for purchase with the credit. ALL were published by major publishing houses. NONE were published by Amazon digital. I cannot imagine any prolific author of note not being affiliated with major publishing houses - which leaves the enticing ebook credit pretty much useless to me. The language in the Terms and Conditions seems vague at best regarding this restriction. 
This lack of clarity gives the consumer little, if any, pause regarding the use of the credit. After trying to use it, I felt like I had been scammed. I would NOT recommend purchasing the bundle - even on special pricing days like Black Friday. I feel like I simply gave $15 to Amazon and got virtually nothing in return. If I had it to do over again, I definitely would purchase the Paperwhite. I also would buy the Amazon charger and probably a less expensive case. (Even though I suspect a 5watt iPhone charger would work perfectly, I would still purchase the Amazon charger. In the event the device became problematic, the charger would be on the invoice thereby suggesting the device had been properly charged and disallowing refusal to repair or replace due to improper charging.) The device has been wonderful to use, the case is okay, haven't had to use the charger yet (impressive), but the $15 ebook credit seems virtually worthless.",
    }
]
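In practice the list would come from the JSON file rather than a literal. The following is only a minimal sketch, assuming the file holds a top-level array of objects with "id" and "text" keys as in the question; the `raw` string is a made-up stand-in for the file contents. Because the top level is a list, `file["id"]["text"]` cannot work: a list is indexed by integers, so the loop has to go over the list and index each record by key. Note also that Python's `json` module rejects trailing commas like the ones after each "text" value in the sample, so if the real file contains them, parsing would fail before the loop is reached.

```python
import json

# Hypothetical stand-in for the contents of reviews.json.
# Strict JSON allows no trailing comma after the last key of an object.
raw = '''
[
    {"id": 1, "text": "Love my kindle"},
    {"id": 2, "text": "Why three stars?"}
]
'''

# For a real file, json.load(f) inside "with open(path) as f:" does the same.
reviews = json.loads(raw)  # top level is a list of dicts

# Each element is one record; index the dict by key, not the list.
pairs = [(review["id"], review["text"]) for review in reviews]
for eid, txt in pairs:
    print(eid, txt)
```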

Dictionary

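count = {}
for user in file:
    # One inner dict per id, mapping word -> number of occurrences
    count[user['id']] = {}
    for word in user['text'].split():
        # get() returns 0 the first time a word is seen
        count[user['id']][word] = count[user['id']].get(word, 0) + 1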

Output:

{1: {'work': 1, 'so': 1, 'like': 1, 'came': 1, 'and': 3, 'problem.': 1, 'over': 1, 'dark': 1, 'the': 2, 'just': 1, 'than': 1, 'Deal.': 1, 'being': 1, 'purple': 1, 'wonderful.': 1, 'reading': 1, 'my': 2, 'Also': 1, 'makes': 1, 'on': 1, 'Love': 1, '(love': 1, 'fine': 1, 'Ever.': 1, 'paperwhite': 1, 'Thanks': 1, 'to': 1, '$20': 1, 'bought': 1, 'book.': 1, 'at': 1, 'traditional': 1, 'read': 1, 'looks': 1, 'in': 1, 'cover': 1, 'kindle': 2, 'cheaper.': 1, 'too!)': 1, 'Best.': 1, 'works': 1, 'Amazon.': 1, 'The': 1, 'it': 3, 'easier': 1, 'this': 1, 'got': 1, 'sale': 1, 'outlet': 1, 'without': 1, 'also': 1, 'advertise.': 1, 'Bought': 1, 'much': 1, 'able': 1, 'everything': 1, 'I': 2, 'ago': 1, 'was': 1, 'a': 3, 'charges': 1, 'month': 1}, 2: {'repair': 1, 'many': 1, 'applied.': 1, 'noting': 1, 'respectfully': 1, 'expected,': 1, 'days': 1, 'several': 1, 'then': 1, 'best': 1, 'very': 1, 'being': 1, 'telling': 1, 'weak.': 1, 'clicking': 1, 'okay,': 1, 'any,': 1, 'got': 1, 'improper': 1, 'to': 12, 'trying': 1, 'use,': 1, 'if': 2, 'became': 1, 'closure': 1, 'is': 6, 'sturdy': 1, 'buy': 1, 'Nelson,': 1, 'features': 1, 'lighting': 1, 'After': 3, '(3)': 1, 'finding': 1, 'putting': 1, 'of': 7, 'unsuccessful': 1, 'say': 1, 'simply': 1, 'which': 2, 'device': 5, 'only': 1, 'attractive': 1, 'max)': 1, 'offer': 1, 'nothing': 1, 'lack': 1, 'Random': 1, 'pulled': 1, 'Paperwhite.': 1, 'this': 2, 'felt': 1, 'visiting': 1, 'appears': 1, 'publisher/sellers': 1, 'two': 2, 'ebooks.': 1, 'are': 1, 'major': 2, 'Tyndale,': 1, 'pretty': 1, 'clarity': 1, 'dollars': 1, 'Penguin,': 1, 'even': 1, 'enticing': 1, '(impressive),': 1, 'price.': 1, 'and': 13, 'over': 1, 'seems': 3, "didn't": 1, 'also': 1, 'order': 2, 'little,': 1, 'Amazon,': 1, 'reason': 1, 'have': 2, 'suggested': 1, 'digital.': 1, '(even': 1, 'redeem': 1, 'no': 1, 'pricing': 1, 'Simon': 1, 'pause': 1, 'cannot': 1, 'on': 6, 'publisher': 1, 'HarperCollins,': 1, 'yet': 1, 'Purchased': 1, 'consumer': 1, 'note': 1, 'attempts': 1, 'imagine': 1, 
'box': 1, 'suspect': 2, 'case.': 1, 'an': 2, 'author': 2, 'Skip': 1, 'much': 1, 'published': 2, 'charging.)': 1, 'be': 2, 'affiliated': 1, 'list,': 1, 'expensive': 2, 'digital/published': 1, 'leaves': 1, 'purchasing': 1, 'Why': 1, 'return.': 1, 'Conditions': 1, '5watt': 1, 'vague': 1, 'title,': 1, 'This': 1, 'If': 2, 'know': 1, 'do': 1, 'favorite': 1, 'invoice': 1, 'than': 1, 'Terms': 1, 'House,': 1, 'handy': 1, 'since': 1, 'In': 2, 'up': 1, 'charged': 1, 'definitely': 1, 'purchase': 5, 'like': 3, 'replace': 1, 'rep': 1, 'wonderful': 1, 'the': 35, 'enough,': 1, 'Friday': 1, 'find': 1, 'problematic,': 1, 'been': 4, 'applicable': 1, 'probably': 2, 'bundle': 2, 'open': 1, 'credit': 7, 'However,': 1, 'could': 3, 'paragraphs.': 1, 'As': 1, 'still': 1, 'but': 2, 'restriction.': 1, 'ahead': 1, 'NONE': 2, 'gave': 1, 'charger.': 1, 'language': 1, 'advertised': 1, 'database.': 1, 'again,': 1, 'bundling.': 1, 'dropped..': 1, 'work': 1, 'houses.': 1, 'and/or': 1, 'credit.': 2, 'authors': 1, 'great': 1, 'third': 1, 'he': 1, 'by': 2, 'has': 1, 'promotion': 1, 'dictionary,': 1, 'at': 1, 'works': 2, 'book': 1, 'though': 1, 'it': 3, 'useless': 1, 'it.': 1, 'writers': 1, 'refusal': 1, 'NOT': 2, 'as': 2, 'Schuster,': 1, 'less': 2, 'would': 9, 'I': 17, 'a': 5, 'their': 1, '(i.e.': 1, 'box.': 1, 'enjoying': 1, 'Amazon': 7, '$15': 3, 'separately': 1, 'it,': 1, 'promoted': 1, 'publishing': 2, 'with': 3, "haven't": 1, 'easy': 1, 'magnetic': 1, 'retrospect,': 1, 'ebook(s).': 1, 'Black': 2, 'special': 1, 'list': 2, 'scammed.': 1, 'charger': 4, 'rather': 2, 'located': 1, 'misleading': 1, 'asking': 1, '(Even': 1, 'feel': 1, 'Scholastic,': 1, 'such': 2, 'ebook': 3, 'into': 1, 'recommend': 1, 'Friday.': 1, 'towards': 1, 'Thomas': 1, 'easily': 1, 'gives': 1, 'properly': 1, 'case': 4, 'me.': 1, 'three': 2, 'etc.': 2, 'rep,': 1, 'next': 1, 'bookmark,': 1, 'etc.)': 1, 'my': 1, 'not': 2, 'were': 4, 'in': 3, 'suggesting': 1, 'disallowing': 1, 'iPhone': 1, 'party': 1, 'any': 1, 'where': 1, 
'perfectly,': 1, 'regarding': 2, 'applicable,': 1, 'underwhelming.': 1, '-': 3, 'virtually': 2, 'worthless.': 1, 'or': 2, 'had': 4, 'use': 4, 'highlighting,': 1, 'event': 1, 'He': 1, 'houses': 1, 'that': 1, 'for': 4, "I'm": 1, 'The': 7, 'available': 1, 'prolific': 1, 'stars?': 2, 'ALL': 1, 'thereby': 1, 'due': 1, 'books': 2}}

310 µs ± 272 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
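The output above counts 'The' and 'the' separately and keeps punctuation attached to words ('problem.', '(love'), because `str.split()` only splits on whitespace. If that is not wanted, one option is to normalize the tokens first. This is a rough sketch, assuming lowercase matching and a simple regular expression are good enough; `tokenize` and the sample data are hypothetical names introduced for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Hypothetical sample data in the same shape as the question's JSON.
sample = [
    {"id": 1, "text": "Best. Deal. Ever. Love my kindle (love it!)"},
    {"id": 2, "text": "The case is okay, the charger is fine."},
]

count = {}
for user in sample:
    per_user = count.setdefault(user["id"], {})
    for word in tokenize(user["text"]):
        per_user[word] = per_user.get(word, 0) + 1

print(count[1]["love"])
print(count[2]["the"])
```

With this pattern, symbols such as "$" are dropped, so "$20" would be counted as "20", while contractions like "didn't" stay whole.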


collections.Counter

from collections import Counter

count = {}
for user in file:
    # A Counter returns 0 for missing keys, so += works without get()
    count[user['id']] = Counter()
    for word in user['text'].split():
        count[user['id']][word] += 1

Output:

{1: Counter({'and': 3, 'it': 3, 'a': 3, 'my': 2, 'kindle': 2, 'I': 2, 'the': 2, 'charges': 1, 'dark': 1, 'reading': 1, 'purple': 1, 'being': 1, 'works': 1, 'outlet': 1, 'read': 1, 'too!)': 1, 'like': 1, 'wonderful.': 1, 'also': 1, 'The': 1, 'much': 1, 'sale': 1, 'paperwhite': 1, 'cover': 1, 'Thanks': 1, 'Best.': 1, 'came': 1, 'Deal.': 1, 'so': 1, 'Ever.': 1, 'ago': 1, 'advertise.': 1, '$20': 1, 'Amazon.': 1, 'bought': 1, 'problem.': 1, 'cheaper.': 1, 'got': 1, 'month': 1, 'work': 1, 'makes': 1, 'just': 1, 'than': 1, 'everything': 1, 'Also': 1, 'this': 1, 'fine': 1, 'able': 1, 'to': 1, 'without': 1, 'was': 1, 'in': 1, 'book.': 1, 'at': 1, 'Bought': 1, 'Love': 1, 'on': 1, 'over': 1, 'looks': 1, '(love': 1, 'traditional': 1, 'easier': 1}), 2: Counter({'the': 35, 'I': 17, 'and': 13, 'to': 12, 'would': 9, 'Amazon': 7, 'credit': 7, 'The': 7, 'of': 7, 'on': 6, 'is': 6, 'a': 5, 'device': 5, 'purchase': 5, 'use': 4, 'been': 4, 'charger': 4, 'case': 4, 'were': 4, 'for': 4, 'had': 4, 'like': 3, 'in': 3, 'it': 3, '-': 3, '$15': 3, 'ebook': 3, 'could': 3, 'seems': 3, 'with': 3, 'After': 3, 'published': 2, 'works': 2, 'two': 2, 'by': 2, 'books': 2, 'In': 2, 'rather': 2, 'or': 2, 'such': 2, 'not': 2, 'probably': 2, 'less': 2, 'be': 2, 'major': 2, 'author': 2, 'NOT': 2, 'which': 2, 'publishing': 2, 'etc.': 2, 'expensive': 2, 'NONE': 2, 'if': 2, 'bundle': 2, 'as': 2, 'have': 2, 'credit.': 2, 'virtually': 2, 'list': 2, 'three': 2, 'Black': 2, 'this': 2, 'an': 2, 'regarding': 2, 'stars?': 2, 'order': 2, 'If': 2, 'suspect': 2, 'but': 2, 'properly': 1, 'charging.)': 1, 'dollars': 1, 'underwhelming.': 1, 'located': 1, 'dropped..': 1, 'suggesting': 1, 'return.': 1, 'much': 1, 'Conditions': 1, 'charger.': 1, 'Scholastic,': 1, 'list,': 1, 'attempts': 1, 'note': 1, 'pause': 1, 'applicable,': 1, 'repair': 1, 'replace': 1, 'and/or': 1, 'box.': 1, 'He': 1, 'invoice': 1, 'clarity': 1, 'Thomas': 1, 'title,': 1, "I'm": 1, 'it,': 1, 'enticing': 1, 'separately': 1, 'event': 1, 'pulled': 1, 
'though': 1, 'Tyndale,': 1, 'several': 1, 'use,': 1, 'has': 1, 'noting': 1, 'promotion': 1, 'pretty': 1, 'suggested': 1, 'vague': 1, 'lack': 1, 'bundling.': 1, "haven't": 1, 'houses': 1, 'retrospect,': 1, 'clicking': 1, 'easy': 1, 'Amazon,': 1, 'Schuster,': 1, 'favorite': 1, 'reason': 1, 'many': 1, '(even': 1, 'applicable': 1, 'special': 1, 'iPhone': 1, 'prolific': 1, 'definitely': 1, 'my': 1, 'up': 1, 'wonderful': 1, 'are': 1, 'attractive': 1, 'case.': 1, 'it.': 1, 'redeem': 1, 'know': 1, 'digital/published': 1, 'great': 1, 'no': 1, 'any,': 1, 'As': 1, 'promoted': 1, 'respectfully': 1, 'rep': 1, 'telling': 1, 'ebooks.': 1, "didn't": 1, 'handy': 1, 'However,': 1, 'publisher/sellers': 1, 'disallowing': 1, 'price.': 1, 'perfectly,': 1, 'very': 1, 'worthless.': 1, 'into': 1, 'restriction.': 1, 'magnetic': 1, 'buy': 1, 'next': 1, 'HarperCollins,': 1, 'unsuccessful': 1, 'their': 1, 'find': 1, 'pricing': 1, 'Why': 1, 'language': 1, 'asking': 1, '(Even': 1, 'any': 1, 'imagine': 1, 'trying': 1, 'offer': 1, 'ebook(s).': 1, 'towards': 1, 'Random': 1, 'thereby': 1, 'Paperwhite.': 1, 'Simon': 1, 'third': 1, 'rep,': 1, 'Skip': 1, 'consumer': 1, 'finding': 1, 'affiliated': 1, 'cannot': 1, 'House,': 1, 'houses.': 1, 'say': 1, 'gave': 1, 'enjoying': 1, 'due': 1, 'etc.)': 1, '(impressive),': 1, 'publisher': 1, 'ALL': 1, 'became': 1, 'scammed.': 1, 'gives': 1, 'appears': 1, 'recommend': 1, 'improper': 1, 'problematic,': 1, 'Friday': 1, 'sturdy': 1, 'again,': 1, 'open': 1, 'expected,': 1, 'got': 1, 'dictionary,': 1, 'max)': 1, 'lighting': 1, 'Nelson,': 1, 'feel': 1, 'applied.': 1, 'yet': 1, 'party': 1, 'book': 1, 'enough,': 1, 'available': 1, 'purchasing': 1, 'okay,': 1, 'days': 1, 'bookmark,': 1, 'misleading': 1, 'where': 1, 'putting': 1, 'box': 1, '5watt': 1, 'Friday.': 1, 'felt': 1, 'ahead': 1, 'even': 1, 'authors': 1, 'leaves': 1, 'advertised': 1, 'easily': 1, 'visiting': 1, 'refusal': 1, 'me.': 1, 'Terms': 1, 'only': 1, 'digital.': 1, 'also': 1, 'he': 1, 'useless': 1, 'This': 1, 
'still': 1, 'then': 1, 'highlighting,': 1, 'do': 1, 'features': 1, 'Purchased': 1, 'closure': 1, 'database.': 1, 'Penguin,': 1, 'work': 1, 'best': 1, 'than': 1, 'paragraphs.': 1, 'since': 1, 'being': 1, 'that': 1, 'over': 1, 'charged': 1, 'nothing': 1, 'writers': 1, '(i.e.': 1, 'weak.': 1, 'at': 1, '(3)': 1, 'simply': 1, 'little,': 1})}

536 µs ± 858 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
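Both loops above start a fresh counter for every record, so if the same id appeared in more than one record, the later record would replace the earlier counts. The sample has unique ids, so this may not matter here. If ids can repeat (an assumption about the real data), a sketch along these lines would accumulate counts per id instead; the sample list is made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample in which id 1 appears twice.
sample = [
    {"id": 1, "text": "love my kindle"},
    {"id": 2, "text": "the case is okay"},
    {"id": 1, "text": "love the cover"},
]

# defaultdict creates an empty Counter the first time an id is seen.
count = defaultdict(Counter)
for user in sample:
    # update() adds to existing counts instead of replacing them.
    count[user["id"]].update(user["text"].split())

# most_common(n) returns the n most frequent (word, count) pairs.
print(count[1].most_common(1))
```

Passing the whole word list to `update()` also removes the inner loop, since the Counter does the per-word counting itself.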


 