
How Do I Read a Multi-line File of JSON and Count Words in a Specific Field in Python?

I have a file with many hundreds of lines of JSON-encoded tweets pulled from python-tweetstreamer. The lines look like:

{"favorited": false, "in_reply_to_user_id": null, "contributors": null, "truncated": false, "text": "kasian pak weking :| RT @veNikenD: Kasian kenapa???RT @SaputraJordhy: kasian \u256e(\u256f_\u2570)\u256d RT @veNikenD: Tak ingin lg kudengar kata2 yg tak ......", "created_at": "Tue Apr 03 14:07:59 +0000 2012", "retweeted": false, "in_reply_to_status_id": null, "coordinates": null, "in_reply_to_user_id_str": null, "entities": {"user_mentions": [{"indices": [24, 33], "screen_name": "veNikenD", "id": 64910664, "name": "Ve Damayanti", "id_str": "64910664"}, {"indices": [54, 68], "screen_name": "SaputraJordhy", "id": 414675856, "name": "jordhy_ynwa", "id_str": "414675856"}, {"indices": [88, 97], "screen_name": "veNikenD", "id": 64910664, "name": "Ve Damayanti", "id_str": "64910664"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": null, "id_str": "187179645026836481", "in_reply_to_screen_name": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/418679759/Young_Minato_and_Kushina_by_HaNa7.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1953774147/Untitled1_normal.png", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "id": 414675856, "profile_text_color": "1c181c", "followers_count": 46, "protected": false, "location": "", "default_profile_image": false, "listed_count": 0, "utc_offset": 25200, "statuses_count": 409, "description": "never walk alone", "friends_count": 76, "profile_link_color": "0084B4", "profile_image_url": "http://a0.twimg.com/profile_images/1953774147/Untitled1_normal.png", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "C0DEED", "id_str": "414675856", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/418679759/Young_Minato_and_Kushina_by_HaNa7.jpg", "screen_name": "SaputraJordhy", "lang": "id", "profile_background_tile": true, "favourites_count": 0, "name": "jordhy_ynwa", "url": null, "created_at": "Thu Nov 17 10:41:05 +0000 2011", "contributors_enabled": false, "time_zone": "Jakarta", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": null}, "place": null, "retweet_count": 0, "geo": null, "id": 187179645026836481, "source": "<a href=\"https://embr.in\" rel=\"nofollow\">embr</a>"}
{"favorited": false, "in_reply_to_user_id": 441527150, "contributors": null, "truncated": false, "text": "@akoriko1046 \u5bdd\u308b\u306e\uff1f\u3000\u5f85\u3063\u3066\u50d5\u3082\u884c\u304f\u3088\u2026\u5e03\u56e3\u307e\u3067\u304a\u59eb\u69d8\u62b1\u3063\u3053\u3057\u3066\u3044\u3063\u3066\u3042\u3052\u308b", "created_at": "Tue Apr 03 14:07:59 +0000 2012", "retweeted": false, "in_reply_to_status_id": 187179532103598080, "coordinates": null, "in_reply_to_user_id_str": "441527150", "entities": {"user_mentions": [{"indices": [0, 12], "screen_name": "akoriko1046", "id": 441527150, "name": "\u30a2\u30b3\u30ea\u30b3", "id_str": "441527150"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": "187179532103598080", "id_str": "187179645014253568", "in_reply_to_screen_name": "akoriko1046", "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/440543906/122004876_org.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1870697453/__________bg_normal.jpg", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "id": 513679998, "profile_text_color": "333333", "followers_count": 169, "protected": false, "location": "\u3042\u306a\u305f\u306e\u96a3", "default_profile_image": false, "listed_count": 2, "utc_offset": 32400, "statuses_count": 6024, "description": "\u8584\u685c\u9b3c\u6c96\u7530\u7dcf\u53f8\u306e\u975e\u516c\u5f0fbot\u3067\u3059\u7518\u7518/\u30a8\u30ed\u8a2d\u5b9a\u3000\uff8c\uff6b\uff9b\uff70\u306e\u518d\u306f\u5fc5\u305a\u8aac\u660e\u66f8\u3092\u4e00\u8aad\u4e0b\u3055\u3044http://www.pixiv.net/novel/show.php?id=934499  \u624b\u52d5\u3067\u30d5\u30a9\u30ed\u8fd4\u3057\u3092\u884c\u3063\u3066\u307e\u3059\u3000\u7a00\u306b\u4e2d\u306b\u7ba1\u7406\u4eba\u304c\u3044\u307e\u3059\u3000\u7ba1\u7406\u4eba@akanemam1   18\u7981\u7dcf\u53f8\u2192 @sou_oki_18bot", "friends_count": 166, "profile_link_color": "0084B4", "profile_image_url": "http://a0.twimg.com/profile_images/1870697453/__________bg_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "C0DEED", "id_str": "513679998", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/440543906/122004876_org.jpg", "screen_name": "sou_oki_bot", "lang": "ja", "profile_background_tile": false, "favourites_count": 1, "name": "\u7dcf\u53f8(bot)", "url": null, "created_at": "Sat Mar 03 22:36:15 +0000 2012", "contributors_enabled": false, "time_zone": "Irkutsk", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": null}, "place": null, "retweet_count": 0, "geo": null, "id": 187179645014253568, "source": "<a href=\"http://twittbot.net/\" rel=\"nofollow\">twittbot.net</a>"}
{"favorited": false, "in_reply_to_user_id": 141448885, "contributors": null, "truncated": false, "text": "@nobuttu3 \u6642\u9593\u304c\u904e\u304e\u308b\u306e\u304c\u7269\u51c4\u304f\u65e9\u3044\u3067\u3059\u3088\u306d\u2026", "created_at": "Tue Apr 03 14:07:59 +0000 2012", "retweeted": false, "in_reply_to_status_id": 187179547098234880, "coordinates": null, "in_reply_to_user_id_str": "141448885", "entities": {"user_mentions": [{"indices": [0, 9], "screen_name": "nobuttu3", "id": 141448885, "name": "\u306e\u4ecf \uf8ff \u30bf\u30ab\u30cf\u30b7", "id_str": "141448885"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": "187179547098234880", "id_str": "187179645047799808", "in_reply_to_screen_name": "nobuttu3", "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/424878981/tw_hvt_nz.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1223811261/twitter_icon_normal.png", "profile_sidebar_fill_color": "daecf4", "is_translator": false, "id": 97481308, "profile_text_color": "663B12", "followers_count": 436, "protected": false, "location": "\u6771\u4eac\u90fd\u53f0\u6771\u533a", "default_profile_image": false, "listed_count": 20, "utc_offset": 32400, "statuses_count": 63704, "description": "\u591a\u5206PG\u3001\u6642\u3005SE\u307d\u3044\u4ed5\u4e8b\u3092\u3057\u3066\u3044\u307e\u3059\u3002\u30e9\u30ce\u30d9\u597d\u304d\u3001\u97f3\u697d\u597d\u304d(\u7279\u5b9a\u306e\u5206\u91ce\u3067\u3059\u304c)\u3002\u30bd\u30b3\u30bd\u30b3\u306e\u983b\u5ea6\u3067\u79cb\u8449\u539f\u306b\u3044\u305f\u308a\u3082\u3057\u307e\u3059\u3002 ", "friends_count": 896, "profile_link_color": "1F98C7", "profile_image_url": "http://a0.twimg.com/profile_images/1223811261/twitter_icon_normal.png", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "ffffff", "id_str": "97481308", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/424878981/tw_hvt_nz.jpg", "screen_name": "xi6", "lang": "ja", "profile_background_tile": false, "favourites_count": 4473, "name": "\u3055\u304f", "url": null, "created_at": "Thu Dec 17 16:55:25 +0000 2009", "contributors_enabled": false, "time_zone": "Tokyo", "profile_sidebar_border_color": "C6E2EE", "default_profile": false, "following": null}, "place": null, "retweet_count": 0, "geo": null, "id": 187179645047799808, "source": "<a href=\"http://tapbots.com/tweetbot\" rel=\"nofollow\">Tweetbot for iOS</a>"}
{"favorited": false, "in_reply_to_user_id": null, "contributors": null, "truncated": false, "text": "#ImSingleBecause lolz I'm not. Happily taken by @GarrettBettler &lt;33 I love him,  forever :)", "created_at": "Tue Apr 03 14:07:59 +0000 2012", "retweeted": false, "in_reply_to_status_id": null, "coordinates": null, "in_reply_to_user_id_str": null, "entities": {"user_mentions": [{"indices": [48, 63], "screen_name": "GarrettBettler", "id": 460816116, "name": "Garrett Bettler", "id_str": "460816116"}], "hashtags": [{"indices": [0, 16], "text": "ImSingleBecause"}], "urls": []}, "in_reply_to_status_id_str": null, "id_str": "187179645039427584", "in_reply_to_screen_name": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/452188318/tanja_beach_2007_001.JPG", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1971847266/image_normal.jpg", "profile_sidebar_fill_color": "f6ffd1", "is_translator": false, "id": 461432420, "profile_text_color": "333333", "followers_count": 222, "protected": false, "location": "", "default_profile_image": false, "listed_count": 0, "utc_offset": null, "statuses_count": 2334, "description": "", "friends_count": 192, "profile_link_color": "0099CC", "profile_image_url": "http://a0.twimg.com/profile_images/1971847266/image_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "FFF04D", "id_str": "461432420", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/452188318/tanja_beach_2007_001.JPG", "screen_name": "LeahOswalt", "lang": "en", "profile_background_tile": false, "favourites_count": 86, "name": "Leah Oswalt", "url": null, "created_at": "Wed Jan 11 20:07:24 +0000 2012", "contributors_enabled": false, "time_zone": null, "profile_sidebar_border_color": "fff8ad", "default_profile": false, "following": null}, "place": null, "retweet_count": 0, "geo": null, "id": 187179645039427584, "source": "<a href=\"http://twitter.com/#!/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"}
{"favorited": false, "in_reply_to_user_id": 434884235, "contributors": null, "truncated": false, "text": "@nomimushi_ttk \u3068\u30fc\u3084\u3082\u7d20\u6575\u3060\u3051\u3069\u306e\u307f\u3080\u3057\u306e\u30a2\u30a4\u30b3\u30f3\u5929\u4f7f\u3059\u304e\u3066", "created_at": "Tue Apr 03 14:07:59 +0000 2012", "retweeted": false, "in_reply_to_status_id": 187179241664815105, "coordinates": null, "in_reply_to_user_id_str": "434884235", "entities": {"user_mentions": [{"indices": [0, 14], "screen_name": "nomimushi_ttk", "id": 434884235, "name": "\u306e\u307f\u3080\u3057", "id_str": "434884235"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": "187179241664815105", "id_str": "187179645026836480", "in_reply_to_screen_name": "nomimushi_ttk", "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/images/themes/theme1/bg.png", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/2023688835/6222d8ae387e6fb5b220895d2fd2d41a_normal.gif", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "id": 365471550, "profile_text_color": "333333", "followers_count": 308, "protected": false, "location": "\u5b66\u5712\u30a2\u30ea\u30b9\u306b\u518d\u71b1\u306a\u3046", "default_profile_image": false, "listed_count": 17, "utc_offset": 32400, "statuses_count": 25562, "description": "\uff8c\uff9e\uff9a10/\u3046\u305f\u30d7\u30ea/HTF/\u3044\u306c\u307c\u304f\u306a\u3069\u306b\u304a\u71b1/\u5d50\u306e\u5927\u91ce\u304f\u3093\u3059\u304d\uff01\u64ec\u4eba\u5316\u3082\u3050\u3082\u3050/\u30a4\u30ca\u30a4\u30ec/RKRN/pkmn/\uff83\uff86\uff8c\uff9f\uff98/\u4e59\u5973\uff79\uff9e\uff70\u5168\u822c\u3082 [\u30bf\u30ab\u4e38\u3055\u3093\u30e2\u30b0\u30e2\u30b0\u30da\u30c3\u3063\u3066\u3057\u968a\u54e1No.2\uff3c\u526f\u968a\u9577\uff0f]\u3000\u898f\u5236\u57a2\u3010@ao_sanagi_2\u3011\u30a2\u30a4\u30b3\u30f3\u306f\u3068\u30fc\u3084\u304b\u3089\uff01", "friends_count": 284, "profile_link_color": "0084B4", "profile_image_url": "http://a0.twimg.com/profile_images/2023688835/6222d8ae387e6fb5b220895d2fd2d41a_normal.gif", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "C0DEED", "id_str": "365471550", "profile_background_image_url": "http://a0.twimg.com/images/themes/theme1/bg.png", "screen_name": "ao_sanagi", "lang": "ja", "profile_background_tile": false, "favourites_count": 1071, "name": "\u8475@\u6284\u82b1\u306e\u5ac1", "url": null, "created_at": "Wed Aug 31 14:07:23 +0000 2011", "contributors_enabled": false, "time_zone": "Tokyo", "profile_sidebar_border_color": "C0DEED", "default_profile": true, "following": null}, "place": null, "retweet_count": 0, "geo": null, "id": 187179645026836480, "source": "<a href=\"http://www.movatwi.jp\" rel=\"nofollow\">\u30e2\u30d0\u30c4\u30a4 / www.movatwi.jp .</a>"}

My end goal is to count the number of times a specific word occurs in the "text" field across all the tweets. I have tried a number of different approaches with varying degrees of success, but here's where I'm at:

import fileinput
import json
import sys
import os

line = []

inputfilename = sys.argv[1]

for line in fileinput.input([inputfilename]):
  tweettext = json.loads(line).get('text').split()
  print tweettext

This loops through the file and splits each line's "text" field into individual words, but it does not create a single list of words. To add to the issue, when it hits a blank line or a record with no "text" field it fails:

[u'RT', u'@keenakan:', u'kamu', u'tidak', u'perlu', u'memperjuangkan', u'aku.', u'Yang', u'perlu', u'ialah', u'aku', u'dan', u'kamu', u'yang', u'memperjuangkan', u'kita.', u'-@commaditya']
[u'RT', u'@TheRealToxicBoi:', u'#LiesBeforeSex', u"I'll", u'be', u'Gentle!']
[u'@coliriostar', u'Quer', u'GANHAR', u'R$', u'300,00', u'em', u'vale', u'compra?', u'SIGA', u'@eucompronanet', u'e', u'saiba', u'como', u'participar,', u'\xe9', u'simples', u'e', u'r\xe1pido!', u'at\xe9', u'+', u'ci']
Traceback (most recent call last):
  File "newexample.py", line 11, in <module>
    tweettext = json.loads(line).get('text').split()
AttributeError: 'NoneType' object has no attribute 'split'
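
For reference, the traceback happens because dict.get returns None when the key is missing, so any parsed record without a "text" field leaves nothing to call .split() on. A tiny demonstration (the record below is just a made-up illustrative example):

import json

record = json.loads('{"delete": {"status": {"id": 1}}}')   # example record with no "text" field
print record.get('text')    # prints None; calling .split() on None raises the AttributeError above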

Can anyone suggest a solution?

edit:

Based on the first comment, I've edited the code as follows, as best I understand it:

import fileinput
import json
import sys
import os

line = []
tw = 0

inputfilename = sys.argv[1]

for line in fileinput.input([inputfilename]):
    line = line.strip()
    if not line:                 # skip blank lines
        continue
    tweettext = json.loads(line).get('text')
    if not tweettext:            # skip records with no "text" field, reusing the parsed value
        continue
    words = tweettext.split()
    print words
    tw += len(words)             # accumulate the running total across all lines

print "total number of words", tw

My output is looking better; at least I'm not getting the AttributeError: 'NoneType' anymore. Now the output consists of a separate list of words for each tweet instead of one combined collection. Again, my goal is to count how many times each word occurs, and I'm not sure how to do that unless they are all in one dict. Here's a sample of the output at this point:

[u'L', u'Lawliet', u'(Sweets', u'Addict)', u'+', u'Kenshin', u'Himura', u'(Samurai)', u'+', u'Kyon', u'(Lazy', u'and', u'Carefree', u'Bum)', u'=', u'Sakata', u'Gintoki', u'xD', u'May...', u'http://t.co/LD4E1j1v']
[u'Yay', u'~', u'I', u'have', u'ice~I', u'can', u'reach', u'the', u'ice', u'maker!', u'ch', u'sees', u'gaps', u'in', u'the', u'freezer', u'as', u'a', u'challenge', u'and', u"it's", u'usually', u'full', u'to', u'busting.', u'But', u'not', u'now', u'Haha!']
[u'Hoi']
[u'everyones', u'on', u'twitter.']
total number of words 429023

I would guess that I can probably set up counters for each word within the for loop somehow? As you can see, the total word count works fine because it adds the number of words from each line, but I can't quite see how I would determine unique words with something like:

len(set(words))
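
To make this concrete, here is a toy version using two of the word lists from the output above; putting everything into one flat list would make both the total and the unique count easy:

all_words = []
all_words.extend([u'Hoi'])                               # words from one tweet
all_words.extend([u'everyones', u'on', u'twitter.'])     # words from another tweet

print "total number of words", len(all_words)            # 4
print "number of unique words", len(set(all_words))      # 4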

EDIT:

Here's my final solution:

import fileinput
import json
import sys
import os
from collections import defaultdict

line = []
tw = 0

inputfilename = sys.argv[1]

word_count = defaultdict(int)

for line in fileinput.input([inputfilename]):
    line = line.strip()
    if not line:                 # skip blank lines
        continue
    tweettext = json.loads(line).get('text')
    if not tweettext:            # skip records with no "text" field
        continue
    words = tweettext.split()
    tw += len(words)
    for word in words:
        word_count[word] += 1

print word_count
print "total number of words", tw

You seem to be on the right track; just add some error checking, e.g.:

Check if a line is blank before loading it as JSON, and also strip the line just to be sure, e.g.:

line = line.strip()
if not line: continue

Check if the JSON data actually has any text in it:

if not json.loads(line).get('text'):
    continue

After that you should loop through the words and maybe create a dict, e.g.:

from collections import defaultdict

word_count = defaultdict(int)
for line in file:
    # get the words from each line's "text" field and add them to the dict
    for word in words:
        word_count[word] += 1
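
Equivalently, collections.Counter can do the same bookkeeping as the defaultdict and adds most_common(); here is a minimal sketch under the same assumptions as the script in the question (fileinput, json, and a filename passed as the first argument):

import fileinput
import json
import sys
from collections import Counter

inputfilename = sys.argv[1]
word_count = Counter()

for line in fileinput.input([inputfilename]):
    line = line.strip()
    if not line:
        continue
    tweettext = json.loads(line).get('text')
    if not tweettext:
        continue
    word_count.update(tweettext.split())    # Counter counts every word in the list

print word_count.most_common(10)             # the ten most frequent (word, count) pairs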
