Python - Map / Reduce - How do I read a specific JSON field in the Disco count words example?

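I am using the Disco example that counts the words in a file:

Counting Words as a map/reduce job

It works without problems, but I would like to read a specific field from a text file whose lines are JSON strings.

The file has lines like the following: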

{"favorited": false, "in_reply_to_user_id": 306846931, "contributors": null, "truncated": false, "text": "@CataDuarte8 No! av\u00edseme cuando vaya ah salir para yo salir igual!", "created_at": "Wed Apr 04 20:25:37 +0000 2012", "retweeted": false, "in_reply_to_status_id": 187636960632901632, "coordinates": null, "id": 187637067415683073, "entities": {"user_mentions": [{"indices": [0, 12], "id_str": "306846931", "id": 306846931, "name": "Catalina Ria\u00f1o!\u2661", "screen_name": "CataDuarte8"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": "187636960632901632", "id_str": "187637067415683073", "in_reply_to_screen_name": "CataDuarte8", "user": {"follow_request_sent": null, "profile_use_background_image": true, "id": 286402064, "description": "Cada quien RECOJE lo que SIEMBRA (:\r\n\u2551\u258c\u2502\u2551\u2502\u2551\u258c\u2502\u2588\u2551\u2502\u2551\u258c\u2502\u2551\u258c\u2551 ", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1858805061/ginri_normal.jpg", "profile_sidebar_fill_color": "525252", "is_translator": false, "geo_enabled": false, "profile_text_color": "ffffff", "followers_count": 620, "protected": false, "location": "", "default_profile_image": false, "id_str": "286402064", "utc_offset": -21600, "statuses_count": 16395, "profile_background_color": "000000", "friends_count": 537, "profile_link_color": "ff0000", "profile_image_url": "http://a0.twimg.com/profile_images/1858805061/ginri_normal.jpg", "notifications": null, "show_all_inline_media": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/419254765/Scan0004.jpg", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/419254765/Scan0004.jpg", "screen_name": "LadyRomeroo", "lang": "es", "profile_background_tile": true, "favourites_count": 136, "name": "Lady Romero \u2605", "url": "http://www.facebook.com/profile.php?id=1640385164", "created_at": "Fri Apr 22 23:04:41 +0000 2011", 
"contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "profile_sidebar_border_color": "0a5b80", "default_profile": false, "following": null, "listed_count": 0}, "place": null, "retweet_count": 0, "geo": null, "in_reply_to_user_id_str": "306846931", "source": "web"}

I am only interested in the value of the "text" key. In plain Python I can do this:

import simplejson
f = open("file.json", "r")
for line in f:
    r = simplejson.loads(line).get('text')
    print r

It returns all the text field values, such as:

@_MuitoMais_  ´vcs são d  msm amei o pode ou ão pode e a entrevist com a @claudialeitte =)

This works fine, but when I try to apply the same approach to the count_words.py example that ships with Disco, like this:

from disco.core import Job, result_iterator
import simplejson

def map(line, params):
    r = simplejson.loads(line).get('text')
    for word in r.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["/tmp/file.json"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print word, count

I get the following error:

# python test.py 
Job@549:b4c76:9cbb1:
Status: [map] 0 waiting, 1 running, 0 done, 0 failed
2012/11/24 02:01:10  master     New job initialized!
2012/11/24 02:01:10  master     Starting job
2012/11/24 02:01:10  master     Starting map phase
2012/11/24 02:01:10  master     map:0 assigned to comp1
2012/11/24 02:01:11  master     ERROR: Job failed: Worker at 'comp1' died: Traceback (most recent call last):
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 329, in main                               
    job.worker.start(task, job, **jobargs)                                                                                                                              
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 290, in start                              
    self.run(task, job, **jobargs)                                                                                                                                      
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 286, in run                          
    getattr(self, task.mode)(task, params)                                                                                                                              
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 302, in map                          
    part = str(self['partition'](key, self['partitions'], params))                                                                                                      
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/func.py", line 341, in default_partition              
    return hash(str(key)) % nr_partitions                                                                                                                               
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 0: ordinal not in range(128)                                                               

2012/11/24 02:01:11  master     WARN: Job killed
Status: [map] 1 waiting, 0 running, 0 done, 1 failed
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    for word, count in result_iterator(job.wait(show=True)):
  File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 348, in wait
    timeout, poll_interval * 1000)
  File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 309, in check_results
    raise JobError(Job(name=jobname, master=self), "Status %s" % status)
disco.error.JobError: Job Job@549:b4c76:9cbb1 failed: Status dead

It seems like this should be straightforward, but I am clearly missing something.

Can anyone help?

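Your problem is in disco/worker/classic/func.py: as the traceback shows, default_partition calls str(key) on every key your map function yields. In Python 2, str() on a unicode object encodes it with the ASCII codec, so it fails on any non-ASCII character: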

>>> str(u'\xb4')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 0: ordinal not in range(128)
>>>
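
The implicit conversion above is equivalent to an explicit encode with the ASCII codec, so the failure can be reproduced without Disco. A minimal sketch, where ascii_encodable is a hypothetical helper name and not part of Disco or the standard library:

```python
def ascii_encodable(text):
    """Return True if text can be encoded with the ASCII codec."""
    try:
        text.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

# A plain ASCII word is fine; a word starting with u'\xb4' is not.
print(ascii_encodable(u'salir'))     # True
print(ascii_encodable(u'\xb4vcs'))   # False
```

Any key for which this returns False would trigger the UnicodeEncodeError seen in the job log.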

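Since you are only counting words, you can use the unicodedata module to convert your Unicode data to a plain ASCII string. Note that this strips accents and drops any character that has no ASCII equivalent, so "avíseme" is counted as "aviseme":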

import json
import unicodedata
f = open('file.json')
for line in f:
    r = json.loads(line).get('text')
    s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
    print r
    print s

Output:

@CataDuarte8 No! avíseme cuando vaya ah salir para yo salir igual!
@CataDuarte8 No! aviseme cuando vaya ah salir para yo salir igual!
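
To make clear what the normalize/encode pair keeps and what it throws away, here is a small sketch. The helper name to_ascii is hypothetical, and the trailing .decode('ascii') is an addition so that the result is a text string on both Python 2 and Python 3; the original snippet above stops at the encoded byte string.

```python
import unicodedata

def to_ascii(text):
    """Strip accents and drop characters that have no ASCII form."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize('NFD', text)
    # 'ignore' then discards the combining marks and any other non-ASCII.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii(u'av\u00edseme'))        # aviseme
print(to_ascii(u'Ria\u00f1o!\u2661'))   # Riano!
print(len(to_ascii(u'\xb4')))           # 0 (the character is dropped entirely)
```

Characters such as u'\xb4' or u'\u2661' have no canonical decomposition into an ASCII letter, so they vanish rather than being replaced.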

Applying this to your problem, rewrite your map() function as:

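def map(line, params):
    # Local import, following the same pattern as reduce() in the question.
    import unicodedata
    r = simplejson.loads(line).get('text')
    # Decompose accented characters, then drop whatever is not ASCII.
    s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
    for word in s.split():
        yield word, 1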
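
Before submitting the job, the map logic can be checked locally without a Disco cluster. The sketch below is an approximation under stated assumptions: it uses the standard json module in place of simplejson, itertools.groupby in place of disco.util.kvgroup, and hypothetical names (map_text_words, count_locally). It also treats a line without a "text" field as empty, which is an assumption about the input and not something the original file is known to contain.

```python
import json
import unicodedata
from itertools import groupby

def map_text_words(line, params=None):
    """Yield (word, 1) for each ASCII-folded word in the "text" field."""
    # Assumption: lines lacking "text" are skipped by treating them as empty.
    text = json.loads(line).get('text') or u''
    folded = unicodedata.normalize('NFD', text).encode('ascii', 'ignore')
    for word in folded.decode('ascii').split():
        yield word, 1

def count_locally(lines):
    """Sort and group the mapped pairs, roughly as the reduce step would."""
    pairs = sorted(pair for line in lines for pair in map_text_words(line))
    return dict((word, sum(n for _, n in group))
                for word, group in groupby(pairs, key=lambda pair: pair[0]))

# Hypothetical sample input: one tweet-like line and one line without "text".
sample = [
    '{"text": "No! av\\u00edseme cuando vaya ah salir para yo salir igual!"}',
    '{"delete": {"status": {"id": 1}}}',
]
counts = count_locally(sample)
print(counts['salir'])     # 2
print(counts['aviseme'])   # 1
```

If the local counts look right, the same body can be placed in the map() function passed to Job().run().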
