
MRJOB open JSON file - Python

I am trying to load a JSON file as part of the mapper function, but it returns "No such file in directory" although the file exists.

I am already opening a file and parsing its lines, but I want to compare some of its values against a second JSON file.

from mrjob.job import MRJob
import json
import nltk
import re    

WORD_RE = re.compile(r"\b[\w']+\b")
sentimentfile = open('sentiment_word_list_stemmed.json') 

def mapper(self, _, line):
    stemmer = nltk.PorterStemmer()
    stems = json.loads(sentimentfile)

    line = line.strip()
    # each line is a json line
    data = json.loads(line)
    form = data.get('type', None)

    if form == 'review':
      bs_id = data.get('business_id', None)
      text = data['text']
      stars = data['stars']

      words = WORD_RE.findall(text)
      for word in words:
        w = stemmer.stem(word)
        senti = stems.get(w)

        if senti:
          yield (bs_id, (senti, 1))

You should not be opening a file in the mapper function at all. You only need to pass the file in as STDIN or as the first argument for the mapper to pick it up. Do it like this:

python mrjob_program.py sentiment_word_list_stemmed.json > output

OR

python mrjob_program.py < sentiment_word_list_stemmed.json > output

Either one will work. It says that there is no such file or directory because the mappers cannot see the file that you are specifying. Mappers are designed to run on remote machines. Even if you wanted to read from a file in the mapper, you would need to copy that file to all machines in the cluster, which doesn't really make sense for this example. You can also specify a DEFAULT_INPUT_PROTOCOL so that the mapper knows which type of input you are using.
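As a rough illustration of the idea that the input is handed to the mapper one line at a time, here is a plain-Python sketch of the per-line logic with the lookup table passed in as an ordinary dictionary. It deliberately uses neither mrjob nor nltk. The function name `map_review_line`, the `stem` parameter, and the sample data are hypothetical, and `str.lower` is only a stand-in for the Porter stemmer used in the question.

```python
import json
import re

WORD_RE = re.compile(r"\b[\w']+\b")


def map_review_line(line, stems, stem=str.lower):
    """Yield (business_id, (sentiment, 1)) pairs for one JSON input line.

    `stems` is an already-loaded dict of word -> sentiment score.
    `stem` is a stand-in for a real stemmer (assumption for this sketch).
    """
    data = json.loads(line.strip())
    if data.get('type') != 'review':
        return
    bs_id = data.get('business_id')
    for word in WORD_RE.findall(data['text']):
        # dict.get is a method call: parentheses, not square brackets
        senti = stems.get(stem(word))
        if senti:
            yield (bs_id, (senti, 1))


# Hypothetical data, for illustration only
stems = {'great': 1, 'awful': -1}
lines = [
    '{"type": "review", "business_id": "b1", "text": "Great food", "stars": 5}',
    '{"type": "user", "name": "someone"}',
]
pairs = [pair for line in lines for pair in map_review_line(line, stems)]
```

Keeping the per-line logic in a function that takes the lookup table as an argument makes it easy to check locally before wiring it into a job.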

Here is a talk on the subject that will help:

http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-mrjob-distributed-computing-for-everyone-4898987/

You are using the json.loads() function while passing in an open file. Use json.load() instead (note, no s).

stems = json.load(sentimentfile)
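The difference between the two functions can be seen in a small standard-library sketch. The sample JSON text is made up for illustration, and `io.StringIO` stands in for a file opened with `open(...)`.

```python
import io
import json

# Hypothetical content standing in for sentiment_word_list_stemmed.json
sample_text = '{"good": 1, "bad": -1}'

# json.loads() parses a string that is already in memory
stems_from_string = json.loads(sample_text)

# json.load() reads from a file-like object
with io.StringIO(sample_text) as fake_file:
    stems_from_file = json.load(fake_file)

# Handing a file object to json.loads() raises TypeError
try:
    json.loads(io.StringIO(sample_text))
    loads_accepts_file = True
except TypeError:
    loads_accepts_file = False
```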

You do need to re-open the file every time you call your mapper() function, so it is better to store just the filename globally:

sentimentfile = 'sentiment_word_list_stemmed.json'

def mapper(self, _, line):
    stemmer = nltk.PorterStemmer()
    stems = json.load(open(sentimentfile))
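Re-opening and re-parsing the file for every input line can get slow on large inputs. One possible variation, not part of the answer above, is to cache the parsed dictionary so the file is read only once per process. The helper name `load_stems` is hypothetical; `functools.lru_cache` is from the standard library.

```python
import json
from functools import lru_cache


@lru_cache(maxsize=None)
def load_stems(path):
    """Read and parse the JSON word list at `path`.

    The result is cached per path, so repeated calls return the same dict
    without touching the disk again.
    """
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Inside the mapper this would be used as `stems = load_stems(sentimentfile)`. The cached dictionary is shared between calls, so it should be treated as read-only.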

Last but not least, you should use an absolute path to the file, and not rely on the current working directory being correct.
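A minimal sketch of building such an absolute path, assuming the JSON file sits in the same directory as the job script. The helper name `path_beside` is hypothetical.

```python
import os


def path_beside(script_file, filename):
    """Return an absolute path to `filename` in the directory of `script_file`."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, filename)


# Typical use inside the job script:
# sentimentfile = path_beside(__file__, 'sentiment_word_list_stemmed.json')
```

This only helps when the file really exists at that location on the machine running the mapper; on a cluster it still has to be present on every node.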
