
MapReduce Job (written in Python) runs slowly on EMR

I am trying to write a MapReduce job using Python's MRJob package. The job processes ~36,000 files stored in S3. Each file is ~2MB. When I run the job locally (downloading the S3 bucket to my computer) it takes approximately 1 hour to run. However, when I try to run it on EMR, it takes much longer (I stopped it at 8 hours and it was 10% complete in the mapper). I have attached the code for my mapper_init and mapper below. Does anyone know what would cause an issue like this? Does anyone know how to fix it? I should also note that when I limit the input to a sample of 100 files it works fine.

def mapper_init(self):
    """
    Set class variables that will be useful to our mapper:
        filename: the path and filename to the current recipe file
        previous_line: The line previously parsed. We need this because the
          ingredient name is in the line after the tag
    """

    #self.filename = os.environ["map_input_file"]  # Not currently used
    self.previous_line = "None yet"
    # Determining if an item is in a list is O(n) while determining if an
    #  item is in a set is O(1), so keep the stopwords in a set
    # (requires `from nltk.corpus import stopwords` at module level)
    self.stopwords = set(stopwords.words('english'))


def mapper(self, _, line):
    """
    Takes a line from an html file and yields ingredient words from it

    Given a line of input from an html file, we check to see if it
    contains the identifier that it is an ingredient. Due to the
    formatting of our html files from allrecipes.com, the ingredient name
    is actually found on the following line. Therefore, we save the
    current line so that it can be referenced in the next pass of the
    function to determine if we are on an ingredient line.

    :param line: a line of text from the html file as a str
    :yield: a tuple containing each word in the ingredient as well as a
        counter for each word. The counter is not currently being used,
        but is left in for future development. e.g. "chicken breast" would
        yield "chicken" and "breast"
    """

    # TODO is there a better way to get the tag?
    if re.search(r'span class="ingredient-name" id="lblIngName"',
                 self.previous_line):
        self.previous_line = line
        line = self.process_text(line)
        line_list = set(line.split())
        for word in line_list:
            if word not in self.stopwords:
                yield (word, 1)
    else:
        self.previous_line = line
    yield ('', 0)
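
For reference, a job like this would typically be launched with MRJob's runner switch; the script name and bucket below are placeholders rather than the ones actually used in the question:

# Local run against files downloaded from S3 (placeholder paths)
python recipe_job.py -r local ./recipes/*.html

# EMR run, reading the input directly from S3
python recipe_job.py -r emr s3://my-recipe-bucket/recipes/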

The problem is that you have a large number of small files. Add a bootstrap step that uses s3distcp to copy the files to EMR, and while using s3distcp try to aggregate the small files into ~128MB files.
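
As a rough sketch (the bucket name, HDFS path, and grouping regex are placeholders), an S3DistCp invocation that concatenates the small recipe files into ~128MB chunks on HDFS could look like this, run on the master node or submitted as an EMR step via command-runner.jar:

# --groupBy concatenates files whose regex capture group matches into a single output file;
# --targetSize is the approximate size of the aggregated output files, in MiB.
s3-dist-cp \
  --src s3://my-recipe-bucket/recipes/ \
  --dest hdfs:///recipes-aggregated/ \
  --groupBy '.*(recipes).*' \
  --targetSize 128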

Hadoop does not handle a large number of small files well.

Since you are manually downloading the files to your computer and running the job there, it runs faster locally.

Once you have copied the files to EMR using S3DistCp, use the files from HDFS.
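
For example (assuming the script and its dependencies are installed on the master node, and reusing the placeholder paths from the sketch above), the job could then be run with MRJob's hadoop runner against the aggregated files on HDFS:

python recipe_job.py -r hadoop hdfs:///recipes-aggregated/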
