
MapReduce job (written in Python) runs slowly on EMR

I am trying to write a MapReduce job using Python's MRJob package. The job processes ~36,000 files stored in S3, each about 2MB. When I run the job locally (after downloading the S3 bucket to my machine), it takes about 1 hour to run. However, when I try to run it on EMR, it takes much longer (I stopped it after 8 hours, with the mapper only 10% complete). I have attached the code for my mapper_init and mapper below. Does anyone know what could cause a problem like this, and how to fix it? I should also note that when I limit the input to a sample of 100 files, the job works fine.

def mapper_init(self):
    """
    Set class variables that will be useful to our mapper:
        filename: the path and filename to the current recipe file
        previous_line: the line previously parsed. We need this because the
          ingredient name is in the line after the tag
    """
    # Requires module-level imports: `import re` and
    # `from nltk.corpus import stopwords` (NLTK).

    #self.filename = os.environ["map_input_file"]  # Not currently used
    self.previous_line = "None yet"
    # Determining if an item is in a list is O(n), while determining if an
    #  item is in a set is O(1)
    self.stopwords = set(stopwords.words('english'))
    # Note: the original code then reassigned self.stopwords from
    # self.stopwords_list, which is never defined anywhere in the class
    # (it would raise AttributeError and clobber the NLTK set), so that
    # line has been dropped.


def mapper(self, _, line):
    """
    Takes a line from an html file and yields ingredient words from it

    Given a line of input from an html file, we check to see if it
    contains the identifier that it is an ingredient. Due to the
    formatting of our html files from allrecipes.com, the ingredient name
    is actually found on the following line. Therefore, we save the
    current line so that it can be referenced in the next pass of the
    function to determine if we are on an ingredient line.

    :param line: a line of text from the html file as a str
    :yield: a tuple containing each word in the ingredient as well as a
        counter for each word. The counter is not currently being used,
        but is left in for future development. e.g. "chicken breast" would
        yield "chicken" and "breast"
    """

    # TODO is there a better way to get the tag?
    if re.search(r'span class="ingredient-name" id="lblIngName"',
                 self.previous_line):
        self.previous_line = line
        line = self.process_text(line)
        line_list = set(line.split())
        for word in line_list:
            if word not in self.stopwords:
                yield (word, 1)
    else:
        self.previous_line = line
    yield ('', 0)
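The two-line tag/name pattern and the stopword filtering in the mapper can be exercised in isolation, without Hadoop or NLTK. The sketch below is an assumption-laden stand-in: the hardcoded STOPWORDS set replaces NLTK's English list, and `.lower()` stands in for the asker's undefined `self.process_text` helper.

```python
import re

# Stand-in for NLTK's stopwords.words('english') (assumption, not the real list)
STOPWORDS = {"a", "an", "the", "of", "and", "in"}

# Same tag regex used in the mapper
INGREDIENT_TAG = r'span class="ingredient-name" id="lblIngName"'

def extract_ingredient_words(lines):
    """Replicates the mapper's logic: when a line matches the ingredient
    tag, the *next* line holds the ingredient name."""
    words = []
    previous_line = "None yet"
    for line in lines:
        if re.search(INGREDIENT_TAG, previous_line):
            # set() removes duplicate words within one ingredient line
            for word in set(line.lower().split()):
                if word not in STOPWORDS:
                    words.append(word)
        previous_line = line
    return sorted(words)

html = [
    '<span class="ingredient-name" id="lblIngName">',
    'skinless chicken breast',
    '<span class="other">',
    'not an ingredient',
]
print(extract_ingredient_words(html))  # ['breast', 'chicken', 'skinless']
```

Running this on a 100-file sample locally is cheap, which is consistent with the observation that the logic itself is fine and the slowdown only appears at EMR scale.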

The problem is that you have too many small files. Add a bootstrap step that copies the files to EMR using s3distcp. While using s3distcp, try to aggregate the small files into ~128MB files.

Hadoop does not perform well with a large number of small files.

Your job runs faster locally because you manually downloaded the files to your machine first.

After copying the files to EMR with S3DistCp, use the files from HDFS.
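As a concrete sketch of that advice, an `s3-dist-cp` invocation along these lines (run as an EMR step or from the master node) would concatenate the ~2MB recipe files into ~128MB chunks on HDFS before the MRJob runs. The bucket name, paths, and `--groupBy` pattern here are hypothetical placeholders for the asker's actual layout.

```shell
# Aggregate small S3 files into ~128MB files on HDFS.
# --groupBy: files whose paths match the same capture group are concatenated.
# --targetSize: approximate size, in MiB, of each output file.
s3-dist-cp \
  --src  s3://my-recipe-bucket/recipes/ \
  --dest hdfs:///input/recipes/ \
  --groupBy '.*/(recipes)/.*\.html' \
  --targetSize 128
```

The MRJob would then be pointed at `hdfs:///input/recipes/` instead of the raw S3 prefix, so each mapper reads one large file rather than paying per-file open overhead thousands of times.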

