Parsing HTML .txt files in Hadoop via MapReduce using Python
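I am very new to the Hadoop platform and to defining MapReduce functions, and I am having trouble understanding why this mapper does not work in my MapReduce script. I am trying to parse a collection of pages written as strings in a .txt file, where each "line" represents one <page>...</page> element. What is wrong with this script? Thanks for your help!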
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.compat import jobconf_from_env
import lxml
import mwparserfromhell
import heapq
import re

class MRParser(MRJob):
    def mapper(self, _, line):
        bigString = ''.join(re.findall(r'(<text xml:space="preserve">.*</text>)',line))
        root = etree.fromstring(bigString.decode('utf-8'))
        if not(bigString == ''):
            bigString = etree.tostring(root,method='text', encoding = "UTF-8")
            wikicode = mwparserfromhell.parse(bigString)
            bigString = wikicode.strip_code()
            yield None, bigString

    def steps(self):
        return [
            MRStep(mapper=self.mapper)
        ]
You are missing a reducer function. You need to pass the lines from the mapper to the reducer as the key (with no value). Try this:
    def mapper(self, _, line):
        bigString = ''.join(re.findall(r'(<text xml:space="preserve">.*</text>)',line))
        root = etree.fromstring(bigString.decode('utf-8'))
        if not(bigString == ''):
            bigString = etree.tostring(root,method='text', encoding = "UTF-8")
            wikicode = mwparserfromhell.parse(bigString)
            bigString = wikicode.strip_code()
            yield bigString, None

    def reducer(self, key, values):
        yield key, None

    def steps(self):
        return [
            MRStep(mapper=self.mapper, reducer=self.reducer)
        ]
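
Independently of the reducer, a few details in the mapper may be worth checking. `import lxml` on its own does not bind the name `etree` (that needs `from lxml import etree`). If the job runs under Python 3, a `str` has no `.decode` method. And `etree.fromstring` is called before the empty-string check, so a line without a `<text>` element reaches the parser as an empty string, which is not a well-formed document. Below is a minimal sketch of the check-then-parse order. It assumes one `<text>` element per line and uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra packages. `extract_text` is a hypothetical helper name, and the `mwparserfromhell` step is left out.

```python
import re
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Same pattern as in the post; the element must sit on a single line.
TEXT_RE = re.compile(r'(<text xml:space="preserve">.*</text>)')


def extract_text(line):
    """Return the plain text inside the <text> element of `line`, or None.

    Hypothetical helper: it checks for a match *before* parsing, so lines
    without a <text> element never reach the XML parser.
    """
    match = TEXT_RE.search(line)
    if match is None:
        return None
    root = ET.fromstring(match.group(1))
    # itertext() walks the element and its children, yielding text pieces.
    return ''.join(root.itertext())


def mapper(line):
    """Generator shaped like the mapper above: text as key, None as value."""
    text = extract_text(line)
    if text is not None:
        yield text, None
```

Inside the `MRJob` class the same order would apply: search first, return early when nothing matches, and only then parse and yield.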