Spark parallelize on iterator with a function
I have an iterator that operates on a sequence of WARC documents and yields a modified list of tokens for each document:
import re
from bs4 import BeautifulSoup

class MyCorpus(object):
    def __init__(self, warc_file_instance):
        self.warc_file = warc_file_instance

    def clean_text(self, html):
        soup = BeautifulSoup(html, "html.parser")  # create a new bs4 object from the html data loaded
        for script in soup(["script", "style"]):   # remove all javascript and stylesheet code
            script.extract()
        # get text
        text = soup.get_text()
        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        return text

    def __iter__(self):
        for r in self.warc_file:
            try:
                w_trec_id = r['WARC-TREC-ID']
                print w_trec_id
            except KeyError:
                pass
            try:
                text = self.clean_text(re.compile(r'Content-Length: \d+').split(r.payload)[1])
                alnum_text = re.sub('[^A-Za-z0-9 ]+', ' ', text)
                yield list(set(alnum_text.encode('utf-8').lower().split()))
            except Exception as e:
                print 'An error occurred:', e
Now I apply Apache Spark's parallelize to further apply the desired map functions:
from operator import add
import warc

warc_file = warc.open('/Users/akshanshgupta/Workspace/00.warc')
documents = MyCorpus(warc_file)
x = sc.parallelize(documents, 20)
data_flat_map = x.flatMap(lambda xs: [(x, 1) for x in xs])
sorted_map = data_flat_map.sortByKey()
counts = sorted_map.reduceByKey(add)
print(counts.max(lambda x: x[1]))
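As an aside, the `sortByKey` step is not needed for correctness: `reduceByKey` does not require sorted input, so it only adds an extra shuffle. What the flatMap/reduceByKey/max pipeline computes can be checked locally without Spark; this is a minimal sketch with hypothetical token lists standing in for what the `MyCorpus` iterator yields:

```python
from collections import Counter

# Hypothetical stand-in for the token lists MyCorpus yields;
# the real data comes from the WARC file.
documents = [
    ["spark", "rdd", "map"],
    ["spark", "partition"],
    ["rdd", "spark"],
]

# Equivalent of flatMap(lambda xs: [(x, 1) for x in xs]) followed by
# reduceByKey(add): sum the 1s per token.
counts = Counter(token for doc in documents for token in doc)

# Equivalent of counts.max(lambda x: x[1]): the most frequent (token, count) pair.
most_common = max(counts.items(), key=lambda kv: kv[1])
print(most_common)  # ('spark', 3)
```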
I have the following doubts:
More from a Scala context, but:
Processing is parallel if you use map, foreachPartition, the DataFrame writer, etc., or read via sc and SparkSession, and the Spark paradigm is generally suited to algorithms without sequential dependencies. mapPartitions and other such APIs are generally used to improve performance.
That function should be part of mapPartitions, I would think, or used in conjunction with map or within a map closure.
Note the serialization issues, see:
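mapPartitions hands your function an iterator over all elements of one partition, so per-partition setup (compiling a regex, opening a connection) runs once per partition rather than once per element. This is a Spark-free sketch of those semantics, with partitions modeled as plain lists and a crude regex tokenizer standing in for the BeautifulSoup cleaning (so tag names leak into the tokens); all names here are hypothetical:

```python
import re

def clean_partition(records):
    # Per-partition setup: compile the regex once per partition, not per record.
    token_re = re.compile(r"[A-Za-z0-9]+")
    for html in records:
        # Crude tokenizer: unlike the BeautifulSoup version, tag names
        # such as "p" or "div" appear among the tokens.
        yield sorted(set(t.lower() for t in token_re.findall(html)))

# Two mock "partitions" of raw documents.
partitions = [
    ["<p>Hello Spark</p>", "<div>Hello again</div>"],
    ["<span>WARC data</span>"],
]

# What rdd.mapPartitions(clean_partition) would compute, partition by partition.
result = [list(clean_partition(part)) for part in partitions]
print(result)
```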
More compute resources allow more scaling, with better performance and throughput.