
Spark parallelise on iterator with a function

I have an iterator which operates on a sequence of WARC documents and yields a modified list of tokens for each document:

import re
from bs4 import BeautifulSoup


class MyCorpus(object):
    def __init__(self, warc_file_instance):
        self.warc_file = warc_file_instance

    def clean_text(self, html):
        soup = BeautifulSoup(html, 'html.parser')  # create a new bs4 object from the html data loaded
        for script in soup(["script", "style"]):   # remove all javascript and stylesheet code
            script.extract()
        # get text
        text = soup.get_text()
        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        return text

    def __iter__(self):
        for r in self.warc_file:
            try:
                w_trec_id = r['WARC-TREC-ID']
                print(w_trec_id)
            except KeyError:
                pass
            try:
                text = self.clean_text(re.compile(r'Content-Length: \d+').split(r.payload)[1])
                alnum_text = re.sub('[^A-Za-z0-9 ]+', ' ', text)
                yield list(set(alnum_text.encode('utf-8').lower().split()))
            except Exception:
                print('An error occurred')

Now I apply Apache Spark's parallelize to further apply the desired map functions:

import warc
from operator import add

# sc is the SparkContext from the PySpark shell / session
warc_file = warc.open('/Users/akshanshgupta/Workspace/00.warc')
documents = MyCorpus(warc_file)
x = sc.parallelize(documents, 20)
data_flat_map = x.flatMap(lambda xs: [(x, 1) for x in xs])
sorted_map = data_flat_map.sortByKey()
counts = sorted_map.reduceByKey(add)
print(counts.max(lambda x: x[1]))

I have the following doubts:

  1. Is this the best way to achieve this, or is there a simpler way?
  2. When I parallelise the iterator, does the actual processing happen in parallel? Or is it still sequential?
  3. What if I have multiple files? How can I scale this to a very large corpus, say TBs?

This is more from a Scala context, but:

  1. One doubt I have is that you call sortByKey before reduceByKey; it is normally cheaper to reduceByKey first and, if needed, sort the much smaller aggregated result afterwards (see the sketch after this list).
  2. Processing is in parallel if you use map, foreachPartition, the DataFrame writer, etc., or read via sc and SparkSession, and the Spark paradigm is generally suited to algorithms without sequential dependencies. mapPartitions and other such APIs are generally used to improve performance. Your function should, I would think, be part of mapPartitions, or otherwise be used in conjunction with map or within a map closure. Note serializable issues, see:

  3. More computing resources allow more scaling, with better performance and throughput.
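
For point 1 (and the mapPartitions remark in point 2), a minimal PySpark sketch might look like the following. It assumes the sc SparkContext and the documents iterator from the question; clean_and_tokenize is a hypothetical stand-alone helper with roughly the same logic as MyCorpus.clean_text plus tokenization, not something from the original post:

from operator import add

# Point 1: aggregate with reduceByKey first; any sorting then only touches
# the much smaller set of (token, count) pairs.
x = sc.parallelize(documents, 20)                  # `documents` as in the question
pairs = x.flatMap(lambda tokens: [(t, 1) for t in tokens])
counts = pairs.reduceByKey(add)
print(counts.max(key=lambda kv: kv[1]))            # most frequent token, no full sort needed
top_sorted = counts.sortBy(lambda kv: kv[1], ascending=False)  # only if a full ranking is wanted

# Point 2: if the raw HTML payloads (rather than pre-tokenized lists) were
# parallelized, the cleaning could run inside mapPartitions, so any per-partition
# setup happens once per partition rather than once per record.
# `clean_and_tokenize` is a hypothetical helper; it must be picklable, i.e.
# defined at module level and not holding things like open file handles.
def tokenize_partition(payloads):
    for payload in payloads:
        for token in set(clean_and_tokenize(payload)):
            yield (token, 1)

# payload_rdd = sc.parallelize(raw_payloads, 20)    # raw_payloads: list of HTML strings
# counts = payload_rdd.mapPartitions(tokenize_partition).reduceByKey(add)

For doubt 3 (many files, TB-scale corpora), the usual approach is to let Spark read the files itself (for example via sc.binaryFiles over a directory of WARC files) rather than iterating over them in the driver, so that the reading itself is distributed across the cluster.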
