Can't convert Beam Python PCollection into list

TypeError: 'PCollection' object does not support indexing

The above error results from trying to convert a PCollection into a list:

filesList = (files | beam.combiners.ToList())

lines = (p | 'read' >> beam.Create(ReadSHP().ReadSHP(filesList))
            | 'map' >> beam.Map(_to_dictionary))

And:

def ReadSHP(self, filesList):
    """
    """
    sf = shp.Reader(shp=filesList[1], dbf=filesList[2])  

How can I fix this problem? Any help is appreciated.

In general, you cannot convert a PCollection to a list.

A PCollection is a collection of items that is potentially unbounded and is unordered. Beam allows you to apply transformations to a PCollection. Applying a PTransform to a PCollection yields another PCollection. The application of a transformation is potentially distributed over a fleet of machines, so in the general case it is impossible to convert such a thing into a collection of elements in local memory.

Combiners are just a special class of PTransforms. They accumulate all the elements they see, apply some combining logic to them, and then output the result of combining. For example, a combiner could look at the incoming elements, sum them up, and output the sum as a result. Such a combiner transforms a PCollection of elements into a PCollection containing the sum of those elements.

beam.combiners.ToList is just another transformation that is applied to a PCollection, potentially over a fleet of worker machines, and yields another PCollection. It doesn't do any complex combining before yielding the output elements: it only accumulates all of the elements it sees into a list and then outputs that list. So it takes the elements that are key-value pairs (on multiple machines), puts them into lists, and outputs those lists.

What is missing is the logic to take those lists from potentially multiple machines and load them into your local program if you need them there. That problem cannot be easily (if at all) solved in a generic way (across all the runners, all possible IOs, and pipeline structures).

One workaround is to add another step to the pipeline that writes the combined outputs (e.g. the sums, or the lists) into common storage, e.g. a table in some database, or a file. Then, when the pipeline finishes, your program can load the results of the pipeline execution from that place.

See the documentation for details:

An alternative option is to use a GCE VM and convert the shapefiles to GeoJSON using tools like ogr2ogr. The GeoJSON can then be loaded into BigQuery and queried using BigQuery GIS.

Here is a blog post with more details:
https://medium.com/google-cloud/how-to-load-geographic-data-like-zipcode-boundaries-into-bigquery-25e4be4391c8
