
Can't convert beam python pcollection into list

TypeError: 'PCollection' object does not support indexing

The error above results from trying to convert a PCollection into a list:

filesList = (files | beam.combiners.ToList())

lines = (p | 'read' >> beam.Create(ReadSHP().ReadSHP(filesList))
            | 'map' >> beam.Map(_to_dictionary))

And:

def ReadSHP(self, filesList):
    """Read a shapefile from its .shp and .dbf components."""
    # Fails here: filesList is a PCollection, which does not support indexing.
    sf = shp.Reader(shp=filesList[1], dbf=filesList[2])

How to fix this problem? Any help is appreciated.

In general you cannot convert a PCollection to a list.

A PCollection is a potentially unbounded, unordered collection of items. Beam allows you to apply transformations to a PCollection: applying a PTransform to a PCollection yields another PCollection, and the application of a transformation is potentially distributed over a fleet of machines. So, in the general case, it is impossible to convert such a thing into a collection of elements in local memory.

Combiners are just a special class of PTransforms. They accumulate all the elements they see, apply some combining logic to them, and then output the result of the combining. For example, a combiner could look at the incoming elements, sum them up, and then output the sum as the result. Such a combiner transforms a PCollection of elements into a PCollection of sums of those elements.

beam.combiners.ToList is just another transformation that is applied to a PCollection, potentially over a fleet of worker machines, and yields another PCollection. It doesn't do any complex combining before yielding the output elements; it only accumulates all of the seen elements into a list and then outputs that list. So it takes the elements (which may live on multiple machines), puts them into lists, and outputs those lists.

What is missing is the logic to take those lists from potentially multiple machines and load them into your local program. That problem cannot be easily (if at all) solved in a generic way across all the runners, all possible IOs, and all pipeline structures.

One workaround is to add another step to the pipeline that writes the combined outputs (e.g. the sums, or the lists) to common storage, e.g. a table in some database or a file. Then, when the pipeline finishes, your program can load the results of the pipeline execution from that location.

See the Apache Beam documentation for details.

An alternative option would be to use a GCE VM and convert the shapefiles to GeoJSON using a tool like ogr2ogr. The GeoJSON can then be loaded into BigQuery and queried using BigQuery GIS.

Here is a blog post with more details:
https://medium.com/google-cloud/how-to-load-geographic-data-like-zipcode-boundaries-into-bigquery-25e4be4391c8
