
Google Cloud Dataflow access .txt file on cloud storage

If I have a .txt file stored on GCS containing a list of words that will be used as part of a beam.Filter, can this list be accessed dynamically within my Apache Beam pipeline? I know that I can define this list as a global variable within the pipeline, but I'm not sure how to read the whole file into a list, or whether there are any Beam tricks to accomplish this. Any suggestions? Here is my current implementation, which is not working.

def boolean_terms(word, term_list):
  if word in term_list:
    return (word, 1)
  else:
    return (word, 0)

# side table
filter_terms = p | beam.io.ReadFromText(path_to_gcs_txt_file)

words = ...

filtered_words = words | beam.FlatMap(lambda x: 
    [boolean_terms(word, filter_terms) for word in x])

I get the following error: "TypeError: argument of type '_InvalidUnpickledPCollection' is not iterable"

You can access the list of words as a side input. The error occurs because filter_terms is a PCollection, a deferred distributed collection rather than an in-memory list, so it can't be iterated directly inside another transform's function; a side input is how you materialize it there. I believe the beam.Filter transform supports side inputs in the filter function in exactly the same way as the FlatMap and ParDo examples in the Beam documentation on side inputs.

Something like:

from apache_beam import pvalue

words | beam.Filter(lambda word, filter_terms: word in filter_terms,
                    filter_terms=pvalue.AsList(p | beam.io.ReadFromText(path)))
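
For completeness, here is a minimal end-to-end sketch of that approach. The GCS paths and step labels are placeholders of my own, and it assumes the terms file has one word per line:

import apache_beam as beam
from apache_beam.pvalue import AsList

# Placeholder paths -- substitute your own bucket and objects.
TERMS_PATH = 'gs://your-bucket/terms.txt'   # one filter term per line
WORDS_PATH = 'gs://your-bucket/words.txt'   # one word per line

with beam.Pipeline() as p:
    # Read the terms file; this yields a PCollection of lines.
    filter_terms = p | 'ReadTerms' >> beam.io.ReadFromText(TERMS_PATH)

    words = p | 'ReadWords' >> beam.io.ReadFromText(WORDS_PATH)

    # AsList materializes the terms PCollection as an in-memory list,
    # which is passed to the filter function as a keyword argument.
    filtered = words | 'FilterWords' >> beam.Filter(
        lambda word, terms: word in terms,
        terms=AsList(filter_terms))

    filtered | 'WriteResults' >> beam.io.WriteToText('gs://your-bucket/filtered')

Note that AsList loads the entire side input into each worker's memory, which is fine for a modest term list. Also, if you want the (word, 0/1) tuples from your boolean_terms function instead of dropping non-matches, use beam.Map with the same side input; it accepts side inputs the same way beam.Filter does.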
