I have a .txt file stored on GCS containing a list of words that will be used as part of a beam.Filter. Can this list be accessed dynamically within my Apache Beam pipeline? I know that I can define the list as a global variable within the pipeline, but I'm not sure how to read the whole file into a list, or whether there are any Beam tricks to accomplish this. Any suggestions? Here is my current implementation, which is not working:
def boolean_terms(word, term_list):
    if word in term_list:
        return (word, 1)
    else:
        return (word, 0)

# side table
filter_terms = p | beam.io.ReadFromText(path_to_gcs_txt_file)
words = ...
filtered_words = words | beam.FlatMap(lambda x:
    [boolean_terms(word, filter_terms) for word in x])
I get the following error: "TypeError: argument of type '_InvalidUnpickledPCollection' is not iterable"
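You can access the list of words as a side input. A PCollection such as filter_terms is a deferred dataset rather than an in-memory list, so it cannot be captured in a lambda and tested with "in", which is what raises the TypeError. I believe the beam.Filter transform supports side inputs in the filter function in exactly the same way as FlatMap and ParDo do in the linked examples.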
Something like:
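# The lambda's first argument is the element; the side input arrives as a keyword argument.
words | beam.Filter(lambda word, filter_terms: word in filter_terms,
                    filter_terms=beam.pvalue.AsList(p | beam.io.ReadFromText(path)))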
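The same side-input idea can be applied to the (word, 0/1) labelling from the question. The following is only a minimal sketch under some assumptions: label_word, run, terms_path and words are hypothetical names, words is assumed to be a plain Python list of individual words (not the nested lists the question's FlatMap iterates over), and the output is simply printed.

```python
def label_word(word, filter_terms):
    """Return (word, 1) if the word is in filter_terms, otherwise (word, 0)."""
    return (word, 1 if word in filter_terms else 0)


def run(terms_path, words):
    """Label each word against the terms file (hypothetical helper)."""
    # Imported here so label_word stays usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        # One element per line of the text file; still a PCollection here.
        filter_terms = p | "ReadTerms" >> beam.io.ReadFromText(terms_path)

        (
            p
            | "CreateWords" >> beam.Create(words)
            # AsList materialises the PCollection as a list for each call.
            | "Label" >> beam.Map(
                label_word, filter_terms=beam.pvalue.AsList(filter_terms))
            | "Print" >> beam.Map(print)
        )
```

Calling run(path_to_gcs_txt_file, ["apple", "pear"]) would print one tuple per word. If the term list is large, membership tests against a list are linear, so converting it to a set before passing it as a side input may be worth considering.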