Process file from a Pub/Sub message in Dataflow streaming
I want to deploy a streaming Dataflow job that listens to a Pub/Sub topic.
The Pub/Sub message content looks like this:
```json
{
    "file_path": "gs://my_bucket_name/my_file.csv",
    "transformations": [
        {
            "column_name": "NAME",
            "transformation": "to_upper"
        },
        {
            "column_name": "SURNAME",
            "transformation": "to_lower"
        }
    ]
}
```
My problem is that I want to process the file the message points at (`file_path`) and apply the given transformation to the corresponding columns of the CSV file.

I have tried several ways to achieve this, but none of them worked, and I'm wondering whether this is simply not possible or whether I'm missing something.
```python
class ProcessMessage(beam.DoFn):

    def process(self, message):
        from apache_beam.pvalue import TaggedOutput
        try:
            file_path = message.get('file_path')
            yield TaggedOutput('file_path', file_path)
        except Exception as e:
            raise Exception(e)

with beam.Pipeline(options=pipeline_options) as p:
    file_path = (
        p | "Read from Pubsub" >> beam.io.ReadFromPubSub(topic=input_topic, timestamp_attribute='ts')
          | "Parse JSON" >> beam.Map(json.loads)
          | "Process Message" >> beam.ParDo(ProcessMessage).with_outputs('file_path')
    )

    file_content = (
        p
        | "Read file" >> beam.io.ReadFromText(file_path)
    )
```
This fails with:

```
file_pattern must be of type string or ValueProvider; got <DoOutputsTuple main_tag=None tags=('file_path',) transform=<ParDo(PTransform) label=[ParDo(ProcessMessage)]> at 0x1441f9550> instead
```
```python
class ReadFile(beam.DoFn):

    def process(self, element):
        import csv
        import io as io_file
        from apache_beam import io

        file_path = element.get('file_path')

        reader = csv.DictReader(io_file.TextIOWrapper(
            io.filesystems.FileSystems.open(file_path),
            encoding='utf-8'),
            delimiter=';')
        for row in reader:
            yield row

with beam.Pipeline(options=pipeline_options) as p:
    message = (
        p | "Read from Pubsub" >> beam.io.ReadFromPubSub(
                topic=pipeline_config.get('input_topic'),
                timestamp_attribute='ts')
          | "Parse JSON" >> beam.Map(json.loads)
          | "Process message" >> beam.ParDo(ProcessMessage())
    )

    file_content = (
        message
        | beam.ParDo(ReadFile())
        | beam.Map(print)
    )
```
This produces no errors, but it also never prints the file lines.

I know this post is a bit long, but I hope someone can help me. Thanks!
Your first solution doesn't work because `ReadFromText` takes a string as its argument, e.g. a bucket path such as `gs://bucket/file`. In your example you feed it a PCollection (the result of the previous PTransform), so it fails. You should use `ReadAllFromText` instead, which takes a PCollection as input, i.e. the result of a previous PTransform.

Your code also needs a small modification: if a DoFn class returns only one type of output, there is no reason to use TaggedOutput, so just yield the value directly.
```python
class ProcessMessage(beam.DoFn):

    def process(self, message):
        try:
            file_path = message.get('file_path')
            yield file_path
        except Exception as e:
            raise Exception(e)
```
Next, `ReadAllFromText` should be chained onto the previous step of the pipeline, not onto `p`:
```python
file_content = (
    p
    | "Read from Pubsub" >> beam.io.ReadFromPubSub(topic=p.options.topic, timestamp_attribute='ts')
    | "Parse JSON" >> beam.Map(json.loads)
    | "Process Message" >> beam.ParDo(ProcessMessage())
    | "Read file" >> beam.io.ReadAllFromText()
)
```
Note that the `file_content` variable will be a PCollection in which each element is a single line of the CSV file, as a string. That makes applying a transformation per column more complicated: the first element will be the header with the column names, and every following element will be a bare row with no column names attached.
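To illustrate why this is awkward, here is a minimal pure-Python sketch (with made-up sample lines, outside Beam): every element emitted by `ReadAllFromText` is a bare string, so the column names have to be re-attached by hand, and in a distributed pipeline there is no guarantee that the header line is even processed first.

```python
import csv

# Simulated output of ReadAllFromText: each PCollection element is one raw
# line of the CSV file, with nothing marking which line was the header.
lines = ["NAME;SURNAME", "john;DOE", "jane;ROE"]

# To get per-column access you must parse the header yourself and zip it
# with every subsequent line.
header = next(csv.reader([lines[0]], delimiter=';'))
rows = [dict(zip(header, next(csv.reader([line], delimiter=';'))))
        for line in lines[1:]]

print(rows)  # [{'NAME': 'john', 'SURNAME': 'DOE'}, {'NAME': 'jane', 'SURNAME': 'ROE'}]
```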
Your second attempt is better suited for this:
```python
class ApplyTransforms(beam.DoFn):

    def process(self, element):
        import csv
        import io

        file_path = element.get('file_path')
        transformations = element.get('transformations')

        with beam.io.gcsio.GcsIO().open(file_path) as file:
            reader = csv.DictReader(io.TextIOWrapper(file, encoding="utf-8"), delimiter=';')
            for row in reader:
                for transform in transformations:
                    col_name = transform.get("column_name")
                    transformation = transform.get("transformation")
                    # apply your transform per row
                yield row
```
Something like this could work, but it is probably a better idea to split it into two classes: one for reading and another for applying the transformations :)
Thanks to @Pav3k's answer I was able to solve the problem. My code is now decoupled and looks like this:
```python
import argparse
import json
import typing

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class MyMessage(typing.NamedTuple):
    # Simple way to propagate all the needed information from the Pub/Sub message.
    file_path: str
    transformations: dict


class ProcessMessage(beam.DoFn):

    def process(self, message):
        """
        Example of the Pub/Sub message:
        {
            "file_path": "gs://my-bucket/file_to_process.csv",
            "transformations": {
                "col_1": "to_upper",
                "col_2": "to_lower"
            }
        }
        """
        yield MyMessage(file_path=message.get('file_path'),
                        transformations=message.get('transformations'))


class ReadFile(beam.DoFn):

    def process(self, element: MyMessage):
        import csv
        import io as io_file
        from apache_beam import io

        reader = csv.DictReader(io_file.TextIOWrapper(
            io.filesystems.FileSystems.open(element.file_path),
            encoding='utf-8'),
            delimiter=';')
        for row in reader:
            # Yields both the row to process and the transformations.
            yield (row, element.transformations)


class Transform(beam.DoFn):  # note: must subclass beam.DoFn, not beam.ParDo

    def to_upper(self, value):
        return value.upper()

    def to_lower(self, value):
        return value.lower()

    def process(self, element):
        """
        Now I know the transformations for each element, so rows may be
        processed in parallel.
        """
        row, transformations = element
        transformed_row = {}
        for key in transformations:
            value = row[key]
            transformation = transformations[key]
            transformed_row[key] = getattr(self, transformation)(value)
        yield transformed_row


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--topic_name", required=True)
    app_args, pipeline_args = parser.parse_known_args()
    pipeline_options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=pipeline_options) as p:
        message = (
            p | "Read from Pubsub" >> beam.io.ReadFromPubSub(
                    topic=app_args.topic_name,
                    timestamp_attribute='ts')
              | "Parse JSON" >> beam.Map(json.loads)
              | "Process message" >> beam.ParDo(ProcessMessage())
        )

        file_content = (
            message
            | beam.ParDo(ReadFile())
            | beam.ParDo(Transform())
            | beam.Map(print)
        )
```