
Write to GCS using TextIO.write() from postgres with header

I have a pipeline running on GCP Dataflow where I read from a SQL instance, collect the data into a PCollection, and then write that PCollection to a CSV file. It seems that while writing to CSV I cannot pass the header at runtime (as a ValueProvider); as given here, the header has to be a String argument.

I have tried providing an empty string and updating it at runtime, but it doesn't work: the file only ever gets the initial empty string as its header.

Is there any way to generate the header inside the pipeline and use that string as the header, or can I pass the header as a runtime argument?

The TextIO code is attached below:

String header = /*header*/;
PCollection<String> output = /*jdbc result*/;

output
    .apply(
        "Write File(s)",
        TextIO.write()
            .to(options.getFilePath())
            .withSuffix(".csv")
            .withHeader(header)
            .withShardNameTemplate("-S-of-N")
            .withTempDirectory(options.getTempDirectory()));

I don't understand the problem; I think you can pass a program argument as a String:

--header=test

Options in Java code:

public interface MyOptions extends PipelineOptions {

    @Description("Header")
    String getHeader();

    void setHeader(String value);
}

Then pass it in the withHeader() call:

output
    .apply(
        "Write File(s)",
        TextIO.write()
            .to(options.getFilePath())
            .withSuffix(".csv")
            .withHeader(options.getHeader())
            .withShardNameTemplate("-S-of-N")
            .withTempDirectory(options.getTempDirectory()));
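With such an options interface, the header can be supplied on the command line when the job is launched. A hypothetical invocation might look like the following; the jar name, bucket paths, and the --filePath/--tempDirectory flag names are assumptions inferred from the getters in the snippets above:

```shell
# Hypothetical launch command -- jar name and GCS paths are placeholders.
java -jar my-pipeline.jar \
  --runner=DataflowRunner \
  --header="id,name,email" \
  --filePath=gs://my-bucket/output/result \
  --tempDirectory=gs://my-bucket/tmp
```

Note that because withHeader() takes a plain String, the value is baked in when the pipeline graph is constructed, so this works for a freshly launched job but not for a Dataflow template whose options are deferred via ValueProvider.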

If you want, you can also compute the header in your code, outside the pipeline.

Currently withHeader is an argument that has to be specified at construction time, so it cannot be provided from PCollection element values.

You might be able to do this by breaking your pipeline into two pipelines, or by generating/discovering the header value within your program at the point where the Beam pipeline is started.
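The second suggestion can be sketched in plain Java: build the header string in the launcher before the pipeline is constructed, so that by the time withHeader() is called it is an ordinary compile-time-style constant. The column names here are hardcoded as an assumption; in a real launcher they would come from JDBC metadata (e.g. ResultSetMetaData.getColumnName) queried before pipeline construction.

```java
import java.util.List;

public class HeaderBuilder {

    // Joins column names into a single CSV header line.
    static String csvHeader(List<String> columns) {
        return String.join(",", columns);
    }

    public static void main(String[] args) {
        // Assumption: in practice these names would be discovered from the
        // database (JDBC ResultSetMetaData) before the pipeline is built.
        List<String> columns = List.of("id", "name", "email");
        String header = csvHeader(columns);
        System.out.println(header); // prints: id,name,email

        // The header is now a construction-time value, so it can be passed
        // directly to the write transform, e.g.:
        //   TextIO.write().withHeader(header)...
    }
}
```

Because the header is computed before Pipeline.run(), this sidesteps the construction-time restriction entirely, at the cost of one extra query against the source database from the launcher process.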
