
Apache Beam: Writing values of key, value pairs to files according to key

I would like to write the values from key, value pairs to text files in GCS using writeDynamic() from FileIO in Apache Beam (with Java).

So far, I am reading the data from BigQuery, transforming it into key, value pairs, and then trying to use FileIO with writeDynamic() to write the values into one file per key.

PCollection<TableRow> inputRows = p.apply(BigQueryIO.readTableRows()
    .from(tableSpec)
    .withMethod(Method.DIRECT_READ)
    .withSelectedFields(Lists.newArrayList("id", "string1", "string2", "string3", "int1")));

inputRows.apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
    .via(tableRow -> KV.of((Integer) tableRow.get("id"),(String) tableRow.get("string1"))))
    .apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)
    .withDestinationCoder(StringUtf8Coder.of())
    .via(Contextful.fn(KV::getValue), TextIO.sink())
    .to("gs://bucket/output")
    .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));

I get the error:

The method apply
  (PTransform<? super PCollection<KV<Integer,String>>,OutputT>) 
  in the type PCollection<KV<Integer,String>> 
  is not applicable for the arguments 
  (FileIO.Write<String,KV<String,String>>)

There is a type mismatch. Note that the TableRow elements are parsed into KV<Integer, String> in MapElements (i.e. the key is an Integer), whereas the write step then expects a String key, as in .apply(FileIO.<String, KV<String, String>>writeDynamic():

inputRows.apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
    .via(tableRow -> KV.of((Integer) tableRow.get("id"),(String) tableRow.get("string1"))))
    .apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)
    ...

To avoid having to convert the key again when using .by(KV::getKey), I would suggest casting it to a String beforehand:

inputRows
    .apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
        .via(tableRow -> KV.of((String) tableRow.get("id"),(String) tableRow.get("name"))))
    .apply(FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
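
Alternatively, if you prefer to keep the Integer key from the original MapElements step, the conversion can be done inside .by() instead, since the first type parameter of writeDynamic() is the destination type. A minimal, untested sketch of that variant (field names taken from the question's code):

inputRows
    .apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.strings()))
        .via(tableRow -> KV.of((Integer) tableRow.get("id"), (String) tableRow.get("string1"))))
    .apply(FileIO.<String, KV<Integer, String>>writeDynamic()
        // convert the Integer key to the String destination here
        .by(kv -> String.valueOf(kv.getKey()))
        .withDestinationCoder(StringUtf8Coder.of())
        .via(Contextful.fn(KV::getValue), TextIO.sink())
        .to("gs://bucket/output")
        .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));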

As an example, I tested the suggested approach with the public table bigquery-public-data:london_bicycles.cycle_stations, where each cycle station is written to a different file:

$ cat output/file-746-00000-of-00004.txt 
Lots Road, West Chelsea

$ bq query --use_legacy_sql=false "SELECT name FROM \`bigquery-public-data.london_bicycles.cycle_stations\` WHERE id = 746"
Waiting on bqjob_<ID> ... (0s) Current status: DONE   
+-------------------------+
|          name           |
+-------------------------+
| Lots Road, West Chelsea |
+-------------------------+

Full code:

package com.dataflow.samples;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Lists;


public abstract class DynamicGCSWrites {

    public interface Options extends PipelineOptions {
        @Validation.Required
        @Description("Output Path i.e. gs://BUCKET/path/to/output/folder")
        String getOutput();
        void setOutput(String s);
    }

    public static void main(String[] args) {

        DynamicGCSWrites.Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DynamicGCSWrites.Options.class);

        Pipeline p = Pipeline.create(options);

        String output = options.getOutput();

        // Read only the columns needed (id, name) from the public table
        PCollection<TableRow> inputRows = p
            .apply(BigQueryIO.readTableRows()
                .from("bigquery-public-data:london_bicycles.cycle_stations")
                .withMethod(Method.DIRECT_READ)
                .withSelectedFields(Lists.newArrayList("id", "name")));

        // Key each row by station id and write one file per key (destination)
        inputRows
            .apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via(tableRow -> KV.of((String) tableRow.get("id"),(String) tableRow.get("name"))))
            .apply(FileIO.<String, KV<String, String>>writeDynamic()
                .by(KV::getKey)
                .withDestinationCoder(StringUtf8Coder.of())
                .via(Contextful.fn(KV::getValue), TextIO.sink())
                .to(output)
                .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));

        p.run().waitUntilFinish();
    }
}
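
For reference, this could be launched with something like the following, assuming a standard Maven project with the exec plugin available (the invocation itself is an assumption; only the --output option comes from the code above):

$ mvn compile exec:java \
    -Dexec.mainClass=com.dataflow.samples.DynamicGCSWrites \
    -Dexec.args="--output=gs://BUCKET/path/to/output/folder"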
