
Write to Cloud Storage from Dataflow with dynamic datetime

We're building an Apache Beam (Java SDK) pipeline that writes to Cloud Storage using the TextIO.write() transform. Within this write operation, we would like to dynamically change the subdirectory where the file is stored based on the current datetime.

This is part of a streaming pipeline. Ideally, we would like to deploy it and let the Beam job dynamically change the folder subdirectory where it saves the file based on the processing datetime.

Our current pipeline transform looks like this:

DateTimeFormatter dtfOut = DateTimeFormat.forPattern("yyyy/MM/dd");

myPCollection.apply(TextIO.write()
        .to(String.format("gs://my-bucket/%s", dtfOut.print(new DateTime()))));

The problem with this code is that the DateTime value is evaluated only once, when the pipeline graph is constructed and deployed to Google Cloud Dataflow, so every file ends up in the same subdirectory. We would like the subdirectory structure to change dynamically based on the datetime at the time each incoming message is processed.

I was going to:

  1. Get the current datetime from a ParDo function.
  2. Create another ParDo function that takes the message as its main input and the datetime from the first ParDo as a side input.

Is this the best approach? Are there built-in tools in Apache Beam that can solve our use case?

FileIO provides the writeDynamic() method, which allows each element of the PCollection to be directed to a different directory or file based on the contents of the element itself.

Below is a simple example I created just to demonstrate it:

import java.io.Serializable;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.Create;

public class ExamplePipeline {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        Create.Values<String> sampleData = Create.of("ABC", "DEF", "GHI");

        pipeline.apply(sampleData)
                .apply("WritingDynamic", FileIO.<PartitionData, String>writeDynamic()
                        // Decide the destination (partition) for each element.
                        .by(event -> new PartitionData())
                        // A coder for the destination type is required.
                        .withDestinationCoder(AvroCoder.of(PartitionData.class))
                        // What actually gets written for each element.
                        .via(Contextful.fn(event -> event), TextIO.sink())
                        // Base path; the destination-specific part comes from withNaming.
                        .to("gs://my-bucket/")
                        .withNaming(partitionData -> FileIO.Write.defaultNaming(partitionData.getPath(), ".txt")));

        pipeline.run().waitUntilFinish();
    }

    public static class PartitionData implements Serializable {
        private static final long serialVersionUID = 4549174774076944665L;

        public String getPath() {
            // The path is derived from the processing datetime, i.e. when the file is written.
            LocalDateTime writingMoment = LocalDateTime.now(ZoneOffset.UTC);
            int year = writingMoment.getYear();
            int month = writingMoment.getMonthValue();
            int day = writingMoment.getDayOfMonth();

            return String.format("%d/%02d/%02d/", year, month, day);
        }
    }
}

The code above will save files into a structure like: gs://my-bucket/${year}/${month}/${day}/... .txt

In the by() method I used my data partitioner, which I called PartitionData.

The via() function should return what you finally want written as the file content.
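
For instance, if you wanted to transform each element before it is written, the mapping inside via() is the place to do it. A variation of the line in the example above (illustrative only):

.via(Contextful.fn(event -> "processed: " + event), TextIO.sink())

TextIO.sink() then writes each resulting string as one line of the output file.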

The to() is going to be the base part of your path.

And in withNaming you really construct the final part of your path.
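
If defaultNaming does not give you enough control, withNaming() also accepts a function that returns a fully custom FileIO.Write.FileNaming. A sketch (the filename pattern here is just an example, not from the original pipeline):

.withNaming(partitionData -> (window, pane, numShards, shardIndex, compression) ->
        String.format("%soutput-%05d-of-%05d.txt",
                partitionData.getPath(), shardIndex, numShards))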

Normally the events themselves would carry timestamps telling when they actually happened; then you could derive the partition from the event instead of using LocalDateTime.now().
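
A minimal sketch of that event-time variant, assuming a hypothetical MyEvent type with getTimestampMillis() and getPayload() accessors (these names are illustrative, not part of Beam; it also needs java.time.Instant in addition to the imports above):

public static class EventTimePartition implements Serializable {
    private int year;
    private int month;
    private int day;

    private EventTimePartition() {} // no-arg constructor for AvroCoder's reflection

    public EventTimePartition(long eventTimestampMillis) {
        // Partition by when the event happened, not when the file is written.
        LocalDateTime eventMoment = Instant.ofEpochMilli(eventTimestampMillis)
                .atOffset(ZoneOffset.UTC).toLocalDateTime();
        this.year = eventMoment.getYear();
        this.month = eventMoment.getMonthValue();
        this.day = eventMoment.getDayOfMonth();
    }

    public String getPath() {
        return String.format("%d/%02d/%02d/", year, month, day);
    }
}

// Usage, given a PCollection<MyEvent> named events:
events.apply("WritingByEventTime", FileIO.<EventTimePartition, MyEvent>writeDynamic()
        .by(event -> new EventTimePartition(event.getTimestampMillis()))
        .withDestinationCoder(AvroCoder.of(EventTimePartition.class))
        .via(Contextful.fn(event -> event.getPayload()), TextIO.sink())
        .to("gs://my-bucket/")
        .withNaming(partition -> FileIO.Write.defaultNaming(partition.getPath(), ".txt")));

Keep in mind that for an unbounded (streaming) input, FileIO also requires you to window the collection (or set a trigger) and to specify the number of output shards explicitly via withNumShards(), since it cannot choose a shard count for an unbounded write.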
