GCP Dataflow 2.0 PubSub到GCS

Question

I'm having a difficult time understanding the concepts of .withFileNamePolicy of TextIO.write(). 我很难理解TextIO.write（）的.withFileNamePolicy的概念。 The requirements for supplying a FileNamePolicy seem incredibly complex for doing something as simple as specifying a GCS bucket to write streamed filed. 提供FileNamePolicy的要求似乎非常复杂，因为它可以像指定GCS存储桶来编写流式字段一样简单。

At a high level, I have JSON messages being streamed to a PubSub topic, and I'd like to write those raw messages to files in GCS for permanent storage (I'll also be doing other processing on the messages). 在高层次上，我将JSON消息流式传输到PubSub主题，并且我想将这些原始消息写入GCS中的文件以进行永久存储（我还将对消息进行其他处理）。 I initially started with this Pipeline, thinking it would be pretty simple: 我最初开始使用这个Pipeline，认为这很简单：

public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

        Pipeline p = Pipeline.create(options); 

        p.apply("Read From PubSub", PubsubIO.readStrings().fromTopic(topic))
            .apply("Write to GCS", TextIO.write().to(gcs_bucket);

        p.run();

    }

I got the error about needing WindowedWrites, which I applied, and then needing a FileNamePolicy. 我收到了需要WindowedWrites的错误，我申请了，然后需要FileNamePolicy。 This is where things get hairy. 这是事情变得多毛的地方。

I went to the Beam docs and checked out FilenamePolicy . 我去了梁文档并检查了FilenamePolicy 。 It looks like I would need to extend this class which then also require extending other abstract classes to make this work. 看起来我需要扩展这个类，然后还需要扩展其他抽象类来使其工作。 Unfortunately the documentation on Apache is a bit scant and I can't find any examples for Dataflow 2.0 doing this, except for The Wordcount Example , which even then uses implements these details in a helper class. 不幸的是，关于Apache的文档有点不足，我找不到Dataflow 2.0这样做的任何示例，除了Wordcount示例，它甚至用于在帮助器类中实现这些细节。

So I could probably make this work just by copying much of the WordCount example, but I'm trying to better understand the details of this. 所以我可以通过复制WordCount的大部分示例来完成这项工作，但我正在努力更好地理解这个细节。 A few questions I have: 我有几个问题：

1) Is there any roadmap item to abstract a lot of this complexity? 1）是否有任何路线图项目可以抽象出很多这种复杂性？ It seems like I should be able to do supply a GCS bucket like I would in a nonWindowedWrite, and then just supply a few basic options like the timing and file naming rule. 看起来我应该像在非WindowsWrite中一样提供GCS存储桶，然后只提供一些基本选项，如时序和文件命名规则。 I know writing streaming windowed data to files is more complex than just opening a file pointer (or object storage equivalent). 我知道将流窗口数据写入文件比打开文件指针（或对象存储等效）更复杂。

2) It looks like to make this work, I need to create a WindowedContext object which requires supplying a BoundedWindow abstract class, and PaneInfo Object Class, and then some shard info. 2）看起来要做到这一点，我需要创建一个WindowedContext对象，它需要提供一个BoundedWindow抽象类，PaneInfo对象类，然后是一些分片信息。 The information available for these is pretty bare and I'm having a hard time knowing what is actually needed for all of these, especially given my simple use case. 可用于这些的信息非常简单，我很难知道所有这些实际需要什么，特别是考虑到我的简单用例。 Are there any good examples available that implement these? 有没有很好的例子可以实现这些？ In addition, it also looks like I need the set the # of shards as part of TextIO.write, but then also supply # shards as part of the fileNamePolicy? 另外，看起来我还需要将#dith片段设置为TextIO.write的一部分，但是还提供#shads作为fileNamePolicy的一部分？

Thanks for anything in helping me understand the details behind this, hoping to learn a few things! 感谢您帮助我理解这背后的细节，希望学到一些东西！

Edit 7/20/17 So I finally got this pipeline to run with extending the FilenamePolicy. 编辑7/20/17所以我终于通过扩展FilenamePolicy来运行此管道。 My challenge was needing to define the window of the streaming data from PubSub. 我的挑战是需要从PubSub定义流数据的窗口。 Here is a pretty close representation of the code: 这是代码的非常接近的表示：

public class ReadData {
    public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

        Pipeline p = Pipeline.create(options);

        p.apply("Read From PubSub", PubsubIO.readStrings().fromTopic(topic))
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply("Write to GCS", TextIO.write().to("gcs_bucket")
                .withWindowedWrites()
                .withFilenamePolicy(new TestPolicy())
                .withNumShards(10));

        p.run();

    }
}

class TestPolicy extends FileBasedSink.FilenamePolicy {
    @Override
    public ResourceId windowedFilename(
        ResourceId outputDirectory, WindowedContext context, String extension) {
        IntervalWindow window = (IntervalWindow) context.getWindow();
        String filename = String.format(
            "%s-%s-%s-%s-of-%s.json",
            "test",
            window.start().toString(),
            window.end().toString(),
            context.getShardNumber(),
            context.getShardNumber()
        );
        return outputDirectory.resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
    }

    @Override
    public ResourceId unwindowedFilename(
        ResourceId outputDirectory, Context context, String extension) {
        throw new UnsupportedOperationException("Unsupported.");
    }
}

Answer 1

In Beam 2.0, the below is an example of writing the raw messages from PubSub out into windowed files on GCS. 在Beam 2.0中，下面是将PubSub中的原始消息写入GCS上的窗口文件的示例。 The pipeline is fairly configurable, allowing you to specify the window duration via a parameter and a sub directory policy if you want logical subsections of your data for ease of reprocessing / archiving. 管道是相当可配置的，如果您想要数据的逻辑子部分以便于重新处理/存档，则允许您通过参数和子目录策略指定窗口持续时间。 Note that this has an additional dependency on Apache Commons Lang 3. 请注意，这对Apache Commons Lang 3有额外的依赖性。

PubSubToGcs PubSubToGcs

/**
 * This pipeline ingests incoming data from a Cloud Pub/Sub topic and
 * outputs the raw data into windowed files at the specified output
 * directory.
 */
public class PubsubToGcs {

  /**
   * Options supported by the pipeline.
   * 
   * <p>Inherits standard configuration options.</p>
   */
  public static interface Options extends DataflowPipelineOptions, StreamingOptions {
    @Description("The Cloud Pub/Sub topic to read from.")
    @Required
    ValueProvider<String> getTopic();
    void setTopic(ValueProvider<String> value);

    @Description("The directory to output files to. Must end with a slash.")
    @Required
    ValueProvider<String> getOutputDirectory();
    void setOutputDirectory(ValueProvider<String> value);

    @Description("The filename prefix of the files to write to.")
    @Default.String("output")
    @Required
    ValueProvider<String> getOutputFilenamePrefix();
    void setOutputFilenamePrefix(ValueProvider<String> value);

    @Description("The shard template of the output file. Specified as repeating sequences "
        + "of the letters 'S' or 'N' (example: SSS-NNN). These are replaced with the "
        + "shard number, or number of shards respectively")
    @Default.String("")
    ValueProvider<String> getShardTemplate();
    void setShardTemplate(ValueProvider<String> value);

    @Description("The suffix of the files to write.")
    @Default.String("")
    ValueProvider<String> getOutputFilenameSuffix();
    void setOutputFilenameSuffix(ValueProvider<String> value);

    @Description("The sub-directory policy which files will use when output per window.")
    @Default.Enum("NONE")
    SubDirectoryPolicy getSubDirectoryPolicy();
    void setSubDirectoryPolicy(SubDirectoryPolicy value);

    @Description("The window duration in which data will be written. Defaults to 5m. "
        + "Allowed formats are: "
        + "Ns (for seconds, example: 5s), "
        + "Nm (for minutes, example: 12m), "
        + "Nh (for hours, example: 2h).")
    @Default.String("5m")
    String getWindowDuration();
    void setWindowDuration(String value);

    @Description("The maximum number of output shards produced when writing.")
    @Default.Integer(10)
    Integer getNumShards();
    void setNumShards(Integer value);
  }

  /**
   * Main entry point for executing the pipeline.
   * @param args  The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    Options options = PipelineOptionsFactory
        .fromArgs(args)
        .withValidation()
        .as(Options.class);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   * 
   * @param options The execution parameters to the pipeline.
   * @return  The result of the pipeline execution.
   */
  public static PipelineResult run(Options options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /**
     * Steps:
     *   1) Read string messages from PubSub
     *   2) Window the messages into minute intervals specified by the executor.
     *   3) Output the windowed files to GCS
     */
    pipeline
      .apply("Read PubSub Events",
        PubsubIO
          .readStrings()
          .fromTopic(options.getTopic()))
      .apply(options.getWindowDuration() + " Window", 
          Window
            .into(FixedWindows.of(parseDuration(options.getWindowDuration()))))
      .apply("Write File(s)",
          TextIO
            .write()
            .withWindowedWrites()
            .withNumShards(options.getNumShards())
            .to(options.getOutputDirectory())
            .withFilenamePolicy(
                new WindowedFilenamePolicy(
                    options.getOutputFilenamePrefix(),
                    options.getShardTemplate(),
                    options.getOutputFilenameSuffix())
                .withSubDirectoryPolicy(options.getSubDirectoryPolicy())));

    // Execute the pipeline and return the result.
    PipelineResult result = pipeline.run();

    return result;
  }

  /**
   * Parses a duration from a period formatted string. Values
   * are accepted in the following formats:
   * <p>
   * Ns - Seconds. Example: 5s<br>
   * Nm - Minutes. Example: 13m<br>
   * Nh - Hours. Example: 2h
   * 
   * <pre>
   * parseDuration(null) = NullPointerException()
   * parseDuration("")   = Duration.standardSeconds(0)
   * parseDuration("2s") = Duration.standardSeconds(2)
   * parseDuration("5m") = Duration.standardMinutes(5)
   * parseDuration("3h") = Duration.standardHours(3)
   * </pre>
   * 
   * @param value The period value to parse.
   * @return  The {@link Duration} parsed from the supplied period string.
   */
  private static Duration parseDuration(String value) {
    Preconditions.checkNotNull(value, "The specified duration must be a non-null value!");

    PeriodParser parser = new PeriodFormatterBuilder()
      .appendSeconds().appendSuffix("s")
      .appendMinutes().appendSuffix("m")
      .appendHours().appendSuffix("h")
      .toParser();

    MutablePeriod period = new MutablePeriod();
    parser.parseInto(period, value, 0, Locale.getDefault());

    Duration duration = period.toDurationFrom(new DateTime(0));
    return duration;
  }
}

WindowedFilenamePolicy WindowedFilenamePolicy

/**
 * The {@link WindowedFilenamePolicy} class will output files
 * to the specified location with a format of output-yyyyMMdd'T'HHmmssZ-001-of-100.txt.
 */
@SuppressWarnings("serial")
public class WindowedFilenamePolicy extends FilenamePolicy {

    /**
     * Possible sub-directory creation modes.
     */
    public static enum SubDirectoryPolicy {
        NONE("."),
        PER_HOUR("yyyy-MM-dd/HH"),
        PER_DAY("yyyy-MM-dd");

        private final String subDirectoryPattern;

        private SubDirectoryPolicy(String subDirectoryPattern) {
            this.subDirectoryPattern = subDirectoryPattern;
        }

        public String getSubDirectoryPattern() {
            return subDirectoryPattern;
        }

        public String format(Instant instant) {
            DateTimeFormatter formatter = DateTimeFormat.forPattern(subDirectoryPattern);
            return formatter.print(instant);
        }
    }

    /**
     * The formatter used to format the window timestamp for outputting to the filename.
     */
    private static final DateTimeFormatter formatter = ISODateTimeFormat
            .basicDateTimeNoMillis()
            .withZone(DateTimeZone.getDefault());

    /**
     * The filename prefix.
     */
    private final ValueProvider<String> prefix;

    /**
     * The filenmae suffix.
     */
    private final ValueProvider<String> suffix;

    /**
     * The shard template used during file formatting.
     */
    private final ValueProvider<String> shardTemplate;

    /**
     * The policy which dictates when or if sub-directories are created
     * for the windowed file output.
     */
    private ValueProvider<SubDirectoryPolicy> subDirectoryPolicy = StaticValueProvider.of(SubDirectoryPolicy.NONE);

    /**
     * Constructs a new {@link WindowedFilenamePolicy} with the
     * supplied prefix used for output files.
     * 
     * @param prefix    The prefix to append to all files output by the policy.
     * @param shardTemplate The template used to create uniquely named sharded files.
     * @param suffix    The suffix to append to all files output by the policy.
     */
    public WindowedFilenamePolicy(String prefix, String shardTemplate, String suffix) {
        this(StaticValueProvider.of(prefix), 
                StaticValueProvider.of(shardTemplate),
                StaticValueProvider.of(suffix));
    }

    /**
     * Constructs a new {@link WindowedFilenamePolicy} with the
     * supplied prefix used for output files.
     * 
     * @param prefix    The prefix to append to all files output by the policy.
     * @param shardTemplate The template used to create uniquely named sharded files.
     * @param suffix    The suffix to append to all files output by the policy.
     */
    public WindowedFilenamePolicy(
            ValueProvider<String> prefix, 
            ValueProvider<String> shardTemplate, 
            ValueProvider<String> suffix) {
        this.prefix = prefix;
        this.shardTemplate = shardTemplate;
        this.suffix = suffix; 
    }

    /**
     * The subdirectory policy will create sub-directories on the
     * filesystem based on the window which has fired.
     * 
     * @param policy    The subdirectory policy to apply.
     * @return The filename policy instance.
     */
    public WindowedFilenamePolicy withSubDirectoryPolicy(SubDirectoryPolicy policy) {
        return withSubDirectoryPolicy(StaticValueProvider.of(policy));
    }

    /**
     * The subdirectory policy will create sub-directories on the
     * filesystem based on the window which has fired.
     * 
     * @param policy    The subdirectory policy to apply.
     * @return The filename policy instance.
     */
    public WindowedFilenamePolicy withSubDirectoryPolicy(ValueProvider<SubDirectoryPolicy> policy) {
        this.subDirectoryPolicy = policy;
        return this;
    }

    /**
     * The windowed filename method will construct filenames per window in the
     * format of output-yyyyMMdd'T'HHmmss-001-of-100.txt.
     */
    @Override
    public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext c, String extension) {
        Instant windowInstant = c.getWindow().maxTimestamp();
        String datetimeStr = formatter.print(windowInstant.toDateTime());

        // Remove the prefix when it is null so we don't append the literal 'null'
        // to the start of the filename
        String filenamePrefix = prefix.get() == null ? datetimeStr : prefix.get() + "-" + datetimeStr;
        String filename = DefaultFilenamePolicy.constructName(
                filenamePrefix, 
                shardTemplate.get(), 
                StringUtils.defaultIfBlank(suffix.get(), extension),  // Ignore the extension in favor of the suffix.
                c.getShardNumber(), 
                c.getNumShards());

        String subDirectory = subDirectoryPolicy.get().format(windowInstant);
        return outputDirectory
                .resolve(subDirectory, StandardResolveOptions.RESOLVE_DIRECTORY)
                .resolve(filename, StandardResolveOptions.RESOLVE_FILE);
    }

    /**
     * Unwindowed writes are unsupported by this filename policy so an {@link UnsupportedOperationException}
     * will be thrown if invoked.
     */
    @Override
    public ResourceId unwindowedFilename(ResourceId outputDirectory, Context c, String extension) {
    throw new UnsupportedOperationException("There is no windowed filename policy for unwindowed file"
        + " output. Please use the WindowedFilenamePolicy with windowed writes or switch filename policies.");
    }
}

Answer 2

In Beam currently the DefaultFilenamePolicy supports windowed writes, so there's no need to write a custom FilenamePolicy. 在Beam中，DefaultFilenamePolicy当前支持窗口写入，因此无需编写自定义FilenamePolicy。 You can control the output filename by putting W and P placeholders (for the window and pane respectively) in the filename template. 您可以通过在文件名模板中放置W和P占位符（分别用于窗口和窗格）来控制输出文件名。 This exists in the head beam repository, and will also be in the upcoming Beam 2.1 release (which is being released as we speak). 这存在于头梁存储库中，也将出现在即将发布的Beam 2.1版本中（正如我们所说，它正在发布）。

GCP Dataflow 2.0 PubSub到GCS

问题描述

2 个解决方案

解决方案1
6 2017-07-22 15:37:58

解决方案2
1 2017-07-18 17:53:54

GCP Dataflow 2.0 PubSub到GCS

问题描述

2 个解决方案

解决方案1 6 2017-07-22 15:37:58

解决方案2 1 2017-07-18 17:53:54

解决方案1
6 2017-07-22 15:37:58

解决方案2
1 2017-07-18 17:53:54