Google Cloud Dataflow: submitted job is executing, but using old code

Question

I am writing a Dataflow pipeline that should do 3 things:

Read .csv files from GCP Storage
Parse the data into BigQuery-compatible TableRows
Write the data to a BigQuery table

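Until now this all worked like a charm. It still runs, but when I change the source and destination variables, nothing changes. The job that actually runs is an old one, not the recently changed (and committed) code. When I run the code from Eclipse using the BlockingDataflowPipelineRunner, the code itself is not uploaded and an older version is used instead.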

Normally there is nothing wrong with the code, but to be as complete as possible:

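import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;
import java.util.ArrayList;
import java.util.List;

public class BatchPipeline {
    public static void main(String[] args) {
        String source = "gs://sourcebucket/*.csv";
        String destination = "projectID:datasetID.testing1";

        //Creation of the pipeline with the options passed on the command line
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

        PCollection<String> line = p.apply(TextIO.Read.named("ReadFromCloudStorage")
                .from(source));

        @SuppressWarnings("serial")
        PCollection<TableRow> tablerows = line.apply(ParDo.named("ParsingCSVLines").of(new DoFn<String, TableRow>(){
            @Override
            public void processElement(ProcessContext c){
                 //processing code goes here
            }
        }));

        //Defining the BigQuery table schema
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("datetime").setType("TIMESTAMP").setMode("REQUIRED"));
        fields.add(new TableFieldSchema().setName("consumption").setType("FLOAT").setMode("REQUIRED"));
        fields.add(new TableFieldSchema().setName("meterID").setType("STRING").setMode("REQUIRED"));
        TableSchema schema = new TableSchema().setFields(fields);
        String table = destination;

        tablerows.apply(BigQueryIO.Write
                .named("BigQueryWrite")
                .to(table)
                .withSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withoutValidation());

        //Runs the pipeline
        p.run();
    }
}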

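The problem appeared because I just switched laptops and had to reconfigure everything. I am working on a clean Ubuntu 16.04 LTS OS with all the dependencies for GCP development installed (normally). Everything should be configured fine, since I am able to start a job (that shouldn't be possible if my config were wrong, right?). I'm using Eclipse Neon, btw.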

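So where could the problem lie? It seems to me that something goes wrong when uploading the code, but I have made sure that my cloud git repo is up to date and that the staging bucket has been cleaned up...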

**** UPDATE ****

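I never found out exactly what was going wrong, but when I checked the creation dates of the files inside my deployed jar, I saw that they had indeed never been updated. The jar file itself, however, had a recent timestamp, which made me overlook the problem completely (rookie mistake).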
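The check described in the update (a jar with a fresh timestamp but stale contents) can be automated. Below is a minimal sketch using only the JDK's `java.util.jar` API. The class name `JarEntryAges`, the method `entriesOlderThan`, and the idea of passing a cutoff instant are illustrative assumptions, not part of the Dataflow SDK or of the original post.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper (not part of any SDK): lists jar entries older than a cutoff.
final class JarEntryAges {

    /** Returns the names of non-directory entries last modified before the cutoff. */
    static List<String> entriesOlderThan(Path jar, Instant cutoff) throws IOException {
        List<String> stale = new ArrayList<>();
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Enumeration<JarEntry> entries = jarFile.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // getTime() returns -1 when the entry carries no timestamp; skip those.
                long millis = entry.getTime();
                if (millis != -1 && Instant.ofEpochMilli(millis).isBefore(cutoff)) {
                    stale.add(entry.getName());
                }
            }
        }
        return stale;
    }

    // Usage: java JarEntryAges <path-to-jar> <ISO-8601 cutoff, e.g. 2017-01-01T00:00:00Z>
    public static void main(String[] args) throws IOException {
        Path jar = Paths.get(args[0]);
        Instant cutoff = Instant.parse(args[1]);
        for (String name : entriesOlderThan(jar, cutoff)) {
            System.out.println(name);
        }
    }
}
```

Choosing a cutoff just before the last code change makes any class file that was not rebuilt show up in the output, regardless of how recent the jar file's own timestamp is.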

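I eventually got everything working again by creating a new Dataflow project in Eclipse and copying my .java files from the broken project into the new one. Everything has worked like a charm since then.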

Answer 1

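After submitting a Dataflow job, you can check which artifacts are part of the job specification by inspecting the files that belong to the job description, which are available via DataflowPipelineWorkerPoolOptions#getFilesToStage. The code snippet below gives an example of how to get this information.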

PipelineOptions myOptions = ...
myOptions.setRunner(DataflowPipelineRunner.class);
Pipeline p = Pipeline.create(myOptions);

// Build up your pipeline and run it.
p.apply(...);
p.run();

// At this point in time, the files which were staged by the 
// DataflowPipelineRunner will have been populated into the
// DataflowPipelineWorkerPoolOptions#getFilesToStage
List<String> stagedFiles = myOptions.as(DataflowPipelineWorkerPoolOptions.class).getFilesToStage();
for (String stagedFile : stagedFiles) {
  System.out.println(stagedFile);
}

The code above should print something like the following:

/my/path/to/file/dataflow.jar
/another/path/to/file/myapplication.jar
/a/path/to/file/alibrary.jar

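It is likely that one of the resources being uploaded for the job is stale and still contains your old code. Look through all the directories and jars in the staging list, find every instance of BatchPipeline, and verify how old each one is. Jar files can be extracted with the jar tool or any zip file reader. Alternatively, use javap or any other class file inspector to verify that the BatchPipeline class file matches the changes you expect to have made.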
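The search described above can also be scripted. The following is a minimal sketch, using only the JDK, that takes the list printed from getFilesToStage and reports every copy of a class file together with its last-modified time. The class `StagedClassFinder` and its `find` method are hypothetical names, and the sketch assumes that every regular file in the list is a jar (or other zip archive) and every other entry is a directory of class files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of the Dataflow SDK): locates copies of a class file.
final class StagedClassFinder {

    /** Maps each location of classFileName (e.g. "BatchPipeline.class") to its last-modified time. */
    static Map<String, Instant> find(List<String> stagedFiles, String classFileName) throws IOException {
        Map<String, Instant> found = new LinkedHashMap<>();
        for (String staged : stagedFiles) {
            Path path = Paths.get(staged);
            if (Files.isDirectory(path)) {
                // Directory of class files: walk it and match on the file name.
                try (Stream<Path> walk = Files.walk(path)) {
                    List<Path> matches = walk
                            .filter(p -> p.getFileName() != null
                                    && p.getFileName().toString().equals(classFileName))
                            .collect(Collectors.toList());
                    for (Path match : matches) {
                        found.put(match.toString(), Files.getLastModifiedTime(match).toInstant());
                    }
                }
            } else if (Files.isRegularFile(path)) {
                // Assumption: any regular file in the staging list is a jar/zip archive.
                try (JarFile jar = new JarFile(path.toFile())) {
                    for (JarEntry entry : Collections.list(jar.entries())) {
                        String name = entry.getName();
                        if (name.equals(classFileName) || name.endsWith("/" + classFileName)) {
                            // An entry without a timestamp reports -1, i.e. just before the epoch.
                            found.put(staged + "!/" + name, Instant.ofEpochMilli(entry.getTime()));
                        }
                    }
                }
            }
        }
        return found;
    }

    // Usage: java StagedClassFinder BatchPipeline.class <staged path> [<staged path> ...]
    public static void main(String[] args) throws IOException {
        List<String> staged = Arrays.asList(args).subList(1, args.length);
        find(staged, args[0]).forEach((where, when) -> System.out.println(when + "  " + where));
    }
}
```

The match is on the exact file name, so anonymous inner classes such as the DoFn in the question (compiled to a separate BatchPipeline$1.class) need their own lookup. A copy whose timestamp predates your latest change is a candidate for the stale artifact.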

Solution 1
1 Accepted 2017-01-05 17:12:57
