GCP Dataflow - read CSV file from Storage and write into BigQuery

I have a CSV file in Storage and I want to read it and write it into a BigQuery table. This is my CSV file, where the first line is the header:

GroupName,Groupcode,GroupOwner,GroupCategoryID
System Administrators,sysadmin,13456,100
Independence High Teachers,HS Teachers,,101
John Glenn Middle Teachers,MS Teachers,13458,102
Liberty Elementary Teachers,Elem Teachers,13559,103
1st Grade Teachers,1stgrade,,104
2nd Grade Teachers,2nsgrade,13561,105
3rd Grade Teachers,3rdgrade,13562,106
Guidance Department,guidance,,107
Independence Math Teachers,HS Math,13660,108
Independence English Teachers,HS English,13661,109
John Glenn 8th Grade Teachers,8thgrade,,110
John Glenn 7th Grade Teachers,7thgrade,13452,111
Elementary Parents,Elem Parents,,112
Middle School Parents,MS Parents,18001,113
High School Parents,HS Parents,18002,114

This is my code:

    public class StorgeBq {

        public static class StringToRowConverter extends DoFn<String, TableRow> {

            private String[] columnNames;

            private boolean isFirstRow = true;

            @ProcessElement
            public void processElement(ProcessContext c) {
                TableRow row = new TableRow();

                String[] parts = c.element().split(",");

                if (isFirstRow) {
                    columnNames = Arrays.copyOf(parts, parts.length);
                    isFirstRow = false;
                } else {
                    for (int i = 0; i < parts.length; i++) {
                        row.set(columnNames[i], parts[i]);
                    }
                    c.output(row);
                }
            }
        }

        public static void main(String[] args) {

            DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
                    .as(DataflowPipelineOptions.class);
            options.setZone("europe-west1-c");
            options.setProject("mydata-dev");
            options.setRunner(DataflowRunner.class);
            Pipeline p = Pipeline.create(options);

            p.apply("ReadLines", TextIO.read().from("gs://mydata3-dataflow/C2ImportGroupsSample.csv"))
            .apply("ConverToBqRow",ParDo.of(new StringToRowConverter()))
            .apply("WriteToBq", BigQueryIO.<TableRow>writeTableRows()
                    .to("mydata-dev:DF_TEST.dataflow_table")
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER));
            p.run().waitUntilFinish();
        }

    }

There are some problems: 1) When the job starts executing, I see a process called "DropInputs" which I have not defined in my code, and it starts running before all the other tasks. Why?

2) Why doesn't the pipeline start with the first task, "ReadLines"? 3) In the log file, I see that the "WriteToBq" task tries to treat one of the data values as a field; for example, "1st Grade Teachers" is not a field but a value for "GroupName":

"message" : "JSON parsing error in row starting at position 0: No such field: 1st Grade Teachers.",

You have a couple of problems in your code. But first of all, regarding the "DropInputs" stage - you can safely ignore it. It was the result of this bug report. I still don't understand why it needs to be displayed (it's confusing a lot of our users too), and I'd love for a Googler to chime in on that. In my opinion it's just clutter.

Right, to your code now:

  1. You are assuming that the first row read will be your header. This is an incorrect assumption. Dataflow reads in parallel, so the header row may arrive at any time. Instead of using a boolean flag, check the string value itself on every element in your ParDo, e.g. if (c.element().contains("GroupName")) { ... } (see the first sketch after this list).
  2. You are missing the BigQuery table schema: you need to add withSchema(..) to your BigQuery sink (see the second sketch below). Here's an example from one of my public pipelines.
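
For point 1, here is a minimal sketch of a header-safe converter, assuming the Beam Java SDK. The hardcoded COLUMNS array and the startsWith("GroupName,") check are illustrative assumptions based on the sample CSV, not code from the original pipeline:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.transforms.DoFn;

    public class StringToRowConverter extends DoFn<String, TableRow> {

        // Column order must match the CSV header:
        // GroupName,Groupcode,GroupOwner,GroupCategoryID
        private static final String[] COLUMNS =
                {"GroupName", "Groupcode", "GroupOwner", "GroupCategoryID"};

        @ProcessElement
        public void processElement(ProcessContext c) {
            String line = c.element();

            // Skip the header wherever it shows up: lines are read in
            // parallel, so an isFirstRow flag is not reliable.
            if (line.startsWith("GroupName,")) {
                return;
            }

            // limit = -1 keeps trailing empty fields, so a row like
            // "Guidance Department,guidance,,107" still splits into 4 parts.
            String[] parts = line.split(",", -1);
            TableRow row = new TableRow();
            for (int i = 0; i < COLUMNS.length; i++) {
                row.set(COLUMNS[i], parts[i]);
            }
            c.output(row);
        }
    }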
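
For point 2, a sketch of building a TableSchema and attaching it with withSchema(..). All fields are typed as STRING here for simplicity; that is an assumption, and the real table may well use INTEGER for GroupOwner and GroupCategoryID:

    import java.util.ArrayList;
    import java.util.List;

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableSchema;

    // ... inside main(), before building the pipeline:
    List<TableFieldSchema> fields = new ArrayList<>();
    for (String name : new String[]
            {"GroupName", "Groupcode", "GroupOwner", "GroupCategoryID"}) {
        fields.add(new TableFieldSchema().setName(name).setType("STRING"));
    }
    TableSchema schema = new TableSchema().setFields(fields);

    // ... then pass the schema to the sink:
    .apply("WriteToBq", BigQueryIO.writeTableRows()
            .to("mydata-dev:DF_TEST.dataflow_table")
            .withSchema(schema)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER));

The field names in the schema must match the keys set on each TableRow, which is exactly why the header mix-up produced the "No such field: 1st Grade Teachers" error.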
