I have a CSV file in Cloud Storage and I want to read it and write it into a BigQuery table. This is my CSV file, where the first line is the header:
GroupName,Groupcode,GroupOwner,GroupCategoryID
System Administrators,sysadmin,13456,100
Independence High Teachers,HS Teachers,,101
John Glenn Middle Teachers,MS Teachers,13458,102
Liberty Elementary Teachers,Elem Teachers,13559,103
1st Grade Teachers,1stgrade,,104
2nd Grade Teachers,2nsgrade,13561,105
3rd Grade Teachers,3rdgrade,13562,106
Guidance Department,guidance,,107
Independence Math Teachers,HS Math,13660,108
Independence English Teachers,HS English,13661,109
John Glenn 8th Grade Teachers,8thgrade,,110
John Glenn 7th Grade Teachers,7thgrade,13452,111
Elementary Parents,Elem Parents,,112
Middle School Parents,MS Parents,18001,113
High School Parents,HS Parents,18002,114
This is my code:
public class StorgeBq {

    public static class StringToRowConverter extends DoFn<String, TableRow> {
        private String[] columnNames;
        private boolean isFirstRow = true;

        @ProcessElement
        public void processElement(ProcessContext c) {
            TableRow row = new TableRow();
            String[] parts = c.element().split(",");
            if (isFirstRow) {
                columnNames = Arrays.copyOf(parts, parts.length);
                isFirstRow = false;
            } else {
                for (int i = 0; i < parts.length; i++) {
                    row.set(columnNames[i], parts[i]);
                }
                c.output(row);
            }
        }
    }

    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
                .as(DataflowPipelineOptions.class);
        options.setZone("europe-west1-c");
        options.setProject("mydata-dev");
        options.setRunner(DataflowRunner.class);
        Pipeline p = Pipeline.create(options);
        p.apply("ReadLines", TextIO.read().from("gs://mydata3-dataflow/C2ImportGroupsSample.csv"))
         .apply("ConverToBqRow", ParDo.of(new StringToRowConverter()))
         .apply("WriteToBq", BigQueryIO.<TableRow>writeTableRows()
                 .to("mydata-dev:DF_TEST.dataflow_table")
                 .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                 .withCreateDisposition(CreateDisposition.CREATE_NEVER));
        p.run().waitUntilFinish();
    }
}
There are some problems:

1) When the job starts executing, I see a step called "DropInputs" which I have not defined in my code, and it starts running before all the other tasks. Why?

2) Why doesn't the pipeline start with the first task, "ReadLines"?

3) In the log file, I see that the task "WriteToBq" tries to use a piece of the data as a field name. For example, "1st Grade Teachers" is not a field but a value of "GroupName":

"message" : "JSON parsing error in row starting at position 0: No such field: 1st Grade Teachers.",
You have a couple of problems in your code. But first of all, regarding the "DropInputs" stage: you can safely ignore it. It was the result of this bug report. I still don't understand why it needs to be displayed (it's confusing a lot of our users too), and I'd love for a Googler to chime in on that. In my opinion it's just clutter.
Right, to your code now:

1. Don't use a boolean flag to detect the header row. A DoFn can be instantiated many times, across workers and bundles, so isFirstRow starts out true in each new instance and a data row such as "1st Grade Teachers,..." ends up being treated as the header. That is exactly what produces your "No such field: 1st Grade Teachers" error. Instead, check the string value itself each time in your ParDo, e.g. if (c.element().contains("GroupName")) then drop that element.
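The Beam-free core of that idea can be sketched as follows. This is a sketch under assumptions, not your exact DoFn: the column names are hard-coded (the file's header is known up front), so the conversion needs no per-instance state at all, and the helper name toRow is mine:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowConverter {
    // Known header columns, hard-coded so no mutable per-instance state is needed.
    private static final String[] COLUMNS =
            {"GroupName", "Groupcode", "GroupOwner", "GroupCategoryID"};

    // Returns null for the header line, otherwise a column -> value map.
    static Map<String, String> toRow(String line) {
        if (line.contains("GroupName")) {
            return null; // header row: drop it instead of tracking a boolean flag
        }
        String[] parts = line.split(",", -1); // -1 keeps trailing empty fields
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length && i < parts.length; i++) {
            row.put(COLUMNS[i], parts[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println(toRow("GroupName,Groupcode,GroupOwner,GroupCategoryID"));
        System.out.println(toRow("1st Grade Teachers,1stgrade,,104"));
    }
}
```

Because the decision depends only on the element's content, it gives the same result no matter which worker or DoFn instance processes the line.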
2. You need to add withSchema(..) to your BigQuery sink, so that BigQuery knows the shape of the rows you are writing. Here's an example from one of my public pipelines.
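As a sketch of how that schema could look for this CSV (an assumption on my part: all four columns are declared STRING, though GroupOwner and GroupCategoryID could equally be INTEGER), using TableSchema and TableFieldSchema from com.google.api.services.bigquery.model:

```java
// Fragment for main(), before building the sink.
TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("GroupName").setType("STRING"),
        new TableFieldSchema().setName("Groupcode").setType("STRING"),
        new TableFieldSchema().setName("GroupOwner").setType("STRING"),
        new TableFieldSchema().setName("GroupCategoryID").setType("STRING")));

p.apply("ReadLines", TextIO.read().from("gs://mydata3-dataflow/C2ImportGroupsSample.csv"))
 .apply("ConverToBqRow", ParDo.of(new StringToRowConverter()))
 .apply("WriteToBq", BigQueryIO.<TableRow>writeTableRows()
         .to("mydata-dev:DF_TEST.dataflow_table")
         .withSchema(schema)
         .withWriteDisposition(WriteDisposition.WRITE_APPEND)
         .withCreateDisposition(CreateDisposition.CREATE_NEVER));
```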