Write repeated Strings to BigQuery using Apache Beam

I have a data stream containing Strings that look like JSONArrays. I want to parse those Strings and write them to a BigQuery table using Apache Beam, but I am getting an error while writing repeated Strings.

Here's how I convert my string to a TableRow:

    String dataString = "[{\"EMAIL\": [\"zog@yahoo.com\"]}]";

    JSONArray jsonArray = new JSONArray(dataString);
    TableRow tableRow = new TableRow();

    for (int i = 0; i < jsonArray.length(); i++) {
      JSONArray emailArray = new JSONArray(jsonArray.getJSONObject(i).get("EMAIL").toString());

      tableRow.set("EMAIL", emailArray); //Results in error
    }

Here's what my BigQuery schema looks like:

[
  {
    "name": "EMAIL",
    "type": "STRING",
    "mode": "REPEATED"
  }
]

I have managed to write a similar repeated String to a BigQuery table using Python, but I am unable to do it using Apache Beam. I suppose I am not saving the right key-value pair in the TableRow. The error I am getting now is:

java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"email","message":"This field is not a record.","reason":"invalid"}],"index":0}]

I need help with saving a similar repeated String to BigQuery without creating a record, and would appreciate any advice or suggestions. Thanks in advance.

It seems you want to create

  1. one row with a concatenated String of email addresses, or
  2. a row per email, or
  3. one row with a repeated field.

Note that it seems your ValidFrom field is of type STRING, not a repeated field, unless it is wrapped in a repeated field in a hierarchical schema.

In the example code you provided, you are creating a JSONArray and putting it into the STRING field, which I think causes issues because the types are incompatible. If you want to keep it as a plain STRING field, you can use Solution 1 below.

Also make sure that the name of your column in BigQuery matches the one in your code: I see you use both ValidFrom and EMAIL (this might be a mistake in your posted code, though).

Solution 1: One row with String field

If you want to add one row with a concatenated String field in BigQuery, you can use the following:

import com.google.api.services.bigquery.model.TableRow;

// Initialize your final row
TableRow tableRow = new TableRow();

// Find email addresses
String[] emails = ... // your extraction logic

// Build a concatenated string of emails, separated by ";"
String allEmails = String.join(";", emails);

// Add the string field to the row (Java string literals need double quotes)
tableRow.set("EMAILS", allEmails);
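
As a self-contained illustration of Solution 1, the sketch below joins the addresses into one string and splits them again when reading the value back. It uses a plain `Map<String, Object>` as a stand-in for `TableRow` so that it runs without the BigQuery client library on the classpath. The class name `ConcatenatedEmails`, its helper methods, and the sample addresses are assumptions made for this example and are not part of the original code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; a Map stands in for TableRow.
final class ConcatenatedEmails {
    static final String DELIMITER = ";";

    // Builds one row whose EMAILS column holds all addresses in a single string.
    static Map<String, Object> toRow(String[] emails) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("EMAILS", String.join(DELIMITER, emails));
        return row;
    }

    // Recovers the individual addresses from the stored string.
    static String[] fromRow(Map<String, Object> row) {
        return ((String) row.get("EMAILS")).split(DELIMITER);
    }

    public static void main(String[] args) {
        Map<String, Object> row = toRow(new String[] {"zog@yahoo.com", "zog@example.com"});
        System.out.println(row.get("EMAILS"));
    }
}
```

The round trip only works if the delimiter never occurs inside an address, which is the main trade-off of storing the list as a single string.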

Solution 2: Multiple rows with String field

If you want to insert multiple rows, you can create multiple table rows:

import com.google.api.services.bigquery.model.TableRow;

// Find email addresses
String[] emails = ... // your extraction logic

// Build a row per email
for (String email : emails) {
    // Initialize your final row
    TableRow tableRow = new TableRow();
    tableRow.set("EMAIL", email);

    // TODO: do something with the row (add to list, or ...)
}
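
One way to handle the TODO in the loop is to collect the rows into a list. The following is a minimal sketch of that idea, not a definitive implementation: a `Map<String, Object>` stands in for `TableRow` so the code runs without the BigQuery client library, and the class name `RowPerEmail`, the method `toRows`, and the sample addresses are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; a Map stands in for TableRow.
final class RowPerEmail {
    // Creates one single-column row for each address, preserving input order.
    static List<Map<String, Object>> toRows(String[] emails) {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (String email : emails) {
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("EMAIL", email);
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (Map<String, Object> row : toRows(new String[] {"zog@yahoo.com", "zog@example.com"})) {
            System.out.println(row);
        }
    }
}
```

Inside a Beam `DoFn` you would more likely emit each row as soon as it is built rather than collect a list first, but the row-building logic stays the same.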

Solution 3: One row with REPEATED field

If you want to add one row with a REPEATED STRING field in BigQuery, you can use the following:

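import com.google.api.services.bigquery.model.TableRow;
import java.util.ArrayList;
import java.util.List;

// Initialize your final row
TableRow tableRow = new TableRow();

// Find email addresses
String[] emails = ... // your extraction logic

// Build the repeated field as a plain Java list of Strings
List<String> emailCells = new ArrayList<>();
for (String email : emails) {
    emailCells.add(email);
}

// Add the repeated field to the row; the key must match the column name
// in your schema (EMAIL in the schema posted in the question)
tableRow.set("EMAIL", emailCells);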
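
The point of Solution 3 is that the value stored under the column name is an ordinary `java.util.List` of Strings rather than a `JSONArray`. The sketch below shows that shape in isolation. It is only an illustration under stated assumptions: a `Map<String, Object>` stands in for `TableRow` so the code runs without the BigQuery client library, the class name `RepeatedEmails` and the sample addresses are hypothetical, and the column name `EMAIL` is taken from the schema posted in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; a Map stands in for TableRow.
final class RepeatedEmails {
    // Builds one row whose EMAIL column holds a list with one String per address.
    static Map<String, Object> toRow(String[] emails) {
        // Copy the array into a mutable list; every element is a plain String.
        List<String> emailCells = new ArrayList<>(Arrays.asList(emails));
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("EMAIL", emailCells);
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = toRow(new String[] {"zog@yahoo.com", "zog@example.com"});
        System.out.println(row);
    }
}
```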

If this is not what you're aiming for, please provide some more details.
