
Google Cloud - Pub/Sub into Dataflow

I'm calling Pub/Sub via a REST request. I'm trying to publish columnised data to a Pub/Sub topic, which then flows through Dataflow and finally into BigQuery, where a table has been defined.

This is the layout of the JSON data:

[
  {
    "age": "58",
    "job": "management",
    "marital": "married",
    "education": "tertiary",
    "default": "no",
    "balance": "2143",
    "housing": "yes",
    "loan": "no",
    "contact": "unknown",
    "day": "5",
    "month": "may",
    "duration": "261",
    "campaign": "1",
    "pdays": "-1",
    "previous": "0",
    "poutcome": "unknown",
    "y": "no"
  }
]

Now, to form a request body that Pub/Sub will recognise, this data needs to go into the following JSON structure:

{
    "messages": [{
        "attributes": {
            "key": "iana.org/language_tag",
            "value": "en"
        },
        "data": "%DATA%"
    }]
}

Now, the Pub/Sub REST reference states that the "data" field needs to be Base64-encoded, so that is what I do. The final JSON is as follows (%DATA% is replaced with the Base64 encoding of the original message data):

{
    "messages": [{
        "attributes": {
            "key": "iana.org/language_tag",
            "value": "en"
        },
        "data": "Ww0KICB7DQogICAgImFnZSI6ICI1OCIsDQogICAgImpvYiI6ICJtYW5hZ2VtZW50IiwNCiAgICAibWFyaXRhbCI6ICJtYXJyaWVkIiwNCiAgICAiZWR1Y2F0aW9uIjogInRlcnRpYXJ5IiwNCiAgICAiZGVmYXVsdCI6ICJubyIsDQogICAgImJhbGFuY2UiOiAiMjE0MyIsDQogICAgImhvdXNpbmciOiAieWVzIiwNCiAgICAibG9hbiI6ICJubyIsDQogICAgImNvbnRhY3QiOiAidW5rbm93biIsDQogICAgImRheSI6ICI1IiwNCiAgICAibW9udGgiOiAibWF5IiwNCiAgICAiZHVyYXRpb24iOiAiMjYxIiwNCiAgICAiY2FtcGFpZ24iOiAiMSIsDQogICAgInBkYXlzIjogIi0xIiwNCiAgICAicHJldmlvdXMiOiAiMCIsDQogICAgInBvdXRjb21lIjogInVua25vd24iLA0KICAgICJ5IjogIm5vIg0KICAgIH0NCl0="
    }]
}
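
For reference, here is a minimal sketch of how that request can be built and sent, assuming a hypothetical project my-project, a hypothetical topic my-topic, and an OAuth access token in the ACCESS_TOKEN environment variable (the requests library stands in for any HTTP client):

import base64
import json
import os

import requests

# Hypothetical names -- substitute your own project, topic, and token.
PROJECT = "my-project"
TOPIC = "my-topic"
ACCESS_TOKEN = os.environ["ACCESS_TOKEN"]

# A trimmed version of the row shown above.
payload = [{"age": "58", "job": "management", "y": "no"}]

body = {
    "messages": [{
        "attributes": {
            "key": "iana.org/language_tag",
            "value": "en",
        },
        # Pub/Sub expects the "data" field as a Base64-encoded string.
        "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii"),
    }]
}

resp = requests.post(
    "https://pubsub.googleapis.com/v1/projects/%s/topics/%s:publish" % (PROJECT, TOPIC),
    headers={"Authorization": "Bearer " + ACCESS_TOKEN},
    json=body,
)
resp.raise_for_status()
print(resp.json())  # the response lists the assigned messageIds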

Pub/Sub accepts this data and passes it on to Dataflow, but this is where everything breaks. Dataflow tries to deserialize the message, which fails with the following error:

(efdf538fc01f50b0): java.lang.RuntimeException: Unable to parse input
        com.google.cloud.teleport.templates.common.BigQueryConverters$JsonToTableRow$1.apply(BigQueryConverters.java:58)
        com.google.cloud.teleport.templates.common.BigQueryConverters$JsonToTableRow$1.apply(BigQueryConverters.java:47)
        org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:122)
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Can not deserialize instance of com.google.api.services.bigquery.model.TableRow out of START_ARRAY token
 at [Source: [{"age":"32","job":"\"admin.\"","marital":"\"single\"","education":"\"secondary\"","default":"\"no\"","balance":"5","housing":"\"yes\"","loan":"\"no\"","contact":"\"unknown\"","day":"12","month":"\"may\"","duration":"593","campaign":"2","pdays":"-1","previous":"0","poutcome":"\"unknown\"","y":"\"no\""}]; line: 1, column: 1]

I think it has something to do with how the "data" field is being formatted, but I've tried other methods and I just can't get anything to work.

After further experimentation, the issue was indeed how the JSON was formatted. The template's JsonToTableRow converter expects each message to contain a single JSON object, which is why Jackson chokes on the opening [ (the START_ARRAY token in the error above). By removing the opening [ and closing ], Dataflow was able to recognise the data and put it into BigQuery, as the sketch below shows.
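
A minimal sketch of that fix, assuming the google-cloud-pubsub Python client and the same hypothetical project and topic names (the client library handles the Base64 encoding itself):

import json

from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # hypothetical names

# A trimmed version of the row shown above.
rows = [{"age": "58", "job": "management", "y": "no"}]

# Publish each row as a standalone JSON object, without the surrounding
# [ and ], which is the shape JsonToTableRow can parse.
for row in rows:
    future = publisher.publish(topic_path, json.dumps(row).encode("utf-8"))
    print(future.result())  # blocks until Pub/Sub returns the message ID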

Try serializing the JSON data via ProtoBuf, deserializing it after reading it in the Beam pipeline (assuming you are using Apache Beam), and then encoding it as a byte string before writing it to BigQuery.
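
A rough sketch of that pipeline shape in Apache Beam's Python SDK, with hypothetical project, topic, and table names; for brevity it deserializes plain JSON rather than ProtoBuf, but a ProtoBuf variant would only swap out the Deserialize step:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # streaming=True because Pub/Sub is an unbounded source.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/my-topic")  # hypothetical topic
            # Messages arrive as bytes: one JSON object per message.
            # A ProtoBuf variant would call MyMessage.FromString(msg) here
            # instead (MyMessage being your generated message class).
            | "Deserialize" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            # Dicts whose keys match the column names map straight onto the table.
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.my_table",  # hypothetical table
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )

if __name__ == "__main__":
    run()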
