
Google Cloud - Pub/Sub into Dataflow

I'm calling Pub/Sub via a REST request. I'm trying to publish columnised data to a Pub/Sub topic, which then flows through Dataflow and finally into BigQuery, where a table has been defined.

This is the layout of the JSON data:

[
  {
    "age": "58",
    "job": "management",
    "marital": "married",
    "education": "tertiary",
    "default": "no",
    "balance": "2143",
    "housing": "yes",
    "loan": "no",
    "contact": "unknown",
    "day": "5",
    "month": "may",
    "duration": "261",
    "campaign": "1",
    "pdays": "-1",
    "previous": "0",
    "poutcome": "unknown",
    "y": "no"
  }
]

Now, to form a request body that Pub/Sub will recognise, this data needs to go into the following JSON structure:

{
    "messages": [{
        "attributes": {
            "key": "iana.org/language_tag",
            "value": "en"
        },
        "data": "%DATA%"
    }]
}

Now, the Pub/Sub REST reference states that the "data" field needs to be Base64-encoded, so that is what I do. The final JSON is as follows (%DATA% is replaced with the Base64 encoding of the original message data):

{
    "messages": [{
        "attributes": {
            "key": "iana.org/language_tag",
            "value": "en"
        },
        "data": "Ww0KICB7DQogICAgImFnZSI6ICI1OCIsDQogICAgImpvYiI6ICJtYW5hZ2VtZW50IiwNCiAgICAibWFyaXRhbCI6ICJtYXJyaWVkIiwNCiAgICAiZWR1Y2F0aW9uIjogInRlcnRpYXJ5IiwNCiAgICAiZGVmYXVsdCI6ICJubyIsDQogICAgImJhbGFuY2UiOiAiMjE0MyIsDQogICAgImhvdXNpbmciOiAieWVzIiwNCiAgICAibG9hbiI6ICJubyIsDQogICAgImNvbnRhY3QiOiAidW5rbm93biIsDQogICAgImRheSI6ICI1IiwNCiAgICAibW9udGgiOiAibWF5IiwNCiAgICAiZHVyYXRpb24iOiAiMjYxIiwNCiAgICAiY2FtcGFpZ24iOiAiMSIsDQogICAgInBkYXlzIjogIi0xIiwNCiAgICAicHJldmlvdXMiOiAiMCIsDQogICAgInBvdXRjb21lIjogInVua25vd24iLA0KICAgICJ5IjogIm5vIg0KICAgIH0NCl0="
    }]
}
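
For reference, here is a minimal sketch of how that request can be built and sent, assuming a hypothetical project my-project, a hypothetical topic my-topic, and an OAuth access token in the ACCESS_TOKEN environment variable (the requests library stands in for any HTTP client):

import base64
import json
import os

import requests

# Hypothetical names -- substitute your own project, topic, and token.
PROJECT = "my-project"
TOPIC = "my-topic"
ACCESS_TOKEN = os.environ["ACCESS_TOKEN"]

# A trimmed version of the row shown above.
payload = [{"age": "58", "job": "management", "y": "no"}]

body = {
    "messages": [{
        "attributes": {
            "key": "iana.org/language_tag",
            "value": "en",
        },
        # Pub/Sub expects the "data" field as a Base64-encoded string.
        "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii"),
    }]
}

resp = requests.post(
    "https://pubsub.googleapis.com/v1/projects/%s/topics/%s:publish" % (PROJECT, TOPIC),
    headers={"Authorization": "Bearer " + ACCESS_TOKEN},
    json=body,
)
resp.raise_for_status()
print(resp.json())  # the response lists the assigned messageIds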

Pub/Sub accepts this data and passes it on to Dataflow, but this is where everything breaks. Dataflow tries to deserialize the message, which fails with the following error:

(efdf538fc01f50b0): java.lang.RuntimeException: Unable to parse input
        com.google.cloud.teleport.templates.common.BigQueryConverters$JsonToTableRow$1.apply(BigQueryConverters.java:58)
        com.google.cloud.teleport.templates.common.BigQueryConverters$JsonToTableRow$1.apply(BigQueryConverters.java:47)
        org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:122)
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Can not deserialize instance of com.google.api.services.bigquery.model.TableRow out of START_ARRAY token
 at [Source: [{"age":"32","job":"\"admin.\"","marital":"\"single\"","education":"\"secondary\"","default":"\"no\"","balance":"5","housing":"\"yes\"","loan":"\"no\"","contact":"\"unknown\"","day":"12","month":"\"may\"","duration":"593","campaign":"2","pdays":"-1","previous":"0","poutcome":"\"unknown\"","y":"\"no\""}]; line: 1, column: 1]

I think it has something to do with how the "data" field is being formatted, but I've tried other methods and I just can't get anything to work.

After further experimentation, the issue was indeed how the JSON was formatted. The template's JsonToTableRow converter expects each message to contain a single JSON object, which is why Jackson chokes on the opening [ (the START_ARRAY token in the error above). By removing the opening [ and closing ], Dataflow was able to recognise the data and put it into BigQuery, as the sketch below shows.
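
A minimal sketch of that fix, assuming the google-cloud-pubsub Python client and the same hypothetical project and topic names (the client library handles the Base64 encoding itself):

import json

from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # hypothetical names

# A trimmed version of the row shown above.
rows = [{"age": "58", "job": "management", "y": "no"}]

# Publish each row as a standalone JSON object, without the surrounding
# [ and ], which is the shape JsonToTableRow can parse.
for row in rows:
    future = publisher.publish(topic_path, json.dumps(row).encode("utf-8"))
    print(future.result())  # blocks until Pub/Sub returns the message ID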

Try serializing the JSON data via ProtoBuf, deserializing it after reading it in the Beam pipeline (assuming you are using Apache Beam), and then encoding it as a byte string before writing it to BigQuery.
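
A rough sketch of that pipeline shape in Apache Beam's Python SDK, with hypothetical project, topic, and table names; for brevity it deserializes plain JSON rather than ProtoBuf, but a ProtoBuf variant would only swap out the Deserialize step:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # streaming=True because Pub/Sub is an unbounded source.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/my-topic")  # hypothetical topic
            # Messages arrive as bytes: one JSON object per message.
            # A ProtoBuf variant would call MyMessage.FromString(msg) here
            # instead (MyMessage being your generated message class).
            | "Deserialize" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            # Dicts whose keys match the column names map straight onto the table.
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.my_table",  # hypothetical table
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )

if __name__ == "__main__":
    run()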
