
Google Cloud Data Fusion -- building pipeline from REST API endpoint source

Attempting to build a pipeline to read from a 3rd party REST API endpoint data source.

I am using the HTTP (version 1.2.0) plugin found in the Hub.

The request URL is: https://api.example.io/v2/somedata?return_count=false

A sample of the response body:

{
  "paging": {
    "token": "12456789",
    "next": "https://api.example.io/v2/somedata?return_count=false&__paging_token=123456789"
  },
  "data": [
    {
      "cID": "aerrfaerrf",
      "first": true,
      "_id": "aerfaerrfaerrf",
      "action": "aerrfaerrf",
      "time": "1970-10-09T14:48:29+0000",
      "email": "example@aol.com"
    },
    {...}
  ]
}

The main error in the logs is:

java.lang.NullPointerException: null
    at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.getNextPage(BaseHttpPaginationIterator.java:118) ~[1580429892615-0/:na]
    at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.ensurePageIterable(BaseHttpPaginationIterator.java:161) ~[1580429892615-0/:na]
    at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.hasNext(BaseHttpPaginationIterator.java:203) ~[1580429892615-0/:na]
    at io.cdap.plugin.http.source.batch.HttpRecordReader.nextKeyValue(HttpRecordReader.java:60) ~[1580429892615-0/:na]
    at io.cdap.cdap.etl.batch.preview.LimitingRecordReader.nextKeyValue(LimitingRecordReader.java:51) ~[cdap-etl-core-6.1.1.jar:na]
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232]

Possible issues

After trying to troubleshoot this for a while, I'm thinking the issue might be with:

Pagination

  • The Data Fusion HTTP plugin has a lot of methods to deal with pagination
    • Based on the response body above, it seems like the best option for Pagination Type is Link in Response Body
    • For the required Next Page JSON/XML Field Path parameter, I've tried $.paging.next and paging/next . Neither works. (A rough sketch of the pagination behaviour I expect is shown after this list.)
    • I have verified that the link in /paging/next works when opened in Chrome
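
For reference, this is roughly the pagination behaviour I expect Link in Response Body to perform -- a minimal Python sketch against the sample body above, not anything taken from the plugin itself (auth omitted here for brevity; see the Authentication section):

import requests

url = "https://api.example.io/v2/somedata?return_count=false"

# Follow the link in the response body until there is no next page.
while url:
    body = requests.get(url).json()   # auth omitted; see the Authentication section

    for record in body.get("data", []):
        print(record)                 # each element should look like the sample record above

    # This is the value the Next Page JSON/XML Field Path ($.paging.next) should resolve to:
    url = body.get("paging", {}).get("next")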

Authentication

  • When simply trying to view the response URL in Chrome, a prompt pops up asking for a username and password
    • Only the API key needs to be entered as the username to get past this prompt in Chrome
    • To do this in the Data Fusion HTTP plugin, the API key is used as the Username in the Basic Authentication section (a quick check of this is sketched after this list)
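
A quick way I can verify the key outside of Data Fusion (a small Python sketch; passing the API key as the username with an empty password is my assumption based on the Chrome prompt):

import requests

API_KEY = "..."  # placeholder for the real 3rd party API key

resp = requests.get(
    "https://api.example.io/v2/somedata?return_count=false",
    auth=(API_KEY, ""),  # API key as the username, empty password
)
print(resp.status_code)               # expecting 200 if the key is accepted
print(resp.json()["paging"]["next"])  # should print the next-page link from the sample body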

Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?

In answer to:

Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?

This is not the optimal way to achieve this. The best approach would be to ingest the data into Pub/Sub (see the Service APIs Overview) and then use Pub/Sub as the source for your pipeline. This provides a simple and reliable staging location for your data on its way to processing, storage, and analysis; see the documentation for the Pub/Sub API. In order to use this in conjunction with Dataflow, the steps to follow are in the official documentation: Using Pub/Sub with Dataflow.
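
A rough sketch of that ingestion step -- pull from the REST endpoint and publish each record to a Pub/Sub topic, which the pipeline then reads from (the project ID, topic name and auth scheme below are placeholders, not taken from the question):

import json
import requests
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"      # placeholder
TOPIC_ID = "somedata-ingest"   # placeholder
API_KEY = "..."                # placeholder for the 3rd party API key

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

url = "https://api.example.io/v2/somedata?return_count=false"
while url:
    body = requests.get(url, auth=(API_KEY, "")).json()

    # Publish each record as a separate message; the pipeline consumes them from Pub/Sub.
    for record in body.get("data", []):
        publisher.publish(topic_path, json.dumps(record).encode("utf-8")).result()

    url = body.get("paging", {}).get("next")  # follow pagination until exhausted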

I think your problem is in the data format that you receive. The exception:

java.lang.NullPointerException: null

occurs when you do not specify a correct output schema (no schema in this case, I believe).

Solution 1

To solve it, try configuring the HTTP Data Fusion plugin to:

  • Receive format: Text
  • Output Schema: name: user, Type: String

This should work to obtain the response from the API in string format. Once that is done, use a JSONParser to convert the string into a table-like object.
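
Conceptually, the two stages would look like this (a plain Python sketch of what the HTTP source and the downstream JSONParser produce, not actual Data Fusion code):

import json

# Stage 1 -- HTTP source with format "Text" and an output schema of a single
# string field called "user": the whole response body arrives as one string.
raw_record = {
    "user": '{"paging": {"next": null}, "data": [{"cID": "aerrfaerrf", "email": "example@aol.com"}]}'
}

# Stage 2 -- the JSONParser step turns that string back into structured rows.
parsed = json.loads(raw_record["user"])
rows = parsed["data"]  # one row per element of the "data" array
print(rows[0]["cID"], rows[0]["email"])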

Solution 2

Configure the HTTP Data Fusion plugin to:

  • Receive format: JSON
  • JSON/XML Result Path: data
  • JSON/XML Fields Mapping: include the fields you presented (see the attached screenshot and the schema sketch below)

[Screenshot: HTTP plugin configuration]
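
For completeness, the output schema for this configuration would need to declare the fields from your sample record, roughly like this (Avro-style schema JSON as Data Fusion typically displays it; the nullable types are an assumption):

{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    {"name": "cID",    "type": ["string", "null"]},
    {"name": "first",  "type": ["boolean", "null"]},
    {"name": "_id",    "type": ["string", "null"]},
    {"name": "action", "type": ["string", "null"]},
    {"name": "time",   "type": ["string", "null"]},
    {"name": "email",  "type": ["string", "null"]}
  ]
}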
