Google Cloud Data Fusion -- building pipeline from REST API endpoint source
Attempting to build a pipeline to read from a 3rd-party REST API endpoint data source.
I am using the HTTP plugin (version 1.2.0) found in the Hub.
The request URL is: https://api.example.io/v2/somedata?return_count=false
A sample of the response body:
{
"paging": {
"token": "12456789",
"next": "https://api.example.io/v2/somedata?return_count=false&__paging_token=123456789"
},
"data": [
{
"cID": "aerrfaerrf",
"first": true,
"_id": "aerfaerrfaerrf",
"action": "aerrfaerrf",
"time": "1970-10-09T14:48:29+0000",
"email": "example@aol.com"
},
{...}
]
}
The main error in the logs is:
java.lang.NullPointerException: null
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.getNextPage(BaseHttpPaginationIterator.java:118) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.ensurePageIterable(BaseHttpPaginationIterator.java:161) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.hasNext(BaseHttpPaginationIterator.java:203) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.batch.HttpRecordReader.nextKeyValue(HttpRecordReader.java:60) ~[1580429892615-0/:na]
at io.cdap.cdap.etl.batch.preview.LimitingRecordReader.nextKeyValue(LimitingRecordReader.java:51) ~[cdap-etl-core-6.1.1.jar:na]
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232]
After trying to troubleshoot this for a while, I'm thinking the issue might be with the pagination settings. The pagination type is set to Link in Response Body, and for the next-page field path I have tried both $.paging.next and paging/next. Neither works, even though the link in paging/next opens fine in Chrome.
Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?
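For reference, my understanding of what Link in Response Body pagination is supposed to do, as a minimal Python sketch (the fetch_json helper and the two fake pages below are stand-ins for the real API, not part of the plugin):

```python
def fetch_all_records(first_url, fetch_json):
    """Follow paging.next links until no further page is advertised,
    collecting every element of the top-level "data" array."""
    records = []
    url = first_url
    while url:
        page = fetch_json(url)           # parsed JSON dict for one page
        records.extend(page.get("data", []))
        paging = page.get("paging") or {}
        url = paging.get("next")         # missing/None ends the loop
    return records

# Example with two fake pages instead of a live API:
pages = {
    "https://api.example.io/v2/somedata?return_count=false": {
        "paging": {"token": "12456789",
                   "next": "https://api.example.io/v2/somedata?return_count=false&__paging_token=123456789"},
        "data": [{"_id": "aerfaerrfaerrf", "email": "example@aol.com"}],
    },
    "https://api.example.io/v2/somedata?return_count=false&__paging_token=123456789": {
        "paging": {},
        "data": [{"_id": "second", "email": "second@aol.com"}],
    },
}
records = fetch_all_records(
    "https://api.example.io/v2/somedata?return_count=false", pages.__getitem__)
print(len(records))  # 2
```

The NullPointerException at BaseHttpPaginationIterator.getNextPage looks consistent with the plugin failing at the step where this sketch resolves paging.next.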
In answer to:
"Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?"
This is not the optimal way to achieve this. The best way would be to ingest the data from the service APIs (see the Service APIs Overview) into Pub/Sub, and then use Pub/Sub as the source for your pipeline. This provides a simple and reliable staging location for your data on its way to processing, storage, and analysis; see the documentation for the Pub/Sub API. To use this in conjunction with Dataflow, the steps to follow are in the official documentation: Using Pub/Sub with Dataflow.
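A minimal sketch of that staging step, assuming the google-cloud-pubsub client library and an existing topic (the project and topic names here are placeholders, not from the question): each record in the API response's data array is published as its own JSON message.

```python
import json

def records_to_messages(response_body):
    """Turn one API response (as a dict) into Pub/Sub payloads:
    one JSON-encoded bytes object per element of the "data" array."""
    return [json.dumps(rec).encode("utf-8")
            for rec in response_body.get("data", [])]

def publish_response(response_body, project_id="my-project", topic_id="api-staging"):
    # Requires: pip install google-cloud-pubsub, plus GCP credentials.
    from google.cloud import pubsub_v1
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    for payload in records_to_messages(response_body):
        publisher.publish(topic_path, data=payload)  # returns a future per message

# Local check of the payload construction only (no network needed):
sample = {"data": [{"_id": "aerfaerrfaerrf", "email": "example@aol.com"}]}
print(records_to_messages(sample)[0])
```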
I think your problem is in the data format that you receive. The exception:
java.lang.NullPointerException: null
occurs when you do not specify a correct output schema (no schema at all in this case, I believe).
Solution 1
To solve it, try configuring the HTTP Data Fusion plugin to return the response as a single string (for example, Format: text with an output schema of one string field). This should work to obtain the response from the API in string format. Once that is done, use a JSONParser to convert the string into a table-like object.
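What the JSONParser step amounts to can be sketched in plain Python (an illustration of the transformation, not the plugin's actual code): the single string field coming out of the HTTP source is parsed, and each element of the data array becomes one structured row.

```python
import json

def parse_body(body: str):
    """Convert the raw response string into row dicts, one per record."""
    parsed = json.loads(body)
    return parsed.get("data", [])

# Body string shaped like the sample response above:
body = '''{
  "paging": {"token": "12456789", "next": null},
  "data": [
    {"cID": "aerrfaerrf", "first": true, "_id": "aerfaerrfaerrf",
     "action": "aerrfaerrf", "time": "1970-10-09T14:48:29+0000",
     "email": "example@aol.com"}
  ]
}'''
rows = parse_body(body)
print(rows[0]["email"])  # example@aol.com
```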
Solution 2
Configure the HTTP Data Fusion plugin to parse the JSON itself: set the format to json, point the result path at the data array, and define an output schema that lists each record field.
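For the output schema in this variant, based on the sample body in the question, a CDAP-style record schema for the data elements might look like the following (the field types are guesses from the sample values, and the record name is just the conventional default):

```json
{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    {"name": "cID", "type": "string"},
    {"name": "first", "type": "boolean"},
    {"name": "_id", "type": "string"},
    {"name": "action", "type": "string"},
    {"name": "time", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}
```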