
Google Cloud Data Fusion -- building pipeline from REST API endpoint source

Attempting to build a pipeline to read from a 3rd party REST API endpoint data source.

I am using the HTTP (version 1.2.0) plugin found in the Hub.

The request URL is: https://api.example.io/v2/somedata?return_count=false

A sample of the response body:

{
  "paging": {
    "token": "12456789",
    "next": "https://api.example.io/v2/somedata?return_count=false&__paging_token=123456789"
  },
  "data": [
    {
      "cID": "aerrfaerrf",
      "first": true,
      "_id": "aerfaerrfaerrf",
      "action": "aerrfaerrf",
      "time": "1970-10-09T14:48:29+0000",
      "email": "example@aol.com"
    },
    {...}
  ]
}

The main error in the logs is:

java.lang.NullPointerException: null
    at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.getNextPage(BaseHttpPaginationIterator.java:118) ~[1580429892615-0/:na]
    at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.ensurePageIterable(BaseHttpPaginationIterator.java:161) ~[1580429892615-0/:na]
    at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.hasNext(BaseHttpPaginationIterator.java:203) ~[1580429892615-0/:na]
    at io.cdap.plugin.http.source.batch.HttpRecordReader.nextKeyValue(HttpRecordReader.java:60) ~[1580429892615-0/:na]
    at io.cdap.cdap.etl.batch.preview.LimitingRecordReader.nextKeyValue(LimitingRecordReader.java:51) ~[cdap-etl-core-6.1.1.jar:na]
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232]

Possible issues

After trying to troubleshoot this for a while, I'm thinking the issue might be with:

Pagination

  • The Data Fusion HTTP plugin has a lot of methods to deal with pagination
    • Based on the response body above, it seems like the best option for Pagination Type is Link in Response Body
    • For the required Next Page JSON/XML Field Path parameter, I've tried $.paging.next and paging/next . Neither works. (A rough sketch of the pagination behaviour I expect is shown after this list.)
    • I have verified that the link in /paging/next works when opened in Chrome
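
For reference, this is roughly the pagination behaviour I expect Link in Response Body to perform -- a minimal Python sketch against the sample body above, not anything taken from the plugin itself (auth omitted here for brevity; see the Authentication section):

import requests

url = "https://api.example.io/v2/somedata?return_count=false"

# Follow the link in the response body until there is no next page.
while url:
    body = requests.get(url).json()   # auth omitted; see the Authentication section

    for record in body.get("data", []):
        print(record)                 # each element should look like the sample record above

    # This is the value the Next Page JSON/XML Field Path ($.paging.next) should resolve to:
    url = body.get("paging", {}).get("next")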

Authentication

  • When simply trying to view the response URL in Chrome, a prompt pops up asking for a username and password
    • Only the API key needs to be entered as the username to get past this prompt in Chrome
    • To do this in the Data Fusion HTTP plugin, the API key is used as the Username in the Basic Authentication section (a quick check of this is sketched after this list)
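
A quick way I can verify the key outside of Data Fusion (a small Python sketch; passing the API key as the username with an empty password is my assumption based on the Chrome prompt):

import requests

API_KEY = "..."  # placeholder for the real 3rd party API key

resp = requests.get(
    "https://api.example.io/v2/somedata?return_count=false",
    auth=(API_KEY, ""),  # API key as the username, empty password
)
print(resp.status_code)               # expecting 200 if the key is accepted
print(resp.json()["paging"]["next"])  # should print the next-page link from the sample body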

Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?

In answer to:

Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?

This is not the optimal way to achieve this. The best approach would be to ingest the data into Pub/Sub (see the Service APIs Overview) and then use Pub/Sub as the source for your pipeline. This provides a simple and reliable staging location for your data on its way to processing, storage, and analysis; see the documentation for the Pub/Sub API. In order to use this in conjunction with Dataflow, the steps to follow are in the official documentation: Using Pub/Sub with Dataflow.
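
A rough sketch of that ingestion step -- pull from the REST endpoint and publish each record to a Pub/Sub topic, which the pipeline then reads from (the project ID, topic name and auth scheme below are placeholders, not taken from the question):

import json
import requests
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"      # placeholder
TOPIC_ID = "somedata-ingest"   # placeholder
API_KEY = "..."                # placeholder for the 3rd party API key

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

url = "https://api.example.io/v2/somedata?return_count=false"
while url:
    body = requests.get(url, auth=(API_KEY, "")).json()

    # Publish each record as a separate message; the pipeline consumes them from Pub/Sub.
    for record in body.get("data", []):
        publisher.publish(topic_path, json.dumps(record).encode("utf-8")).result()

    url = body.get("paging", {}).get("next")  # follow pagination until exhausted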

I think your problem is in the data format that you receive. The exception:

java.lang.NullPointerException: null

occurs when you do not specify a correct output schema (no schema in this case, I believe).

Solution 1

To solve it, try configuring the HTTP Data Fusion plugin to:

  • Receive format: Text
  • Output Schema: name: user, Type: String

This should work to obtain the response from the API in string format. Once that is done, use a JSONParser to convert the string into a table-like object.
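
Conceptually, the two stages would look like this (a plain Python sketch of what the HTTP source and the downstream JSONParser produce, not actual Data Fusion code):

import json

# Stage 1 -- HTTP source with format "Text" and an output schema of a single
# string field called "user": the whole response body arrives as one string.
raw_record = {
    "user": '{"paging": {"next": null}, "data": [{"cID": "aerrfaerrf", "email": "example@aol.com"}]}'
}

# Stage 2 -- the JSONParser step turns that string back into structured rows.
parsed = json.loads(raw_record["user"])
rows = parsed["data"]  # one row per element of the "data" array
print(rows[0]["cID"], rows[0]["email"])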

Solution 2

Configure the HTTP Data Fusion plugin to:

  • Receive format: JSON
  • JSON/XML Result Path: data
  • JSON/XML Fields Mapping: include the fields you presented (see the attached screenshot and the schema sketch below)

[Screenshot: HTTP plugin configuration]
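
For completeness, the output schema for this configuration would need to declare the fields from your sample record, roughly like this (Avro-style schema JSON as Data Fusion typically displays it; the nullable types are an assumption):

{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    {"name": "cID",    "type": ["string", "null"]},
    {"name": "first",  "type": ["boolean", "null"]},
    {"name": "_id",    "type": ["string", "null"]},
    {"name": "action", "type": ["string", "null"]},
    {"name": "time",   "type": ["string", "null"]},
    {"name": "email",  "type": ["string", "null"]}
  ]
}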
