

Best possible way to extract data from news web API in 'Near Real Time' on Big Data Platform

I have a use case where the first step is ingestion of data from news APIs or news aggregator APIs into HDFS. This data fetch is to be done on a near-real-time (NRT) basis, say every 15 minutes. Presently I am working on two approaches:

  1. A Python-based solution (for now, it is not generic code).
  2. An Apache NiFi based framework (but NiFi seems to have some compatibility issues on distributions other than Hortonworks).
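For reference, approach 1 could be sketched roughly like this (the API URL, key, and landing path are placeholders for whatever your aggregator provides; a production version would write to HDFS, e.g. via WebHDFS, rather than the local filesystem):

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical news API endpoint -- replace with your aggregator's URL and key.
NEWS_API_URL = "https://example.com/v1/articles?apiKey=YOUR_KEY"

def fetch_articles(url: str) -> list:
    """Fetch one batch of articles from the news API as parsed JSON."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    return payload.get("articles", [])

def landing_path(base: str = "/data/news") -> str:
    """Timestamped file name so each 15-minute batch lands in its own file."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{base}/articles_{ts}.json"

def run_once() -> str:
    """One NRT cycle: fetch a batch and land it as newline-delimited JSON.

    NDJSON keeps the files splittable for downstream Hadoop jobs. In
    production, swap the local open() for an HDFS client write.
    """
    articles = fetch_articles(NEWS_API_URL)
    path = landing_path()
    with open(path, "w", encoding="utf-8") as f:
        for article in articles:
            f.write(json.dumps(article) + "\n")
    return path
```

The 15-minute cadence would come from whatever scheduler you already run, e.g. a cron entry invoking `run_once()` every 15 minutes.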

It would be great to have a few more suggestions for an approach that is platform independent and could be used across different Hadoop distributions (Cloudera, Hortonworks, etc.).

Thanks.

Apache NiFi can definitely handle your process, and it works well on Windows, macOS, and most Linux distributions (I've run it on Ubuntu, Red Hat, CentOS, Amazon Linux, and Raspbian). It doesn't need Hadoop, but it can work with either the Hortonworks or Cloudera Hadoop distributions.

I built an RSS viewer with NiFi that fetched, transformed, and saved RSS to disk using GetHTTP -> TransformXML -> PutFile. NiFi then listened for browser requests and returned the RSS as an HTML table using HandleHttpRequest -> GetFile -> TransformXML -> HandleHttpResponse.
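If it helps to see the first flow outside NiFi, here is a rough stdlib-Python equivalent, with an ElementTree transform standing in for the XSLT that TransformXML would apply (the feed URL and output path are whatever you configure):

```python
import urllib.request
import xml.etree.ElementTree as ET
from html import escape

def rss_to_html_table(rss_xml: str) -> str:
    """Render RSS item titles/links as an HTML table -- the role the
    TransformXML processor plays with an XSLT in the NiFi flow."""
    root = ET.fromstring(rss_xml)
    rows = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        rows.append(
            f'<tr><td><a href="{escape(link, quote=True)}">'
            f"{escape(title)}</a></td></tr>"
        )
    return "<table>" + "".join(rows) + "</table>"

def fetch_and_save(feed_url: str, out_path: str) -> None:
    """GetHTTP -> TransformXML -> PutFile, in plain Python."""
    with urllib.request.urlopen(feed_url, timeout=30) as resp:
        rss = resp.read().decode("utf-8")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(rss_to_html_table(rss))
```

The point is not that you should do it this way, but that NiFi gives you the same pipeline as configured processors, with back pressure, retries, and provenance for free.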
