

How to download content from an HTTP source and ingest data to HDFS using Spark

I have a use case to download content from an HTTP source and ingest it into HDFS using Python. The data available at the source is not live data; some of its content changes once a week, and I have to download the updated content every week. The number of files to download would be 50k to 80k, and I have to do this with multithreading.

I have a couple of questions:

  1. Can I use Spark for this scenario? If so, please tell me how to use it, or point me to some Spark resources.

  2. If Spark is not a good fit for the scenario, then what else could I use?

Thanks in advance.
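The multithreaded download mentioned above can be sketched with Python's standard `concurrent.futures` module. This is a minimal sketch: the URL list and the body of `download_one` are placeholders, not real endpoints from the question.

```python
from concurrent.futures import ThreadPoolExecutor

def download_one(url):
    # Placeholder: a real implementation would fetch `url` (e.g. with
    # requests) and write the response body to local disk or HDFS.
    return "downloaded " + url

# Placeholder URLs standing in for the 50k-80k weekly files
urls = ["http://example.com/file_%d" % i for i in range(5)]

# Download in parallel; tune max_workers to your bandwidth and the
# source server's rate limits
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(download_one, urls))

print(len(results))  # one result per URL
```

`pool.map` preserves input order and re-raises any exception from a worker, so failed downloads surface instead of being silently dropped.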

Spark would be a good choice if you want to move the file into HDFS.

This thread provides two ways of achieving this. Adding a file to the Spark context:

 from pyspark import SparkFiles
 from pyspark.sql import SparkSession

 spark = SparkSession.builder.appName("test").getOrCreate()

 # addFile downloads the remote file to a temporary directory on each node;
 # SparkFiles.get resolves its local path
 url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
 spark.sparkContext.addFile(url)

 df = spark.read.option("sep", "\t").csv("file://" + SparkFiles.get("clickstream-jawiki-2017-11.tsv.gz"))
 df.show(10)

Using wget:

 import wget

 url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
 local_path = '/tmp/wikipediadata/clickstream-jawiki-2017-11.tsv.gz'

 # download to the local filesystem, then read it with Spark
 wget.download(url, local_path)

 df = spark.read.option("sep", "\t").csv('file://' + local_path)
 df.show(10)

