
Read from Kafka and write to HDFS in Parquet

I am new to the big data ecosystem and just getting started.

I have read several articles about reading a Kafka topic using Spark Streaming, but would like to know if it is possible to read from Kafka using a Spark batch job instead of streaming. If yes, could you point me to some articles or code snippets that can get me started?

The second part of my question is about writing to HDFS in Parquet format. Once I read from Kafka, I assume I will have an RDD. Convert this RDD into a DataFrame and then write the DataFrame as a Parquet file. Is this the right approach?

Any help appreciated.

Thanks

For reading data from Kafka and writing it to HDFS in Parquet format using a Spark batch job instead of streaming, you can use Spark Structured Streaming.

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

It comes with Kafka as a built-in source, i.e., we can poll data from Kafka. It is compatible with Kafka broker versions 0.10.0 or higher.

For pulling the data from Kafka in batch mode, you can create a Dataset/DataFrame for a defined range of offsets.

// Subscribe to 1 topic defaults to the earliest and latest offsets
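// Note: the .as[(String, String)] conversions below assume that
// `import spark.implicits._` is in scope (spark-shell imports it automatically).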
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to multiple topics, specifying explicit Kafka offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern, at the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

Each row in the source has the following schema:

| Column           | Type          |
|:-----------------|--------------:|
| key              |        binary |
| value            |        binary |
| topic            |        string |
| partition        |           int |
| offset           |          long |
| timestamp        |          long |
| timestampType    |           int |
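
Both key and value come back as binary, so they usually need to be cast and decoded before further processing. Below is a minimal sketch of parsing the value column, assuming the messages carry JSON payloads; the field names and types in payloadSchema are hypothetical and would need to match your actual data.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

// Hypothetical schema of the JSON payload carried in the Kafka message value
val payloadSchema = new StructType()
  .add("id", StringType)
  .add("amount", DoubleType)

// value is binary, so cast it to a string first, then parse the JSON
val parsed = df
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), payloadSchema).as("data"))
  .select("data.*")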

Now, to write the data to HDFS in Parquet format, the following code can be written:

// Replace the placeholder NameNode address and path with your HDFS location
df.write.parquet("hdfs://namenode:8020/path/data.parquet")
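
Putting the read and write steps together, a complete batch job could look roughly like the sketch below; the bootstrap servers, topic name and output path are placeholders and would need to be replaced with your own values.

// Read a bounded range of offsets from Kafka in batch mode,
// decode key/value, and write the result to HDFS as Parquet.
val batch = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

batch.write
  .mode("append")   // or "overwrite", depending on how the output is managed
  .parquet("hdfs://namenode:8020/data/topic1_parquet")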

For more information on Spark Structured Streaming + Kafka, please refer to the following guide - Kafka Integration Guide

I hope it helps!

You already have a couple of good answers on the topic.

Just wanted to stress one point - be careful about streaming directly into a Parquet table. Parquet's performance shines when the row group sizes are large enough (for simplicity, you can say the file size should be on the order of 64-256 MB, for example) to take advantage of dictionary compression, bloom filters, etc. (One Parquet file can contain multiple row chunks, and normally does have multiple row chunks per file, although row chunks can't span multiple Parquet files.)

If you're streaming directly into a Parquet table, you'll very likely end up with a bunch of tiny Parquet files (depending on the mini-batch size of Spark Streaming and the volume of data). Querying such files can be very slow. Parquet may need to read all the files' headers to reconcile the schema, for example, and that is a big overhead. If this is the case, you will need a separate process that, as a workaround, reads the older files and rewrites them "merged" (this wouldn't be a simple file-level merge; the process would actually need to read in all the Parquet data and spill out larger Parquet files).
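
As a rough illustration, such a compaction job could be as simple as the sketch below; the input and output paths and the target file count are hypothetical, and in practice you would only compact partitions that the streaming job is no longer writing to.

// Hypothetical compaction job: read the many small Parquet files written by
// the streaming job and rewrite them as a smaller number of larger files.
val small = spark.read.parquet("hdfs://namenode:8020/data/landing")

small
  .coalesce(8)        // collapse the output into a handful of larger files
  .write
  .mode("overwrite")
  .parquet("hdfs://namenode:8020/data/compacted")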

This workaround may kill the original purpose of data "streaming". You could also look at other technologies here, like Apache Kudu, Apache Kafka, Apache Druid, Kinesis, etc., that may work better here.

Update: since I posted this answer, there is now a strong new player here - Delta Lake: https://delta.io/. If you're used to Parquet, you'll find Delta very attractive (actually, Delta is built on top of the Parquet layer + metadata). Delta Lake offers:

  • ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
  • Scalable metadata handling: Leverages Spark's distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.
  • Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
  • Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
  • Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
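
As a minimal sketch (assuming the Delta Lake library is on the classpath and using a placeholder output path), writing a DataFrame, such as the one read from Kafka above, as a Delta table instead of plain Parquet only changes the output format:

// Write the decoded Kafka data as a Delta table instead of plain Parquet.
// Requires the Delta Lake dependency; the path below is a placeholder.
df.write
  .format("delta")
  .mode("append")
  .save("hdfs://namenode:8020/data/delta_table")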

Use Kafka Streams. Spark Streaming is a misnomer (it's mini-batch under the hood, at least up to 2.2).

https://eng.verizondigitalmedia.com/2017/04/28/Kafka-to-Hdfs-ParquetSerializer/
