Getting 2 different kinds of data from 1 Kafka topic into 2 dataframes

Question

I have a homework assignment like this:

Use Python to read the JSON files in 2 folders, song_data and log_data.
Use Python Kafka to publish a mixture of both song_data and log_data file types into a Kafka topic.
Use PySpark to consume data from the above Kafka topic.
Use stream processing to consume messages from song_data and create 2 dataframes, songs and artists, and from log_data generate the dataframes users and time.
Create songplays from the dataframes of the dimension tables.

I have a problem reading different file types from 1 topic: both folders contain JSON files, but one holds song data and the other holds log data. How can I get each kind of data from just 1 topic?

Answer 1

It is unclear why you cannot just use two topics, one for each file type, especially if they don't have matching schemas, which will be important for SparkSQL.

How can I get each kind of data from just 1 topic?

It begins at step 2, when you publish to the topic.

Write the data to your single topic in a format like this (content is used for example purposes only):

{"type": "song", "content": "..."}

or

{"type": "log", "content": "..."}
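As a rough sketch of that publishing step, the records could be wrapped like this before being sent. The helper names `envelope` and `publish_folder` are hypothetical, and the code assumes each non-empty line of a file is one JSON object, which may not match your files. The `send` argument is any callable taking a topic and a value, so the producer itself stays outside the sketch.

```python
import json
from pathlib import Path


def envelope(record_type, record):
    """Wrap one record so consumers can tell song and log messages apart."""
    return json.dumps({"type": record_type, "content": record}).encode("utf-8")


def publish_folder(send, topic, folder, record_type):
    """Send every JSON record under `folder` to `topic`; return the count sent.

    `send` is any callable taking (topic, value), e.g. KafkaProducer.send.
    Assumption: one JSON object per non-empty line of each file.
    """
    count = 0
    for path in sorted(Path(folder).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    send(topic, envelope(record_type, json.loads(line)))
                    count += 1
    return count
```

With kafka-python this might be called as `publish_folder(producer.send, "events", "song_data", "song")` and again with `"log_data", "log"`, where `producer` is a `KafkaProducer` and the topic name `events` is only a placeholder.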

Then, in SparkSQL, you can do something like this:

df = spark.readStream.format("kafka")...  # TODO: apply a schema to the data to get a "type" column
# PySpark selects a column with df["type"]; df("type") is Scala syntax
song_data = df.where(df["type"] == "song").select("content")
log_data = df.where(df["type"] == "log").select("content")
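One way the TODO might be filled in is sketched below. Instead of a full schema it uses `get_json_object` to pull the `type` and `content` fields out of the Kafka value, which is a swapped-in shortcut, not necessarily what the assignment expects. The function name `split_topic`, the broker address and the topic name `events` are all assumptions.

```python
def split_topic(spark, bootstrap_servers="localhost:9092", topic="events"):
    """Return (song_data, log_data) streaming dataframes read from one topic."""
    # Imported here so this module can be loaded without Spark installed.
    from pyspark.sql.functions import col, get_json_object

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)  # assumed address
        .option("subscribe", topic)  # assumed topic name
        .option("startingOffsets", "earliest")
        .load()
    )
    # Kafka delivers the payload as bytes in the "value" column.
    json_str = col("value").cast("string")
    df = raw.select(
        get_json_object(json_str, "$.type").alias("type"),
        get_json_object(json_str, "$.content").alias("content"),
    )
    song_data = df.where(col("type") == "song").select("content")
    log_data = df.where(col("type") == "log").select("content")
    return song_data, log_data
```

Each returned dataframe still holds `content` as a string, so it would need `from_json` with its own song or log schema before building the dimension tables. Reading from Kafka also requires the `spark-sql-kafka-0-10` package on the Spark classpath.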

You could also do the same filtering in Python-Kafka, without needing dataframes or a Spark environment.
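A minimal sketch of that consumer-side filtering, assuming every message value is a JSON envelope with `type` and `content` keys as shown earlier. The function name `route_by_type` is hypothetical, and it takes raw message values so it can be tried without a broker.

```python
import json
from collections import defaultdict


def route_by_type(raw_values):
    """Group the content of each envelope message under its "type" value."""
    buckets = defaultdict(list)
    for raw in raw_values:
        message = json.loads(raw)  # accepts bytes or str
        buckets[message["type"]].append(message["content"])
    return dict(buckets)
```

With kafka-python the values could come from a consumer, for example `route_by_type(m.value for m in consumer)` where `consumer` is a `KafkaConsumer` subscribed to the topic and created with `consumer_timeout_ms` set so that iteration eventually stops.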

Solution 1
1 accepted, 2022-04-13 21:47:04