Get 2 different data from 1 kafka topic into 2 dataframes
I have a homework assignment like this:

I have a problem reading different files from 1 topic: 2 folders contain JSON files, but one holds song data and the other holds logs. How can I get each kind of data separately from just 1 topic?
Unclear why you cannot just use two topics, one for each file, especially if they don't have matching schemas, which will be important for SparkSQL.
How can I get their own data from just 1 topic?
It begins at step 2.
Write the data to your single topic in a format like so (content used for example purposes only):
{"type": "song", "content": "..."}
or
{"type": "log", "content": "..."}
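As a minimal sketch of building that envelope on the producer side (the helper name and record fields here are illustrative, not from the original answer; a real producer would then send the resulting string to Kafka):

```python
import json

def tag_record(record_type, content):
    """Wrap a raw record in the {"type": ..., "content": ...} envelope."""
    return json.dumps({"type": record_type, "content": content})

# Tag a song record and a log record before producing them to the same topic.
song_msg = tag_record("song", {"title": "Some Song", "artist": "Someone"})
log_msg = tag_record("log", {"event": "play", "ts": 0})
```

With this envelope, every message on the topic is self-describing, so any downstream consumer can route it without knowing which folder it came from.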
Then, in SparkSQL, you can do something like this:
df = spark.readStream.format("kafka")...  # TODO: apply a schema to the data to get a "type" column
song_data = df.where(df["type"] == "song").select("content")
log_data = df.where(df["type"] == "log").select("content")
You could also do the same filtering with a plain Kafka consumer (e.g. kafka-python) and not need dataframes or a Spark environment.
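A minimal sketch of that consumer-side filtering, assuming messages arrive as the tagged JSON strings shown above (a real script would iterate over a `KafkaConsumer`; here a plain list stands in for the consumed messages):

```python
import json

def split_by_type(messages):
    """Route tagged JSON messages into separate song and log lists."""
    songs, logs = [], []
    for raw in messages:
        msg = json.loads(raw)
        if msg["type"] == "song":
            songs.append(msg["content"])
        else:
            logs.append(msg["content"])
    return songs, logs

# Stand-in for messages consumed from the single topic.
messages = [
    '{"type": "song", "content": {"title": "t"}}',
    '{"type": "log", "content": {"event": "e"}}',
]
songs, logs = split_by_type(messages)
```

This is the same branch-on-`type` logic as the SparkSQL `where` filters, just applied message by message in plain Python.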