
Sending XML file content to Event Hub and read it from Databricks

I'm trying to send XML files (less than 100 KB) to an Azure Event Hub and then, after sending them, read the events in Databricks.

Now I have used the Python SDK to send the content of the XML as bytes (this step WORKS). But the next step I would like to achieve is reading that XML content from the "body" of the event and creating a Spark DataFrame using PySpark.

To be able to do this, I have a few doubts:

1- Is there any option I can pass to spark.readStream to specify that the content of the "body" of the event is XML?

2- Is there any alternative to dump that content directly into a Spark DataFrame?

3- Am I missing some configuration when sending the XML as events?

I was trying something like the example below:

Python event producer

# this is the python event hub message producer
import asyncio
from azure.eventhub.aio import EventHubProducerClient
from azure.eventhub import EventData
import xml.etree.ElementTree as ET
from pathlib import Path

connection_str= "Endpoint_str"
eventhub_name = "eventhub_name"

xml_path = Path("path/to/xmlfile.xml")

xml_data = ET.parse(xml_path)
tree = xml_data.getroot()
data = ET.tostring(tree)

async def run():
    # Create a producer client to send messages to the event hub.
    # Specify a connection string to your event hubs namespace and
    # the event hub name.
    producer = EventHubProducerClient.from_connection_string(conn_str=connection_str, eventhub_name=eventhub_name)
    async with producer:
        # Create a batch.
        event_data_batch = await producer.create_batch()

        # Add events to the batch.
        event_data_batch.add(EventData(data))

        # Send the batch of events to the event hub.
        await producer.send_batch(event_data_batch)

loop = asyncio.get_event_loop()
loop.run_until_complete(run())
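
Since each file is under 100 KB, several of them fit into a single batch. As a side note, a minimal variation of the same producer that sends every file in a folder could look like the sketch below (the folder "path/to/xml_dir" is hypothetical; EventDataBatch.add raises ValueError once the batch is full):

# Variation: send every XML file in a folder, re-using the same producer.
# "path/to/xml_dir" is a hypothetical folder of small (<100 KB) XML files.
import asyncio
from pathlib import Path
from azure.eventhub.aio import EventHubProducerClient
from azure.eventhub import EventData

async def send_folder(conn_str, eventhub_name, folder):
    producer = EventHubProducerClient.from_connection_string(
        conn_str=conn_str, eventhub_name=eventhub_name)
    async with producer:
        batch = await producer.create_batch()
        pending = 0
        for xml_file in sorted(Path(folder).glob("*.xml")):
            data = xml_file.read_bytes()  # send the raw bytes, no need to re-serialize
            try:
                batch.add(EventData(data))
            except ValueError:
                # The current batch is full: send it and start a new one.
                await producer.send_batch(batch)
                batch = await producer.create_batch()
                batch.add(EventData(data))
                pending = 0
            pending += 1
        if pending:
            await producer.send_batch(batch)

loop = asyncio.get_event_loop()
loop.run_until_complete(send_folder(connection_str, eventhub_name, "path/to/xml_dir"))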

Event reader

stream_data = spark \
    .readStream \
    .format('eventhubs') \
    .options(**event_hub_conf) \
    .option('multiLine', True) \
    .option('mode', 'PERMISSIVE') \
    .load()
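
Here event_hub_conf is the usual options dictionary for the azure-eventhubs-spark connector; for reference, a minimal sketch of how such a config can be built (the namespace, keys and consumer group below are placeholders):

# Sketch of an event_hub_conf for the azure-eventhubs-spark connector.
# The connection string must be encrypted with EventHubsUtils.encrypt.
connection_str = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key_name>;SharedAccessKey=<key>;EntityPath=<eventhub_name>"

event_hub_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_str),
    "eventhubs.consumerGroup": "$Default",
}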

Thanks!!!

So I finally came up with the following approach to read the XML from the Event Hub body.

First, I use the xml.etree.ElementTree library (imported as ET) to parse the XML structure.

stream_data = spark \
    .readStream \
    .format('eventhubs') \
    .options(**event_hub_conf) \
    .option('multiLine', True) \
    .option('mode', 'PERMISSIVE') \
    .load() \
    .select("body")

df = stream_data.withColumn("body", stream_data["body"].cast("string"))

import xml.etree.ElementTree as ET
import json

def returnV(col):
  elem_dict= {}
  tag_list = [
    './TAG/Document/id',
    './TAG/Document/car',
    './TAG/Document/motor',
    './Metadata/Date']
  
  root = ET.fromstring(col)
  
  for tag in tag_list:
    for item in root.findall(tag):
      elem_dict[item.tag] = item.text
  return json.dumps(elem_dict)

I had some nested tags, and with this method I extract all the needed values and return them as JSON. What I have learned is that Structured Streaming is not the solution if the incoming schema can change, so I took only those values that I know are not going to change over time.
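
As a quick sanity check, calling returnV on a small XML string with the same shape shows the JSON it produces (the sample document and values below are just illustrative):

sample = """
<root>
  <TAG>
    <Document>
      <id>1</id>
      <car>sedan</car>
      <motor>v6</motor>
    </Document>
  </TAG>
  <Metadata>
    <Date>2021-01-01</Date>
  </Metadata>
</root>
"""

print(returnV(sample))
# {"id": "1", "car": "sedan", "motor": "v6", "Date": "2021-01-01"}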

Then, once the method is defined, I register it as a UDF.

from pyspark.sql.functions import udf

extractValuesFromXML = udf(returnV)
XML_DF = df.withColumn("body", extractValuesFromXML("body"))

Then finally I just use the get_json_object function to extract the values from the JSON:

from pyspark.sql.functions import get_json_object

input_parsed_df = XML_DF.select(
  get_json_object("body", "$.id").cast('integer').alias("id"),
  get_json_object("body", "$.car").alias("car"),
  get_json_object("body", "$.motor").alias("motor"),
  get_json_object("body", "$.Date").alias("Date")
)
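
From there the parsed stream can be written to a sink as usual; for example, a quick sketch that dumps it to the console while debugging (trigger and checkpoint settings are omitted here):

query = (
    input_parsed_df
    .writeStream
    .format("console")           # swap for "delta"/"parquet" plus a path for a real sink
    .outputMode("append")
    .option("truncate", False)
    .start()
)

# query.awaitTermination()  # uncomment to block until the stream is stopped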
