Set startingPosition in Event Hub on Databricks
I am trying to read a stream of events from EventHub using PySpark. I have a problem setting the starting position to the beginning of the stream. It is clear how to do this in Scala, but in Python I keep getting:
org.json4s.package$MappingException: No usable value for offset.
This is my configuration:
conf = {
    "eventhubs.connectionString":
        "Endpoint=sb://XXXX;SharedAccessKeyName=XXX;SharedAccessKey=XXXX;EntityPath=XXXX",
    "eventhubs.consumerGroup": "$Default",
    "eventhubs.startingPosition": "-1"
}
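In Scala
import org.apache.spark.eventhubs.{EventHubsConf, EventPosition}
val cs = "YOUR.CONNECTION.STRING"
// fromEndOfStream reads only new events; use EventPosition.fromStartOfStream to read from the beginning
val ehConf = EventHubsConf(cs)
  .setStartingPosition(EventPosition.fromEndOfStream)
Reference: Event Hubs Configuration in Scala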
In Python via PySpark
import json

# connectionString holds your Event Hubs connection string
ehConf = {'eventhubs.connectionString': connectionString}

# Time window to read, as UTC timestamps
startTime = "2020-04-07T01:05:05.662231Z"
endTime = "2020-04-07T01:15:05.662185Z"

startingEventPosition = {
    "offset": None,             # not in use
    "seqNo": -1,                # not in use
    "enqueuedTime": startTime,
    "isInclusive": True
}
endingEventPosition = {
    "offset": None,             # not in use
    "seqNo": -1,                # not in use
    "enqueuedTime": endTime,
    "isInclusive": True
}

# Put the positions into the Event Hub config dictionary as JSON strings
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)
ehConf["eventhubs.endingPosition"] = json.dumps(endingEventPosition)

df = spark.read.format("eventhubs").options(**ehConf).load()
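The example above starts from an enqueued time. To start from the beginning of the stream, as the question asks, the same JSON structure can carry the offset "-1" instead. The error in the question suggests that the connector tries to parse the option value as a JSON position object, so a bare "-1" string fails. Below is a minimal sketch under that assumption; the helper name event_position is hypothetical and the connection string is a placeholder.

```python
import json


def event_position(offset=None, enqueued_time=None, inclusive=True):
    """Serialize an Event Hubs position as a JSON string (hypothetical helper)."""
    return json.dumps({
        "offset": offset,               # "-1" means the start of the stream
        "seqNo": -1,                    # not in use
        "enqueuedTime": enqueued_time,  # not in use when an offset is given
        "isInclusive": inclusive,
    })


ehConf = {
    "eventhubs.connectionString": "<< CONNECTION STRING >>",  # placeholder
    "eventhubs.consumerGroup": "$Default",
    # Pass the serialized position object, not the bare string "-1"
    "eventhubs.startingPosition": event_position(offset="-1"),
}

# In a Databricks notebook, where `spark` is already defined:
# df = spark.readStream.format("eventhubs").options(**ehConf).load()
```

The same helper can build a time-based position with event_position(enqueued_time=startTime).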
In Python via SDK
Consume events from an Event Hub asynchronously
import logging
import asyncio
from azure.eventhub.aio import EventHubConsumerClient

connection_str = '<< CONNECTION STRING FOR THE EVENT HUBS NAMESPACE >>'
consumer_group = '<< CONSUMER GROUP >>'
eventhub_name = '<< NAME OF THE EVENT HUB >>'

logger = logging.getLogger("azure.eventhub")
logging.basicConfig(level=logging.INFO)

async def on_event(partition_context, event):
    logger.info("Received event from partition {}".format(partition_context.partition_id))
    await partition_context.update_checkpoint(event)

async def receive():
    client = EventHubConsumerClient.from_connection_string(connection_str, consumer_group, eventhub_name=eventhub_name)
    async with client:
        await client.receive(
            on_event=on_event,
            starting_position="-1",  # "-1" is from the beginning of the partition.
        )
        # receive events from specified partition:
        # await client.receive(on_event=on_event, partition_id='0')

if __name__ == '__main__':
    asyncio.run(receive())
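Besides "-1", the starting_position argument of receive() also accepts "@latest" (new events only), a datetime, or a dictionary that maps partition IDs to positions, and starting_position_inclusive controls whether the event at that position is included. The sketch below only builds the keyword arguments, so it runs without a connection; receive_options is a hypothetical helper, not part of the SDK.

```python
from datetime import datetime, timedelta, timezone


def receive_options(start="-1", inclusive=False):
    """Build keyword arguments for EventHubConsumerClient.receive() (hypothetical helper)."""
    return {
        "starting_position": start,
        "starting_position_inclusive": inclusive,
    }


from_beginning = receive_options("-1")        # oldest event still retained
only_new = receive_options("@latest")         # events enqueued from now on
last_hour = receive_options(datetime.now(timezone.utc) - timedelta(hours=1))
per_partition = receive_options({"0": "-1", "1": "@latest"})

# Usage inside the receive() coroutine:
# await client.receive(on_event=on_event, **from_beginning)
```

When a checkpoint store is configured and already holds a checkpoint for a partition, the SDK resumes from that checkpoint and the starting position is only used for partitions without one.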
Consume events from an Event Hub in batches asynchronously
import logging
import asyncio
from azure.eventhub.aio import EventHubConsumerClient

connection_str = '<< CONNECTION STRING FOR THE EVENT HUBS NAMESPACE >>'
consumer_group = '<< CONSUMER GROUP >>'
eventhub_name = '<< NAME OF THE EVENT HUB >>'

logger = logging.getLogger("azure.eventhub")
logging.basicConfig(level=logging.INFO)

async def on_event_batch(partition_context, events):
    logger.info("Received event from partition {}".format(partition_context.partition_id))
    await partition_context.update_checkpoint()

async def receive_batch():
    client = EventHubConsumerClient.from_connection_string(connection_str, consumer_group, eventhub_name=eventhub_name)
    async with client:
        await client.receive_batch(
            on_event_batch=on_event_batch,
            starting_position="-1",  # "-1" is from the beginning of the partition.
        )
        # receive events from specified partition:
        # await client.receive_batch(on_event_batch=on_event_batch, partition_id='0')

if __name__ == '__main__':
    asyncio.run(receive_batch())
Consume events and save checkpoints using a checkpoint store
import asyncio
from azure.eventhub.aio import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblobaio import BlobCheckpointStore

connection_str = '<< CONNECTION STRING FOR THE EVENT HUBS NAMESPACE >>'
consumer_group = '<< CONSUMER GROUP >>'
eventhub_name = '<< NAME OF THE EVENT HUB >>'
storage_connection_str = '<< CONNECTION STRING FOR THE STORAGE >>'
container_name = '<<NAME OF THE BLOB CONTAINER>>'

async def on_event(partition_context, event):
    # do something
    await partition_context.update_checkpoint(event)  # Or update_checkpoint every N events for better performance.

async def receive(client):
    await client.receive(
        on_event=on_event,
        starting_position="-1",  # "-1" is from the beginning of the partition.
    )

async def main():
    checkpoint_store = BlobCheckpointStore.from_connection_string(storage_connection_str, container_name)
    client = EventHubConsumerClient.from_connection_string(
        connection_str,
        consumer_group,
        eventhub_name=eventhub_name,
        checkpoint_store=checkpoint_store,  # For load balancing and checkpoint. Leave None for no load balancing
    )
    async with client:
        await receive(client)

if __name__ == '__main__':
    asyncio.run(main())
Reference: Event Hubs Configuration in Python