
Problem sending data with kafka producer in Python (Jupyter Notebook)

I'm trying to build a Big Data analysis using Kafka, Python and Twitter. I have a stream of tweets from which I only take the hashtags. My problem is with the Kafka producer for Python: I can't send the data I want to the topic I created, because I don't see any option to send the content of a variable with the producer.

In https://kafka-python.readthedocs.io/en/master/usage.html I can only see the option to send an exact string with b'some_string'. But I want to send the hashtag I take from the Twitter stream. I don't know much about Python, so excuse me if the solution is obvious.

Imports:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
import kafka
from kafka import SimpleProducer, KafkaClient
from kafka import KafkaProducer

Streaming Context:

ssc = StreamingContext(sc,60)

Keys:

consumer_key="consumer_key"
consumer_secret="consumer_secret"
access_token="access_token"
access_token_secret="access_token_secret"

Tweepy:

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

Producer:

producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

Code:

class MyStreamListener(tweepy.StreamListener):

    def on_status(self, status):
        for hashtag in status.entities['hashtags']:
            prueba = b'hashtag["text"]'
            producer.send('topic', prueba)
            return True
    def on_error(self, status_code):
        if status_code == 420:
            #returning False in on_data disconnects the stream
            return False

StreamListener:

myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=MyStreamListener())

Tweet Stream:

myStream.filter(track=['some_text'])

The thing is, the producer only sends the literal string stored in prueba, that is, hashtag["text"]. I want to send not the literal text but the content of the variable.

Thanks in advance.

How about producer.send('topic', hashtag)? You will also need to encode the data to raw bytes, which is what Kafka stores. If hashtag is a simple string, you could do producer.send('topic', hashtag.encode('utf-8')). If it is a dict or a more complex data structure, you may need to use json.dumps before encoding to bytes. In your loop each hashtag is a dict, so the string to encode is hashtag["text"]. Note that b'hashtag["text"]' is a bytes literal: Python does not evaluate the expression inside the quotes, which is why the literal text gets sent. Hope this helps!
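As a rough illustration of the encoding step described in the answer, here is a minimal sketch. The helper names (hashtag_to_bytes, to_json_bytes, send_hashtags) are hypothetical and not part of kafka-python or tweepy; the sketch assumes each hashtag is a dict with a "text" key, as in the question's loop, and that the producer is any object with a send(topic, value) method, such as a kafka.KafkaProducer.

```python
import json


def hashtag_to_bytes(hashtag):
    """Return the UTF-8 bytes of a hashtag entity's text.

    Assumes `hashtag` is a dict with a "text" key, as in the question's
    status.entities['hashtags'] loop.
    """
    return hashtag["text"].encode("utf-8")


def to_json_bytes(obj):
    """Serialize a dict or list to JSON, then encode it as UTF-8 bytes."""
    return json.dumps(obj).encode("utf-8")


def send_hashtags(producer, topic, hashtags):
    """Send each hashtag's text to `topic` and return how many were sent.

    `producer` is anything with a send(topic, value) method, for example
    a kafka.KafkaProducer (assumed here, not imported).
    """
    count = 0
    for hashtag in hashtags:
        # Pass the value of the variable, not a b'...' literal.
        producer.send(topic, hashtag_to_bytes(hashtag))
        count += 1
    return count
```

Inside on_status this could be called as send_hashtags(producer, 'topic', status.entities['hashtags']). Because the helper loops over every hashtag before returning, it also avoids stopping after the first one, which the return True inside the question's for loop would do. Alternatively, KafkaProducer accepts a value_serializer argument, so a function like to_json_bytes could be passed there and plain dicts sent directly.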

