
Ingesting data into spark using socketTextStream

I am fetching tweets from the Twitter API and then forwarding them through a TCP connection into a socket that Spark reads from. This is my code.

For reference, a line will look something like this:

{
  "data": {
    "text": "some tweet",
    "id": some number
  },
  "matching_rules": [{"tag": "some string", "id": same number}, {"tag": ...}]
}
def ingest_into_spark(tcp_conn, stream):
    for line in stream.iter_lines():
        if line is not None:
            try:
                # print(line)
                tweet = json.loads(line)['matching_rules'][0]['tag']
                # tweet = json.loads(line)['data']['text']
                print(tweet, type(tweet), len(tweet))
                tcp_conn.sendall(tweet.encode('utf-8'))
            except Exception as e:
                print("Exception in ingesting data: ", e)

The Spark-side code:

    print(f"Connecting to {SPARK_IP}:{SPARK_PORT}...")
    input_stream = streaming_context.socketTextStream(SPARK_IP, int(SPARK_PORT)) 
    print(f"Connected to {SPARK_IP}:{SPARK_PORT}")
    tags = input_stream.flatMap(lambda line: line.strip().split())

    mapped_hashtags = tags.map(lambda hashtag: (hashtag, 1))

    counts = mapped_hashtags.reduceByKey(lambda a, b: a + b)
    counts.pprint()
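For reference, the DStream pipeline above is just a word count over whitespace-separated tags. The equivalent logic in plain Python (a non-Spark sketch of what one micro-batch would compute) is:

```python
from collections import Counter

# Plain-Python equivalent of the pipeline above (a sketch, not Spark itself):
# flatMap(split) -> map(tag -> (tag, 1)) -> reduceByKey(+)
def count_tags(batch_lines):
    tags = [t for line in batch_lines for t in line.strip().split()]  # flatMap
    pairs = [(t, 1) for t in tags]                                    # map
    counts = Counter()
    for tag, one in pairs:                                            # reduceByKey
        counts[tag] += one
    return dict(counts)

print(count_tags(['spark spark flink', 'spark']))
# {'spark': 3, 'flink': 1}
```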

Spark is not reading the data sent over the stream, no matter what I do. But when I replace the line tweet = json.loads(line)['matching_rules'][0]['tag'] with the line tweet = json.loads(line)['data']['text'], it suddenly works as expected. I have tried printing the content and type of tweet in both cases, and it is a string in both. The only difference is that the first one has the actual tweets while the second only has one word.

I have tried many different types of inputs, and hard-coding the input as well. But I cannot imagine why reading a different field of a JSON would make my code stop working.

Replacing either the client or the server with netcat shows that the data is being sent over the socket as expected in both cases.

If there is no solution to this, I would also be open to alternate ways of ingesting data into Spark that could be used in this scenario.

As described in the documentation, records (lines) in text streams are delimited by newlines (\n). Unlike print(), sendall() is a byte-oriented function and does not automatically add a newline. No matter how many tags you send with it, Spark will just keep on reading everything as a single record, waiting for the delimiter to appear. When you send the tweet text instead, it works because some tweets do contain line breaks.
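To see the effect in isolation, here is a minimal sketch using a local socket pair. It mimics how a line-delimited reader splits records (it is not Spark's actual receiver code): data sent without a trailing newline piles up into one pending record.

```python
import socket

# Minimal sketch of a line-delimited reader (assumption: this mimics how
# socketTextStream delimits records; it is not Spark's actual code).
def read_records(sock_file):
    """Yield newline-delimited records from a socket file object."""
    for raw in sock_file:          # file iteration splits on '\n'
        record = raw.rstrip('\n')
        if record:
            yield record

# Use a connected socket pair instead of a real network socket.
server, client = socket.socketpair()

# Tags sent without a trailing newline accumulate into one pending record...
client.sendall(b'tag1')
client.sendall(b'tag2')
# ...until a newline finally delimits them, as a single record.
client.sendall(b'\n')
client.close()

reader = server.makefile('r', encoding='utf-8')
records = list(read_records(reader))
reader.close()
server.close()
print(records)  # ['tag1tag2'] -- everything before the '\n' is one record
```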

Try the following and see if it makes it work:

tcp_conn.sendall((tweet + '\n').encode('utf-8'))
