
Filtering a Spark RDD by querying MySQL

I use Spark Streaming to stream data from Kafka, and I want to filter the records based on data stored in MySQL.

For example, I get data from Kafka like this:

{"id":1, "data":"abcdefg"}

and there is data in MySQL like this:

id  | state  
1   | "success"

I need to query MySQL to get the state for a given id. I can open a MySQL connection inside the filter function, and it works. The code looks like this:

def isSuccess(x):
    id = x["id"]
    sql = """
        SELECT * 
        FROM Test
        WHERE id = "{0}"
        """.format(id)
    conn = mysql_connection(......)  # a new connection is opened for every row
    result = conn.query_one(sql)
    return result is not None
successRDD = rdd.filter(isSuccess)

But this opens a connection for every row of the RDD, which wastes a lot of computing resources.

How can I do this lookup in the filter without reconnecting for every row?

I suggest using mapPartitions, which is available in Apache Spark, so that a MySQL connection is initialized once per partition rather than once for every row of the RDD.
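To make the difference concrete, here is a minimal sketch in plain Python, without Spark, that counts how many connections each approach opens. The names FakeConnection, filter_per_row and filter_per_partition are hypothetical, and the two lists only stand in for RDD partitions; in Spark, rdd.mapPartitions calls the function once per partition with an iterator over that partition's rows.

```python
class FakeConnection(object):
    """Hypothetical stand-in for a MySQL connection that counts how often it is opened."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1
        self.states = {"1": "success", "2": "stopped"}

    def has_id(self, row_id):
        return row_id in self.states


def filter_per_row(row):
    # Opens a new connection for every row, as in the question.
    return FakeConnection().has_id(row[0])


def filter_per_partition(rows):
    # Opens one connection, then reuses it for every row in the partition.
    con = FakeConnection()
    for row in rows:
        if con.has_id(row[0]):
            yield row


# Two lists standing in for two RDD partitions (an assumption for illustration).
partitions = [[["1", "afdasds"], ["2", "dfsdfada"]], [["3", "dsfdsf"]]]

FakeConnection.opened = 0
per_row = [row for part in partitions for row in part if filter_per_row(row)]
per_row_opened = FakeConnection.opened

FakeConnection.opened = 0
per_partition = [row for part in partitions for row in filter_per_partition(iter(part))]
per_partition_opened = FakeConnection.opened

print(per_row_opened, per_partition_opened)
```

Both approaches keep the same rows; the per-row version opens one connection per row, while the per-partition version opens one per partition.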

This is the MySQL table that I created:

create table test2(id varchar(10), state varchar(10));

With the following values:

+------+---------+
| id   | state   |
+------+---------+
| 1    | success |
| 2    | stopped |
+------+---------+

Use the following PySpark code as a reference:

import MySQLdb

# Sample data; in your case this would be the streaming data.
data1 = [["1", "afdasds"], ["2", "dfsdfada"], ["3", "dsfdsf"]]
rdd = sc.parallelize(data1)

def func1(rows):
    # One connection per partition instead of one per row.
    con = MySQLdb.connect(host="127.0.0.1", user="root", passwd="yourpassword", db="yourdb")
    try:
        c = con.cursor()
        c.execute("select * from test2;")
        # Build an id -> state lookup from the whole table.
        states = {}
        for x in c.fetchall():
            states[x[0]] = x[1]
    finally:
        con.close()
    list1 = []
    for x in rows:
        # Assign "none" when the streamed id has no match in the table.
        list1.append([x[0], x[1], states.get(x[0], "none")])
    return iter(list1)

print(rdd.mapPartitions(func1).filter(lambda x: x[2] != "none").collect())

The output that I got was:

[['1', 'afdasds', 'success'], ['2', 'dfsdfada', 'stopped']]
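If the table is too large to load in every partition, one variation is to look up only the ids that appear in the partition. The sketch below is an illustration under stated assumptions rather than part of the tested code above: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs on its own, and make_connection and keep_known_ids are hypothetical names. With MySQLdb the placeholder would be %s rather than ?, and the function would be passed to rdd.mapPartitions.

```python
import sqlite3


def make_connection():
    # Hypothetical stand-in: an in-memory SQLite database holding the test2 table.
    # In the real job this would be MySQLdb.connect(...).
    con = sqlite3.connect(":memory:")
    con.execute("create table test2(id varchar(10), state varchar(10))")
    con.executemany("insert into test2 values (?, ?)",
                    [("1", "success"), ("2", "stopped")])
    return con


def keep_known_ids(rows):
    # Materialize the partition so the ids can be collected first.
    rows = list(rows)
    if not rows:
        return iter([])
    ids = sorted({row[0] for row in rows})
    con = make_connection()  # one connection per partition
    try:
        # Parameterized IN query; values are bound, not formatted into the SQL.
        placeholders = ", ".join("?" for _ in ids)
        sql = "select id, state from test2 where id in ({0})".format(placeholders)
        states = dict(con.execute(sql, ids).fetchall())
    finally:
        con.close()
    # Keep only rows whose id exists in the table, with the state appended.
    return iter([[r[0], r[1], states[r[0]]] for r in rows if r[0] in states])


partition = [["1", "afdasds"], ["2", "dfsdfada"], ["3", "dsfdsf"]]
result = list(keep_known_ids(iter(partition)))
print(result)
```

Binding the ids as parameters also avoids formatting values directly into the SQL string. For very large partitions the ids would probably need to be split into batches, since databases limit the number of bound parameters per statement.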
