Spark RDD filter by querying MySQL
I use Spark Streaming to stream data from Kafka, and I want to filter the records based on data stored in MySQL.
For example, I get data from Kafka like this:
{"id":1, "data":"abcdefg"}
and there is data in MySQL like this:
id | state
1 | "success"
I need to query MySQL to get the state for a given id. I can open a MySQL connection inside the filter function, and it works.
The code looks like this:
def isSuccess(x):
    id = x["id"]
    sql = """
        SELECT *
        FROM Test
        WHERE id = "{0}"
    """.format(id)
    conn = mysql_connection(......)
    result = rdbi.query_one(sql)
    if result is None:
        return False
    else:
        return True

successRDD = rdd.filter(isSuccess)
But this opens a connection for every row of the RDD, which wastes a lot of computing resources.
How should I do this in the filter?
I suggest using mapPartitions, available in Apache Spark, so that the MySQL connection is initialized once per partition instead of once for every row of the RDD.
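As a rough illustration of why this helps, here is a minimal sketch that runs without Spark or MySQL. It makes several assumptions that are not part of the original setup: `sqlite3` stands in for MySQL, `connect` and `keep_known_ids` are hypothetical names, and two plain Python iterators simulate two partitions. The counter is there only to show that one connection is opened per partition rather than per row.

```python
import sqlite3

connections_opened = 0  # illustration only: counts how often we connect


def connect():
    """Hypothetical connection factory; sqlite3 stands in for MySQL here."""
    global connections_opened
    connections_opened += 1
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE test2 (id TEXT, state TEXT)")
    con.executemany("INSERT INTO test2 VALUES (?, ?)",
                    [("1", "success"), ("2", "stopped")])
    return con


def keep_known_ids(rows):
    """Filter one partition, opening a single connection for all of its rows."""
    con = connect()
    try:
        known = {r[0] for r in con.execute("SELECT id FROM test2")}
    finally:
        con.close()
    for row in rows:
        if row[0] in known:
            yield row


# Two plain iterators simulate two partitions of an RDD (an assumption).
partitions = [iter([["1", "afdasds"], ["2", "dfsdfada"]]),
              iter([["3", "dsfdsf"]])]
kept = [row for part in partitions for row in keep_known_ids(part)]
```

Three rows are processed here but only two connections are opened, one per simulated partition. In a Spark job the same function would be passed as `rdd.mapPartitions(keep_known_ids)`, with `connect` replaced by a real MySQL connection.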
This is the MySQL table that I created:
create table test2(id varchar(10), state varchar(10));
With the following values:
+------+---------+
| id | state |
+------+---------+
| 1 | success |
| 2 | stopped |
+------+---------+
Use the following PySpark code as a reference:
import MySQLdb

# Sample data; in your case this would be the streaming data
data1 = [["1", "afdasds"], ["2", "dfsdfada"], ["3", "dsfdsf"]]
rdd = sc.parallelize(data1)

def func1(rows):
    # Runs once per partition, so one connection serves all rows in it
    con = MySQLdb.connect(host="127.0.0.1", user="root", passwd="yourpassword", db="yourdb")
    try:
        c = con.cursor()
        c.execute("select * from test2;")
        # Map each id to its state
        states = {row[0]: row[1] for row in c.fetchall()}
    finally:
        con.close()
    result = []
    for x in rows:
        # Assign "none" when the id received from the stream is not in the table
        result.append([x[0], x[1], states.get(x[0], "none")])
    return iter(result)

print(rdd.mapPartitions(func1).filter(lambda x: x[2] != "none").collect())
The output that I got was:
[['1', 'afdasds', 'success'], ['2', 'dfsdfada', 'stopped']]
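If the table is large, loading all of it in every partition may be wasteful. One possible refinement, sketched here under assumptions, is to query only the ids that occur in the current partition, using a parameterized IN clause instead of string formatting. The helper name `build_state_query` is hypothetical; `%s` is the placeholder style used by MySQLdb.

```python
def build_state_query(ids):
    """Return (sql, params) selecting the states of the given ids only.

    Hypothetical helper: the table and column names follow the test2
    example; the %s placeholders are filled in by the database driver.
    """
    unique_ids = sorted(set(ids))
    if not unique_ids:
        # Nothing to look up, so there is no query to run
        return None, ()
    placeholders = ", ".join(["%s"] * len(unique_ids))
    sql = "SELECT id, state FROM test2 WHERE id IN ({0})".format(placeholders)
    return sql, tuple(unique_ids)


sql, params = build_state_query(["1", "3", "1"])
```

Inside the partition function, the pair would then be passed to the cursor as `c.execute(sql, params)`, so the driver handles quoting of the id values.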