Passing Python dictionary key values into a dataframe where clause in PySpark

Question

How can I pass a Python dictionary key value into a dataframe where clause in PySpark?

The Python dictionary is as follows:

column_dict = {'email': 'customer_email_addr',
               'addr_bill': 'crq_st_addr',
               'addr_ship': 'ship_to_addr',
               'zip_bill': 'crq_zip_cd',
               'zip_ship': 'ship_to_zip',
               'phone_bill': 'crq_cm_phone',
               'phone_ship': 'ship_to_phone'}

I have a Spark dataframe with around 3 billion records. The dataframe is built as follows:

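# Adjacent string literals are concatenated, so the query can span several lines.
source_sql = ("select cust_id, customer_email_addr, crq_st_addr, ship_to_addr, "
              "crq_zip_cd, ship_to_zip, crq_cm_phone, ship_to_phone "
              "from odl.cust_master "
              "where trans_dt >= '{}' and trans_dt <= '{}'").format('2017-11-01', '2018-10-31')

cust_id_m = hiveCtx.sql(source_sql)
# Cache the dataframe that was just created (the original called cache() on the undefined name cust_id).
cust_id_m.cache()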

My intention is to find the distinct valid customers for email, address, zip and phone by running a loop over the dictionary keys above. When I test this in the Spark shell for one key value, as below:

>>> cust_id_risk_m=cust_id_m.selectExpr("cust_id").where( 
("cust_id_m.'{}'").format(column_dict['email'])  != ''  ).distinct()

I'm getting the error below. I need expert assistance in resolving this.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/mapr/spark/spark-2.1.0/python/pyspark/sql/dataframe.py", line 1026, in filter
    raise TypeError("condition should be string or Column")
TypeError: condition should be string or Column
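
The message is consistent with how Python evaluates the argument before Spark receives it: the `!=` comparison is carried out between two Python strings, so `where` is handed a `bool` rather than a string or a Column. A minimal sketch in plain Python (no Spark session needed), using only the `email` entry of `column_dict` from the question; the variable names `wrong_condition` and `right_condition` are illustrative only:

```python
column_dict = {'email': 'customer_email_addr'}

# The comparison runs in Python first, so the result is a bool, not a filter condition.
wrong_condition = ("cust_id_m.'{}'").format(column_dict['email']) != ''

# Keeping the comparison inside the string yields a SQL expression instead.
right_condition = "{} != ''".format(column_dict['email'])

print(type(wrong_condition).__name__, wrong_condition)
print(right_condition)
```

A string like `right_condition` is the kind of value the `filter`/`where` check in the traceback accepts.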

Answer 1

Can you try using the get method on your dictionary? I tested this with the dataframe below:

# The date must be passed as a string; a bare 2018-12-05 is not a valid date literal in Python.
df = spark.sql("select emp_id, emp_name, emp_city, emp_salary from udb.emp_table where emp_joining_date >= '{}'".format('2018-12-05'))

>>> df.show(truncate=False)
+------+----------------------+--------+----------+
|emp_id|emp_name              |emp_city|emp_salary|
+------+----------------------+--------+----------+
|1     |VIKRANT SINGH RANA    |NOIDA   |10000     |
|3     |GOVIND NIMBHAL        |DWARKA  |92000     |
|2     |RAGHVENDRA KUMAR GUPTA|GURGAON |50000     |
+------+----------------------+--------+----------+

thedict={"CITY":"NOIDA"}

>>> newdf = df.selectExpr("emp_id").where("emp_city ='{}'".format(thedict.get('CITY'))).distinct()
>>> newdf.show()
+------+
|emp_id|
+------+
|     1|
+------+

Or can you share sample data for your dataframe?
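
The same string-formatting idea could be extended to the loop over dictionary keys described in the question. The sketch below only builds the condition strings, so it runs without Spark. The helper name `build_conditions` is hypothetical, and treating "valid" as "not null and not blank" is an assumption, since the question does not define it:

```python
column_dict = {'email': 'customer_email_addr',
               'addr_bill': 'crq_st_addr',
               'addr_ship': 'ship_to_addr',
               'zip_bill': 'crq_zip_cd',
               'zip_ship': 'ship_to_zip',
               'phone_bill': 'crq_cm_phone',
               'phone_ship': 'ship_to_phone'}


def build_conditions(columns):
    """Map each dictionary key to a Spark SQL filter string (hypothetical helper)."""
    # Assumption: a value counts as valid when it is neither NULL nor blank.
    return {
        key: "`{0}` IS NOT NULL AND trim(`{0}`) != ''".format(col)
        for key, col in columns.items()
    }


conditions = build_conditions(column_dict)
for key, condition in conditions.items():
    print(key, "->", condition)
    # With the question's dataframe in scope, each string could be applied as:
    # cust_id_m.where(condition).select("cust_id").distinct()
```

Filtering before selecting `cust_id` keeps the filtered column visibly in scope. The answer's `selectExpr(...).where(...)` order also worked in its test.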

Answer 1 | 2 | accepted | 2018-12-14 14:51:23
