Error while inserting data in pyspark data frame

I have a sample PySpark script in which I am trying to generate a JSON structure. Below is the code:

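import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, udf
from pyspark.sql.types import StringType

def func(row):
    # Convert the incoming Row (a struct of all columns) to a plain dict.
    temp = row.asDict()
    headDict = {}
    headDict['type'] = "record"
    headDict['name'] = "source"
    headDict['namespace'] = "com.streaming.event"
    headDict['doc'] = "SCD signals from  source"
    fieldslist = []
    headDict['fields'] = fieldslist
    # Emit one {column: value} entry per column of the row.
    for i in temp:
        fieldslist.append({i: temp[i]})
    return json.dumps(headDict)
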
if __name__ == "__main__":
    spark = SparkSession.builder.master("local[*]").appName("PythonWordCount").getOrCreate()
    payload=udf(func,StringType())
    data = spark.createDataFrame(
        [
            (1, "a", 'foo1'),  # create your data here, be consistent in the types.
            (2, "b", 'bar'),
            (3, "c", 'mnc')
        ],
        ['id', 'nm', 'txt']  # add your columns label here
    )
    df=data.withColumn("payload1",payload(struct([data[x] for x in data.columns])))
    df.show(3,False)

I get an error while inserting the data into the DataFrame:

  raise ValueError("Unexpected tuple %r with StructType" % obj)
ValueError: Unexpected tuple '{"namespace": "com.streaming.event", "type": "record", "name": "source", "fields": [{"txt": "mnc"}, {"id": 3}, {"nm": "c"}], "doc": "SCD signals from  source"}' with StructType

If I print the JSON payload, I get the correct output:

{"namespace": "com.streaming.event", "type": "record", "name": "source", "fields": [{"txt": "mnc"}, {"id": 3}, {"nm": "c"}], "doc": "SCD signals from  source"}

I have also verified that this is valid JSON.
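For reference, the payload logic can be checked outside Spark. The following is a minimal sketch, assuming a plain dict stands in for `row.asDict()`; `build_payload` is a hypothetical helper, not part of the code above. It sorts the keys so the output does not depend on the dict ordering of the Python version in use.

```python
import json


def build_payload(record):
    """Build the same JSON payload as func(), but from a plain dict.

    `build_payload` is a hypothetical helper introduced for this sketch.
    """
    head = {
        "type": "record",
        "name": "source",
        "namespace": "com.streaming.event",
        "doc": "SCD signals from  source",
        # sorted() keeps the field order stable across Python versions.
        "fields": [{key: record[key]} for key in sorted(record)],
    }
    # sort_keys=True makes the top-level key order deterministic as well.
    return json.dumps(head, sort_keys=True)


payload = build_payload({"id": 3, "nm": "c", "txt": "mnc"})
parsed = json.loads(payload)  # raises ValueError if the string is not valid JSON
print(parsed["fields"])
```

If `json.loads` accepts the string here, the payload itself is well formed, which suggests the failure lies in how Spark handles the UDF result rather than in the JSON.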

I am not sure what I am missing here.

Could this be a Python version issue? I am using Python 2.7.

Update: I ran the exact same code with Python 3.7 and it runs fine.

It works for me in Spark 3.x with Python 2.7.x:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Python version 2.7.17 (default, Jul 20 2020 15:37:01)
SparkSession available as 'spark'.

Results from the pyspark shell:

import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, udf
from pyspark.sql.types import StringType

def func(row):
    temp=row.asDict()
    headDict = {}
    headDict['type'] = "record"
    headDict['name'] = "source"
    headDict['namespace'] = "com.streaming.event"
    headDict['doc'] = "SCD signals from  source"
    fieldslist = []
    headDict['fields'] = fieldslist
    for i in temp:
        fieldslist.append({i:temp[i]})
    return (json.dumps(headDict))

spark = SparkSession.builder.master("local[*]").appName("PythonWordCount").getOrCreate()
payload=udf(func,StringType())
data = spark.createDataFrame([(1, "a", 'foo1'),     (2, "b", 'bar'),    (3, "c", 'mnc')],['id', 'nm', 'txt'])
data.show()
'''
+---+---+----+                                                                  
| id| nm| txt|
+---+---+----+
|  1|  a|foo1|
|  2|  b| bar|
|  3|  c| mnc|
+---+---+----+
'''


df=data.withColumn("payload1",payload(struct([data[x] for x in data.columns])))
df.show(3,False)
'''
+---+---+----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |nm |txt |payload1                                                                                                                                                        |
+---+---+----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |a  |foo1|{"namespace": "com.streaming.event", "type": "record", "name": "source", "fields": [{"txt": "foo1"}, {"id": 1}, {"nm": "a"}], "doc": "SCD signals from  source"}|
|2  |b  |bar |{"namespace": "com.streaming.event", "type": "record", "name": "source", "fields": [{"txt": "bar"}, {"id": 2}, {"nm": "b"}], "doc": "SCD signals from  source"} |
|3  |c  |mnc |{"namespace": "com.streaming.event", "type": "record", "name": "source", "fields": [{"txt": "mnc"}, {"id": 3}, {"nm": "c"}], "doc": "SCD signals from  source"} |
+---+---+----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
'''
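Since the same code behaves differently across setups and the question does not state its Spark version, it may help to record both versions when comparing environments. A minimal sketch, assuming it is run on the driver; `describe_environment` is a hypothetical helper, and it only reports the driver's interpreter, not the one used by the workers.

```python
import sys


def describe_environment():
    """Return the Python version and, if importable, the PySpark version.

    `describe_environment` is a hypothetical helper for this sketch.
    """
    info = {"python": "%d.%d.%d" % tuple(sys.version_info[:3])}
    try:
        import pyspark  # only available where PySpark is installed
        info["pyspark"] = pyspark.__version__
    except ImportError:
        # PySpark is not installed in this interpreter.
        info["pyspark"] = None
    return info


env = describe_environment()
print(env)
```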
