Databricks PySpark: how to insert a DataFrame schema as a column in another DataFrame

Question

I have a function that generates a DataFrame:

def getdata():
    schema_1 = StructType([ StructField('path_name', StringType(), True),
                           StructField('age1', IntegerType(), True), 
                           StructField('age2', IntegerType(), True), 
                           StructField('age3', IntegerType(), True)])
    data = [('dbfs/123/sd.zip',1,2,3),('dbfs/123/ab.zip',5,6,7)]
    df = spark.createDataFrame(data,schema_1)
    return df

I need to insert that DataFrame's schema into a column of another DataFrame. The result should look like this:

root
 |-- filename: string (nullable = true)
 |-- parsed: struct (nullable = true)
 |    |-- path_name: string (nullable = true)
 |    |-- age1: integer (nullable = true)
 |    |-- age2: integer (nullable = true)
 |    |-- age3: integer (nullable = true)

I tried to do this with a UDF:

@udf(schema_1)
def my_udf(schema):
  data = getdata(schema)
  List_of_rows = data.collect()  
  return List_of_rows

and then insert the result into another DataFrame that I am creating. The complete code I am using is:

from pyspark.sql.types import *
from pyspark.sql.functions import col, udf
import pandas as pd

schema_1 = StructType([ StructField('path_name', StringType(), True),
                           StructField('age1', IntegerType(), True), 
                           StructField('age2', IntegerType(), True), 
                           StructField('age3', IntegerType(), True)])

def getdata(schema_1):    
    data = [('dbfs/123/sd.zip',1,2,3),('dbfs/123/ab.zip',5,6,7)]
    df = spark.createDataFrame(data,schema_1)
    return df


@udf(schema_1)
def my_udf(scheme):
  data = getdata(scheme)
  List_of_rows = data.collect()  
  return List_of_rows

def somefunction(schema_1):    
  pd_df = pd.DataFrame(['path'], columns = ["filename"])  
  return (spark.createDataFrame(pd_df)
          .withColumn('parsed', my_udf(schema_1))
         )

df_2 = somefunction(schema_1)

display(df_2)

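But I get an error:

Error rendering output: Report error. PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

I also suspect this is not the best way to do it. Any ideas?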

Answer 1

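Can't you just create a custom schema? Google it. Also, see the sample code below, which creates (enforces) a custom schema for you even if the first row (the header) is missing or incorrect.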

from pyspark.sql.functions import input_file_name
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)

# Every column is declared explicitly, so names and types do not depend on the header row.
customSchema = StructType([
    StructField("asset_id", StringType(), True),
    StructField("price_date", StringType(), True),

    # etc., etc., etc.,

    StructField("close_price", StringType(), True),
    StructField("filename", StringType(), True)])

Solution 1
0 2020-02-19 02:38:36
