
Databricks Pyspark + How to insert a dataframe schema as a column in a dataframe

I have a function which generates a dataframe:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def getdata():
    # 'spark' is the SparkSession provided by the Databricks notebook
    schema_1 = StructType([StructField('path_name', StringType(), True),
                           StructField('age1', IntegerType(), True),
                           StructField('age2', IntegerType(), True),
                           StructField('age3', IntegerType(), True)])
    data = [('dbfs/123/sd.zip', 1, 2, 3), ('dbfs/123/ab.zip', 5, 6, 7)]
    df = spark.createDataFrame(data, schema_1)
    return df

I need to insert that dataframe schema into a column of another dataframe. The result should be something like:

root
 |-- filename: string (nullable = true)
 |-- parsed: struct (nullable = true)
 |    |-- path_name: string (nullable = true)
 |    |-- age1: integer (nullable = true)
 |    |-- age2: integer (nullable = true)
 |    |-- age3: integer (nullable = true)
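For reference, the nested layout above can be produced without a UDF by collapsing the parsed columns into a single struct column with pyspark.sql.functions.struct. A minimal sketch (the literal 'path' filename is a placeholder, not from the original post):

from pyspark.sql import functions as F

df = getdata()
nested = df.select(
    F.lit('path').alias('filename'),  # placeholder filename value
    F.struct('path_name', 'age1', 'age2', 'age3').alias('parsed'),
)
nested.printSchema()  # filename plus the parsed columns collapsed into one struct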

I was trying to do it by using a udf:

@udf(schema_1)
def my_udf(schema):
  data = getdata(schema)
  List_of_rows = data.collect()  
  return List_of_rows

And then inserting it into another dataframe I am creating. The whole code I am using is:

from pyspark.sql.types import *
from pyspark.sql.functions import col, udf
import pandas as pd

schema_1 = StructType([StructField('path_name', StringType(), True),
                       StructField('age1', IntegerType(), True),
                       StructField('age2', IntegerType(), True),
                       StructField('age3', IntegerType(), True)])

def getdata(schema_1):    
    data = [('dbfs/123/sd.zip',1,2,3),('dbfs/123/ab.zip',5,6,7)]
    df = spark.createDataFrame(data,schema_1)
    return df


@udf(schema_1)
def my_udf(scheme):
  data = getdata(scheme)
  List_of_rows = data.collect()  
  return List_of_rows

def somefunction(schema_1):    
  pd_df = pd.DataFrame(['path'], columns = ["filename"])  
  return (spark.createDataFrame(pd_df)
          .withColumn('parsed', my_udf(schema_1))
         )

df_2 = somefunction(schema_1)

display(df_2)

However, I am getting an error:

Error rendering output: Report error. PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

and I also think it is not the best approach. Any ideas?
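The PicklingError is SPARK-5063 in action: the UDF body calls getdata(), which uses spark.createDataFrame() and .collect(), and a SparkSession/SparkContext cannot be referenced inside code that runs on the executors. A hedged sketch of a driver-side alternative (the crossJoin is illustrative only, since the original post gives no key linking filenames to parsed rows):

from pyspark.sql import functions as F

# Build the parsed structs on the driver, where the SparkSession is available
parsed_df = getdata(schema_1).select(
    F.struct('path_name', 'age1', 'age2', 'age3').alias('parsed'))

files_df = spark.createDataFrame(pd.DataFrame(['path'], columns=['filename']))

# crossJoin pairs every filename with every parsed row; with a real key
# linking filenames to parsed rows, an ordinary join would be used instead
df_2 = files_df.crossJoin(parsed_df)
df_2.printSchema()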

Can you not just create a custom schema? Google for that. Also, see the sample code below, which will create a (forced) custom schema for you, even if the first rows (headers) are missing or incorrect.

from pyspark.sql.functions import input_file_name
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)

customSchema = StructType([
    StructField("asset_id", StringType(), True),
    StructField("price_date", StringType(), True),
    # etc., etc., etc.
    StructField("close_price", StringType(), True),
    StructField("filename", StringType(), True)])
