I have a function which generates a dataframe:
def getdata():
    schema_1 = StructType([StructField('path_name', StringType(), True),
                           StructField('age1', IntegerType(), True),
                           StructField('age2', IntegerType(), True),
                           StructField('age3', IntegerType(), True)])
    data = [('dbfs/123/sd.zip', 1, 2, 3), ('dbfs/123/ab.zip', 5, 6, 7)]
    df = spark.createDataFrame(data, schema_1)
    return df
I need to insert that dataframe into a struct column of another dataframe. The resulting schema should be something like:
root
 |-- filename: string (nullable = true)
 |-- parsed: struct (nullable = true)
 |    |-- path_name: string (nullable = true)
 |    |-- age1: integer (nullable = true)
 |    |-- age2: integer (nullable = true)
 |    |-- age3: integer (nullable = true)
I was trying to do it using a UDF:

@udf(schema_1)
def my_udf(schema):
    data = getdata(schema)
    list_of_rows = data.collect()
    return list_of_rows
and then inserting it into another dataframe I am creating. The whole code I am using is:
from pyspark.sql.types import *
from pyspark.sql.functions import col, udf
import pandas as pd

schema_1 = StructType([StructField('path_name', StringType(), True),
                       StructField('age1', IntegerType(), True),
                       StructField('age2', IntegerType(), True),
                       StructField('age3', IntegerType(), True)])

def getdata(schema_1):
    data = [('dbfs/123/sd.zip', 1, 2, 3), ('dbfs/123/ab.zip', 5, 6, 7)]
    df = spark.createDataFrame(data, schema_1)
    return df

@udf(schema_1)
def my_udf(scheme):
    data = getdata(scheme)
    list_of_rows = data.collect()
    return list_of_rows

def somefunction(schema_1):
    pd_df = pd.DataFrame(['path'], columns=["filename"])
    return (spark.createDataFrame(pd_df)
            .withColumn('parsed', my_udf(schema_1)))

df_2 = somefunction(schema_1)
display(df_2)
However, I am getting an error:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

and I also think it is not the best approach. Any ideas?
Can you not just create a custom schema? See the sample code below, which will create a (forced) custom schema for you, even if the first rows (headers) are missing or incorrect.
from pyspark.sql.functions import input_file_name
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)

customSchema = StructType([
    StructField("asset_id", StringType(), True),
    StructField("price_date", StringType(), True),
    # etc., etc., etc.
    StructField("close_price", StringType(), True),
    StructField("filename", StringType(), True)])