I am trying to compare two pandas DataFrames, but I get the error: 'DataFrame' object has no attribute 'withColumn'. What could be the issue?
import pandas as pd
import pyspark.sql.functions as F
pd_df=pd.DataFrame(df.dtypes,columns=['column','data_type'])
pd_df1=pd.DataFrame(df1.dtypes,columns=['column','data_type'])
pd.merge(pd_df,pd_df1, on='column', how='outer'
).withColumn(
"result",
F.when(F.col("data_type_x") == 'NaN','new attribute'.otherwise('old attribute')))
.select(
"column",
"data_type_x",
"data_type_y",
"result"
)
Here df and df1 are existing data frames.
I figured it out. Thanks for the help.
def res(df):
    # df is a single row of the merged frame (apply is called with axis=1)
    if df['data_type_x'] == df['data_type_y']:
        return 'no change'
    elif pd.isnull(df['data_type_x']):
        return 'new attribute'
    elif pd.isnull(df['data_type_y']):
        return 'deleted attribute'
    elif df['data_type_x'] != df['data_type_y'] and not pd.isnull(df['data_type_x']) and not pd.isnull(df['data_type_y']):
        return 'datatype change'
pd_merge['result'] = pd_merge.apply(res, axis = 1)
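For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The two schema lists are made-up stand-ins for df.dtypes and df1.dtypes, and pd_merge is assumed to be the outer merge from the question; the null checks are moved first so that missing values are handled before the equality test.

```python
import pandas as pd

# Hypothetical (column, data_type) pairs standing in for df.dtypes and df1.dtypes.
old_schema = [("id", "int"), ("name", "string"), ("price", "double")]
new_schema = [("id", "bigint"), ("name", "string"), ("qty", "int")]

pd_df = pd.DataFrame(old_schema, columns=["column", "data_type"])
pd_df1 = pd.DataFrame(new_schema, columns=["column", "data_type"])

# The outer merge keeps columns that exist on only one side; the missing
# side shows up as NaN in data_type_x or data_type_y.
pd_merge = pd.merge(pd_df, pd_df1, on="column", how="outer")


def res(row):
    # row is one row of the merged frame
    if pd.isnull(row["data_type_x"]):
        return "new attribute"
    if pd.isnull(row["data_type_y"]):
        return "deleted attribute"
    if row["data_type_x"] == row["data_type_y"]:
        return "no change"
    return "datatype change"


pd_merge["result"] = pd_merge.apply(res, axis=1)

# Collect the outcome per column name for easy inspection.
result_by_column = dict(zip(pd_merge["column"], pd_merge["result"]))
print(result_by_column)
```

With this sample data, "price" exists only on the left side and "qty" only on the right, so they exercise the two null branches.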
This happens because you are creating pandas DataFrames, not Spark DataFrames. To join pandas DataFrames, you would use:
DataFrame_output = DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
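If you want to stay in pandas, something along these lines may work as a rough counterpart of the withColumn / when / otherwise chain from the question. It is only a sketch: the sample frames are invented, assign plays the role of withColumn, and numpy.where stands in for when/otherwise. Note that missing values are tested with isna() rather than by comparing to the string 'NaN'.

```python
import numpy as np
import pandas as pd

# Invented sample data in the same (column, data_type) shape as the question.
pd_df = pd.DataFrame({"column": ["id", "name"], "data_type": ["int", "string"]})
pd_df1 = pd.DataFrame({"column": ["name", "qty"], "data_type": ["string", "int"]})

merged = pd.merge(pd_df, pd_df1, on="column", how="outer")

# assign adds a column (like withColumn); np.where picks between two values
# per row (like when(...).otherwise(...)).
merged = merged.assign(
    result=np.where(merged["data_type_x"].isna(), "new attribute", "old attribute")
)

# Column selection with a list of names replaces Spark's select(...).
selected = merged[["column", "data_type_x", "data_type_y", "result"]]
print(selected)
```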
Run this to check which kind of DataFrame you have:
type(df)
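A slightly more explicit version of that check, as a small sketch with a made-up pandas frame: isinstance tells you whether the object is a pandas DataFrame, and hasattr shows directly whether withColumn is available on it.

```python
import pandas as pd

# Made-up pandas frame for illustration.
pd_df = pd.DataFrame({"column": ["id"], "data_type": ["int"]})

is_pandas = isinstance(pd_df, pd.DataFrame)
has_with_column = hasattr(pd_df, "withColumn")  # withColumn is a Spark method

print(type(pd_df), is_pandas, has_with_column)
```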
To use withColumn, you would need Spark DataFrames. If you want to convert the DataFrames, use this:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()
df = spark.createDataFrame(pd_df1)
You mixed up pandas DataFrames and Spark DataFrames.
The issue is that a pandas DataFrame doesn't have the Spark method withColumn.