
How to merge pyspark and pandas dataframes

I have a very large pyspark dataframe and a smaller pandas dataframe which I read in as follows:

df1 = spark.read.csv("/user/me/data1/")
df2 = pd.read_csv("data2.csv")

Both dataframes include columns labelled "A" and "B". I would like to create another PySpark dataframe containing only those rows of df1 whose entries in columns "A" and "B" also occur in the columns of the same name in df2. That is, I want to filter df1 using columns "A" and "B" of df2.

Normally I would think of this as a join (implemented with merge in pandas), but how do you join a pandas dataframe with a PySpark one?

I can't afford to convert df1 to a pandas dataframe.

You can either pass the schema while converting the pandas dataframe to a PySpark dataframe, like this:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df = spark.createDataFrame(pandas_dataframe, schema)
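For the question's df2, a schema matching its "A" and "B" columns would look something like the sketch below (this assumes both columns hold strings; substitute the types your CSV actually contains):

from pyspark.sql.types import StructType, StructField, StringType

# Assumption: "A" and "B" are string columns; change the types if they are numeric.
df2_schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", StringType(), True)
])
df3 = spark.createDataFrame(df2[["A", "B"]], df2_schema)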

Or you can use the hack I have used in this function:

import numpy as np
import pandas as pd


def create_spark_dataframe(file_name):
    """
    Return a Spark dataframe built from the pandas dataframe read from file_name.
    """
    pandas_data_frame = pd.read_csv(file_name)
    # Replace NaN with '' in non-numeric columns so that Spark can infer the schema
    # without choking on mixed NaN/string values.
    for col in pandas_data_frame.columns:
        if ((pandas_data_frame[col].dtypes != np.int64) & (pandas_data_frame[col].dtypes != np.float64)):
            pandas_data_frame[col] = pandas_data_frame[col].fillna('')

    spark_data_frame = spark.createDataFrame(pandas_data_frame)
    return spark_data_frame
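For example, the smaller file from the question could be converted like this (a minimal usage sketch, assuming spark is the active SparkSession):

# Build a Spark dataframe from data2.csv, with NaNs in text columns replaced by ''.
df3 = create_spark_dataframe("data2.csv")
df3.printSchema()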

You can use this code snippet to do the join:

df1 = spark.read.csv("/user/me/data1/")
# keep_default_na=False keeps empty cells as '' instead of NaN, so Spark can infer the schema.
df2 = pd.read_csv("data2.csv", keep_default_na=False)
df3 = spark.createDataFrame(df2)
df = df1.join(df3, ["A", "B"])
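If the goal is only to filter df1 (keeping just df1's columns, and without multiplying rows when df2 contains repeated "A"/"B" pairs), a left semi join is a natural fit; here is a small sketch of that variant:

# "leftsemi" keeps the rows of df1 whose ("A", "B") pair also appears in df3,
# and the result contains only df1's columns.
df_filtered = df1.join(df3, ["A", "B"], "leftsemi")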
