
How to merge pyspark and pandas dataframes

I have a very large pyspark dataframe and a smaller pandas dataframe which I read in as follows:

df1 = spark.read.csv("/user/me/data1/")
df2 = pd.read_csv("data2.csv")

Both dataframes include columns labelled "A" and "B". I would like to create another PySpark dataframe containing only those rows of df1 whose entries in columns "A" and "B" also occur in the columns of the same name in df2. That is, I want to filter df1 using columns "A" and "B" of df2.

Normally I would think of this as a join (implemented with merge in pandas), but how do you join a pandas dataframe with a PySpark one?

I can't afford to convert df1 to a pandas dataframe.

You can either pass the schema while converting the pandas dataframe to a PySpark dataframe, like this:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df = spark.createDataFrame(pandas_dataframe, schema)
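For the question's df2, a schema matching its "A" and "B" columns would look something like the sketch below (this assumes both columns hold strings; substitute the types your CSV actually contains):

from pyspark.sql.types import StructType, StructField, StringType

# Assumption: "A" and "B" are string columns; change the types if they are numeric.
df2_schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", StringType(), True)
])
df3 = spark.createDataFrame(df2[["A", "B"]], df2_schema)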

Or you can use the hack I have used in this function:

import numpy as np
import pandas as pd


def create_spark_dataframe(file_name):
    """
    Return a Spark dataframe built from the pandas dataframe read from file_name.
    """
    pandas_data_frame = pd.read_csv(file_name)
    # Replace NaN with '' in non-numeric columns so that Spark can infer the schema
    # without choking on mixed NaN/string values.
    for col in pandas_data_frame.columns:
        if ((pandas_data_frame[col].dtypes != np.int64) & (pandas_data_frame[col].dtypes != np.float64)):
            pandas_data_frame[col] = pandas_data_frame[col].fillna('')

    spark_data_frame = spark.createDataFrame(pandas_data_frame)
    return spark_data_frame
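For example, the smaller file from the question could be converted like this (a minimal usage sketch, assuming spark is the active SparkSession):

# Build a Spark dataframe from data2.csv, with NaNs in text columns replaced by ''.
df3 = create_spark_dataframe("data2.csv")
df3.printSchema()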

You can use this code snippet to do the join:

df1 = spark.read.csv("/user/me/data1/")
# keep_default_na=False keeps empty cells as '' instead of NaN, so Spark can infer the schema.
df2 = pd.read_csv("data2.csv", keep_default_na=False)
df3 = spark.createDataFrame(df2)
df = df1.join(df3, ["A", "B"])
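If the goal is only to filter df1 (keeping just df1's columns, and without multiplying rows when df2 contains repeated "A"/"B" pairs), a left semi join is a natural fit; here is a small sketch of that variant:

# "leftsemi" keeps the rows of df1 whose ("A", "B") pair also appears in df3,
# and the result contains only df1's columns.
df_filtered = df1.join(df3, ["A", "B"], "leftsemi")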
