
Convert RDD-based API code to DataFrame-based API code in PySpark

RDD-based API code to read a CSV file and convert it into tuples:

# load data
movie_rating = sc.textFile('140419_Movie_Rating.csv')
# preprocess data -- only need ["userId", "movieId", "rating"]
header = movie_rating.take(1)[0]
rating_data = movie_rating \
    .filter(lambda line: line!=header) \
    .map(lambda line: line.split(",")) \
    .map(lambda tokens: (int(tokens[0]), int(tokens[1]), int(tokens[2]))) \
    .cache()
# check three rows
rating_data.take(3)

output :

[(6156680, 433441, 2), (6156680, 433400, 1), (6156680, 433400, 1)]

Basically, I am reading a CSV file using the RDD-based API from pyspark.mllib, loading the data with sc.textFile and converting it to the form (user_id, video_id, rating).

Now, if I need to do the same operation using the DataFrame-based API, how can it be achieved?

The Spark DataFrame API supports reading CSV files with a configurable separator.

Let's create our CSV file:

import pandas as pd
pd.DataFrame([(6156680, 433441, 2), (6156680, 433400, 1), (6156680, 433400, 1)], columns=['user_id', 'video_id', 'rating']) \
    .to_csv('140419_Movie_Rating.csv', index=False)

Now we can read the file with a header; the default separator is ',':

df = spark.read.csv('140419_Movie_Rating.csv', header=True, inferSchema=True)
df.show()
df.printSchema()

        +-------+--------+------+
        |user_id|video_id|rating|
        +-------+--------+------+
        |6156680|  433441|     2|
        |6156680|  433400|     1|
        |6156680|  433400|     1|
        +-------+--------+------+

        root
         |-- user_id: integer (nullable = true)
         |-- video_id: integer (nullable = true)
         |-- rating: integer (nullable = true)

Try this:

rating_data_df = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .load('140419_Movie_Rating.csv')
rating_data_df.take(3)

In your case it should output something like this:

[Row(userId=6156680, movieId=433441, rating=2), Row(userId=6156680, movieId=433400, rating=1), Row(userId=6156680, movieId=433400, rating=1)]

You can read more about these generic functions here: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
