RDD-based API code to read a CSV file and convert it into tuples:
# load data
movie_rating = sc.textFile('140419_Movie_Rating.csv')
# preprocess data -- only need ["userId", "movieId", "rating"]
header = movie_rating.take(1)[0]
rating_data = movie_rating \
    .filter(lambda line: line != header) \
    .map(lambda line: line.split(",")) \
    .map(lambda tokens: (int(tokens[0]), int(tokens[1]), int(tokens[2]))) \
    .cache()
# check three rows
rating_data.take(3)
Output:
[(6156680, 433441, 2), (6156680, 433400, 1), (6156680, 433400, 1)]
Basically, I am reading a CSV file with the RDD-based API (as used with pyspark.mllib): I load the data using sc.textFile and convert it to the form (user_id, video_id, rating).
Now I need to do the same operation using the DataFrame-based API. How can this be achieved?
The Spark DataFrame API supports reading CSV files with a configurable separator.
Let's create our csv file:
import pandas as pd
pd.DataFrame([(6156680, 433441, 2), (6156680, 433400, 1), (6156680, 433400, 1)], columns=['user_id', 'video_id', 'rating']) \
.to_csv('140419_Movie_Rating.csv', index=False)
Now we can read the file with its header row; the default separator is ',':
df = spark.read.csv('140419_Movie_Rating.csv', header=True, inferSchema=True)
df.show()
df.printSchema()
+-------+--------+------+
|user_id|video_id|rating|
+-------+--------+------+
|6156680|  433441|     2|
|6156680|  433400|     1|
|6156680|  433400|     1|
+-------+--------+------+
root
|-- user_id: integer (nullable = true)
|-- video_id: integer (nullable = true)
|-- rating: integer (nullable = true)
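Schema inference needs an extra pass over the file, so when the column types are already known the schema can be passed explicitly instead. The following is only a sketch of that idea: it assumes an active SparkSession is passed in as `spark`, the helper names `ddl_schema` and `read_ratings` are made up for illustration, and the column names are the ones from the example file above.

```python
# Hypothetical helpers; column names follow the example file above.
RATING_COLUMNS = ("user_id", "video_id", "rating")


def ddl_schema(columns, sql_type="INT"):
    """Build a DDL-formatted schema string, e.g. 'user_id INT, video_id INT'."""
    return ", ".join(f"{name} {sql_type}" for name in columns)


def read_ratings(spark, path, columns=RATING_COLUMNS):
    """Read a CSV that has a header row, using an explicit all-integer schema.

    `spark` is assumed to be an active pyspark.sql.SparkSession; the schema
    is passed as a DDL string in place of inferSchema=True.
    """
    return spark.read.csv(path, header=True, schema=ddl_schema(columns))
```

With a session available, `read_ratings(spark, '140419_Movie_Rating.csv').printSchema()` should report the three columns as integers, much like the inferred schema above.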
Try this:
rating_data_df = spark.read.format('csv') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load('140419_Movie_Rating.csv')
rating_data_df.take(3)
In your case it should output something like this:
[Row(userId=6156680, movieId=433441, rating=2), Row(userId=6156680, movieId=433400, rating=1), Row(userId=6156680, movieId=433400, rating=1)]
You can read more about these generic functions here: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
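To get back plain (userId, movieId, rating) tuples like the RDD version returned, the Row objects can be converted directly, since a PySpark Row is a tuple subclass. This is a minimal sketch, assuming a DataFrame such as `rating_data_df` loaded as above; the function names `rows_to_tuples` and `take_rating_tuples` are hypothetical, and the default column names are taken from the comment in the question.

```python
def rows_to_tuples(rows):
    """Turn Row-like records into plain tuples (pyspark.sql.Row subclasses tuple)."""
    return [tuple(row) for row in rows]


def take_rating_tuples(df, n=3, columns=("userId", "movieId", "rating")):
    """Select the rating columns from a Spark DataFrame and return n rows as tuples.

    The default column names are an assumption; adjust them to the file's header.
    """
    return rows_to_tuples(df.select(*columns).take(n))
```

`take_rating_tuples(rating_data_df)` returns a local list. If a distributed collection is needed, as with the original `rating_data`, `rating_data_df.rdd.map(tuple)` gives an RDD of tuples instead.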