
Spark dataframe with strange values after reading CSV

Coming from here, I'm trying to read the correct values from this dataset in PySpark. I made good progress using df = spark.read.csv("hashtag_donaldtrump.csv", header=True, multiLine=True), but now I have some weird values in some cells, as you can see in this picture (last lines):

[screenshot of the dataframe showing malformed values in the last rows]

Do you know how I could get rid of them? Alternatively, how could I read and fix the CSV formatting with another program? It's very hard for me to open it in a text editor like Vim or Nano and guess where the errors are. Thank you!

Spark seems to have difficulty in reading this line:

2020-10-15 00:00:23,1.3165293165079306e+18,"""IS THIS WRONG??!!"" ...

because there are three double quotes. However, pandas seems to handle this well, so as a workaround you can use pandas to read the CSV file first and then convert the result to a Spark dataframe. Normally this is not recommended because of the large overhead involved, but for this small CSV file the performance hit should be acceptable.

import pandas as pd

df = spark.createDataFrame(pd.read_csv('hashtag_donaldtrump.csv').replace({float('nan'): None}))

The replace call swaps NaN for None in the pandas dataframe. Spark treats NaN as a float, and it gets confused when a NaN appears in string-typed columns.
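To illustrate what goes wrong without the replace, here is a minimal sketch (hypothetical data, assuming an active SparkSession named spark; not part of the original answer):

import pandas as pd

# A float NaN sitting inside an otherwise string-valued column is what trips up
# createDataFrame: Spark cannot merge StringType and DoubleType when inferring the schema.
pdf = pd.DataFrame({"tweet": ["IS THIS WRONG??!!", float("nan")]})

# spark.createDataFrame(pdf)  # would typically raise a type-merge error
sdf = spark.createDataFrame(pdf.replace({float("nan"): None}))
sdf.printSchema()  # tweet: string (nullable = true)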

If the file is too large for pandas, then you can consider dropping those rows that Spark cannot parse using mode='DROPMALFORMED' :

df = spark.read.csv('hashtag_donaldtrump.csv', header=True, multiLine=True, mode='DROPMALFORMED')
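If you go the DROPMALFORMED route, it can be worth checking how many rows get discarded. A rough sketch (my own suggestion, not from the original answer; depending on the Spark version, column pruning can let a bare count() skip full parsing, so treat the number as approximate):

# Compare a permissive read (malformed rows kept, with unparsable fields set to null)
# against the DROPMALFORMED read to estimate how many rows were lost.
df_all = spark.read.csv('hashtag_donaldtrump.csv', header=True, multiLine=True, mode='PERMISSIVE')
df_ok = spark.read.csv('hashtag_donaldtrump.csv', header=True, multiLine=True, mode='DROPMALFORMED')

print('rows dropped:', df_all.count() - df_ok.count())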
