
Spark dataframe with strange values after reading CSV

Coming from here, I'm trying to read the correct values from this dataset in PySpark. I made good progress using df = spark.read.csv("hashtag_donaldtrump.csv", header=True, multiLine=True), but now I have some weird values in some cells, as you can see in this picture (last lines):

[screenshot of the dataframe showing malformed values in the last rows]

Do you know how I could get rid of them? Alternatively, how could I read and fix the CSV formatting with another program? It's very hard for me to open it in a text editor like Vim or Nano and guess where the errors are. Thank you!

Spark seems to have difficulty in reading this line:

2020-10-15 00:00:23,1.3165293165079306e+18,"""IS THIS WRONG??!!"" ...

because there are three double quotes. However, pandas seems to handle this well, so as a workaround you can use pandas to read the CSV file first and then convert the result to a Spark dataframe. Normally this is not recommended because of the large overhead involved, but for this small CSV file the performance hit should be acceptable.

import pandas as pd

df = spark.createDataFrame(pd.read_csv('hashtag_donaldtrump.csv').replace({float('nan'): None}))

The replace call swaps NaN for None in the pandas dataframe. Spark treats NaN as a float, and it gets confused when a NaN appears in string-typed columns.
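To illustrate what goes wrong without the replace, here is a minimal sketch (hypothetical data, assuming an active SparkSession named spark; not part of the original answer):

import pandas as pd

# A float NaN sitting inside an otherwise string-valued column is what trips up
# createDataFrame: Spark cannot merge StringType and DoubleType when inferring the schema.
pdf = pd.DataFrame({"tweet": ["IS THIS WRONG??!!", float("nan")]})

# spark.createDataFrame(pdf)  # would typically raise a type-merge error
sdf = spark.createDataFrame(pdf.replace({float("nan"): None}))
sdf.printSchema()  # tweet: string (nullable = true)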

If the file is too large for pandas, then you can consider dropping those rows that Spark cannot parse using mode='DROPMALFORMED' :

df = spark.read.csv('hashtag_donaldtrump.csv', header=True, multiLine=True, mode='DROPMALFORMED')
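If you go the DROPMALFORMED route, it can be worth checking how many rows get discarded. A rough sketch (my own suggestion, not from the original answer; depending on the Spark version, column pruning can let a bare count() skip full parsing, so treat the number as approximate):

# Compare a permissive read (malformed rows kept, with unparsable fields set to null)
# against the DROPMALFORMED read to estimate how many rows were lost.
df_all = spark.read.csv('hashtag_donaldtrump.csv', header=True, multiLine=True, mode='PERMISSIVE')
df_ok = spark.read.csv('hashtag_donaldtrump.csv', header=True, multiLine=True, mode='DROPMALFORMED')

print('rows dropped:', df_all.count() - df_ok.count())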
