
Import CSV to pyspark dataframe

I am new to PySpark and I am trying to load a CSV file that looks like this:

My CSV file:

   article_id   title                                  short_desc                                           
    33          novel findings support original        asco-cap guidelines support categorization of her2 by fish status used in bcirg clinical trials  

My code to read the CSV:

from pyspark.sql import SparkSession

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType


spark = SparkSession.builder.appName('Basics').getOrCreate()
schema = StructType([
    StructField("article_id", IntegerType()),
    StructField("title", StringType()),
    StructField("short_desc", StringType()),
    StructField("article_desc", StringType())
])

peopleDF = spark.read.csv('temp.csv', header=True, schema=schema)

peopleDF.show(6)

After the code change:

Why are null values being added?

Here is a dataset sample so that the same problem can be reproduced:

DataSet Sample

The Excel sheet you are trying to read contains merged cells.

Spark does not read them as merged cells; it reads each physical line separately. In your case, the column 'article_desc' spans five such cells vertically, and on those extra lines the remaining columns are empty. That is why you get the null values.
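
Here is a minimal sketch (with hypothetical file contents and column values) of how that shows up: when the description spills onto continuation lines that carry no values for the other columns, each line becomes its own row and the other columns come back as null.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('Basics').getOrCreate()

# A tiny CSV where 'article_desc' spills over several physical lines;
# the continuation lines have no values for the other columns.
sample = (
    "article_id,title,short_desc,article_desc\n"
    "33,novel findings,asco-cap guidelines,first line of description\n"
    ",,,second line of description\n"
    ",,,third line of description\n"
)
with open('temp_demo.csv', 'w') as f:
    f.write(sample)

schema = StructType([
    StructField("article_id", IntegerType()),
    StructField("title", StringType()),
    StructField("short_desc", StringType()),
    StructField("article_desc", StringType())
])

# Each physical line becomes its own row, so the continuation lines
# show up with null article_id, title and short_desc.
spark.read.csv('temp_demo.csv', header=True, schema=schema).show(truncate=False)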

If you get all the content into a single cell, you will be able to read it without the null values.
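
Alternatively, if the exported CSV keeps the whole description in one quoted field (with embedded newlines), Spark can read it back as a single row when multiLine is enabled. This is only a sketch: it assumes the multi-line field is wrapped in double quotes in your file, and it reuses the schema defined in the question; adjust quote/escape to match your export.

# Read a CSV whose quoted fields may contain embedded newlines.
peopleDF = spark.read.csv(
    'temp.csv',
    header=True,
    schema=schema,      # schema as defined in the question above
    multiLine=True,     # allow newlines inside quoted fields
    quote='"',
    escape='"'
)
peopleDF.show(6, truncate=False)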
