I am new to PySpark and am trying to load a CSV file that looks like this:
My CSV file:
article_id title short_desc
33 novel findings support original asco-cap guidelines support categorization of her2 by fish status used in bcirg clinical trials
My code to read the CSV:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType
spark = SparkSession.builder.appName('Basics').getOrCreate()
schema = StructType([
    StructField("article_id", IntegerType()),
    StructField("title", StringType()),
    StructField("short_desc", StringType()),
    StructField("article_desc", StringType())
])
peopleDF = spark.read.csv('temp.csv', header=True, schema=schema)
peopleDF.show(6)
Why are null values being added?
A dataset sample so that you can reproduce the problem:
The Excel sheet you are trying to read contains merged cells.
Spark does not read them as merged cells; it splits them into separate lines. In your case, the 'article_desc' column consists of 5 such cells stacked vertically, and on those extra lines the rest of the columns are empty. That is why you get the null values.
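To illustrate the effect without Spark, here is a minimal sketch using only Python's standard library. The sample data is hypothetical (it is not the asker's file) and assumes the multi-line cell was exported as a quoted field with embedded line breaks. Splitting on physical lines produces extra, mostly empty rows, while quote-aware parsing keeps a single record:

```python
import csv
import io

# Hypothetical sample: one record whose last field spans three lines.
raw = (
    'article_id,title,short_desc,article_desc\n'
    '33,novel findings,support categorization,"first line\n'
    'second line\n'
    'third line"\n'
)

# Treat every physical line as its own record (skipping the header).
naive_rows = [line.split(',') for line in raw.splitlines()[1:]]

# Quote-aware parsing keeps the embedded line breaks inside one field.
parsed_rows = list(csv.reader(io.StringIO(raw)))[1:]

print(len(naive_rows), [len(r) for r in naive_rows])
print(len(parsed_rows), [len(r) for r in parsed_rows])
```

The continuation lines in `naive_rows` have only one field each, so a reader that expects four columns would have nothing to put in the others.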
If you put all of the content into a single cell, you will be able to read it without the null values.
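One possible way to do that outside of Excel is to rewrite the CSV so that line breaks inside a cell become spaces. This is only a sketch: the helper name `flatten_multiline_cells` and the sample data are made up for illustration, and it assumes the multi-line cells are quoted in the exported file.

```python
import csv
import io

def flatten_multiline_cells(src, dst):
    """Copy CSV rows from src to dst, collapsing line breaks inside cells."""
    writer = csv.writer(dst, lineterminator='\n')
    for row in csv.reader(src):
        # " ".join(cell.split()) replaces any run of whitespace,
        # including newlines, with a single space.
        writer.writerow([" ".join(cell.split()) for cell in row])

# Hypothetical in-memory sample standing in for temp.csv
src = io.StringIO('article_id,article_desc\n33,"first line\nsecond line"\n')
dst = io.StringIO()
flatten_multiline_cells(src, dst)
print(dst.getvalue())
```

With real files, open both with newline='' as the csv module documentation recommends, then point spark.read.csv at the rewritten file. Spark's CSV reader also has a multiLine option that may be worth trying if the cells are properly quoted.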