
Import CSV to pyspark dataframe

I am new to PySpark and I am trying to load a CSV file that looks like this:

My CSV file:

   article_id   title                                  short_desc                                           
    33          novel findings support original        asco-cap guidelines support categorization of her2 by fish status used in bcirg clinical trials  

My code to read the CSV:

from pyspark.sql import SparkSession

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType


spark = SparkSession.builder.appName('Basics').getOrCreate()
schema = StructType([
    StructField("article_id", IntegerType()),
    StructField("title", StringType()),
    StructField("short_desc", StringType()),
    StructField("article_desc", StringType())
])

peopleDF = spark.read.csv('temp.csv', header=True, schema=schema)

peopleDF.show(6)

After the code change:

Why are null values being added?

Here is a dataset sample so that the same problem can be reproduced:

DataSet Sample

The Excel sheet you are trying to read contains merged cells.

Spark does not read them as merged cells; it reads each physical line separately. In your case, the column 'article_desc' spans five such cells vertically, and on those extra lines the remaining columns are empty. That is why you get the null values.
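
Here is a minimal sketch (with hypothetical file contents and column values) of how that shows up: when the description spills onto continuation lines that carry no values for the other columns, each line becomes its own row and the other columns come back as null.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('Basics').getOrCreate()

# A tiny CSV where 'article_desc' spills over several physical lines;
# the continuation lines have no values for the other columns.
sample = (
    "article_id,title,short_desc,article_desc\n"
    "33,novel findings,asco-cap guidelines,first line of description\n"
    ",,,second line of description\n"
    ",,,third line of description\n"
)
with open('temp_demo.csv', 'w') as f:
    f.write(sample)

schema = StructType([
    StructField("article_id", IntegerType()),
    StructField("title", StringType()),
    StructField("short_desc", StringType()),
    StructField("article_desc", StringType())
])

# Each physical line becomes its own row, so the continuation lines
# show up with null article_id, title and short_desc.
spark.read.csv('temp_demo.csv', header=True, schema=schema).show(truncate=False)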

If you get all the content into a single cell, you will be able to read it without the null values.
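
Alternatively, if the exported CSV keeps the whole description in one quoted field (with embedded newlines), Spark can read it back as a single row when multiLine is enabled. This is only a sketch: it assumes the multi-line field is wrapped in double quotes in your file, and it reuses the schema defined in the question; adjust quote/escape to match your export.

# Read a CSV whose quoted fields may contain embedded newlines.
peopleDF = spark.read.csv(
    'temp.csv',
    header=True,
    schema=schema,      # schema as defined in the question above
    multiLine=True,     # allow newlines inside quoted fields
    quote='"',
    escape='"'
)
peopleDF.show(6, truncate=False)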
