
NULL values when trying to import CSV in Azure Databricks DBFS

I'm trying to import a CSV file that has no headers into DBFS for Azure Databricks, but no matter whether I use the UI or do it in code, the output shows null values for all four columns.

Here's the code I ran:

from pyspark.sql.types import *

# File location and type
file_location = "/FileStore/tables/sales.csv"
file_type = "csv"

# Options

delimiter = ","

customSchema = StructType([\
    StructField("id", StringType(), True),\
    StructField("company", IntegerType(), True),\
    StructField("date", TimestampType(), True),\
    StructField("price", DoubleType(), True)])

# Dataframe from CSV
df = spark.read.format(file_type) \
    .schema(customSchema) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)

And the output I get is:

[Databricks output: all four columns show null values]

What is happening here? If I don't define any schema, it loads the data perfectly, but then I have no way to specify either the headers or the data types.

I uploaded a sample csv file to test your script; its content is as below.

1,Company-A,2019-09-30,10.01
2,Company-B,2019-09-29,20.02

Then I reproduced your issue, as shown in the figure below, and I think it was caused by the incorrect type for the struct field company.

[screenshot: reproducing the issue — all values come back null]

So I used StringType instead of IntegerType for the field company, and then it works fine, as shown in the figure below.

[screenshot: DataFrame loads correctly once company is a StringType]
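For reference, here is a minimal sketch of the corrected read, assuming the same file path and sample data as above (spark and display are the Databricks notebook globals); the only change from your code is that company becomes a StringType.

from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

file_location = "/FileStore/tables/sales.csv"

# company holds text such as "Company-A", so it must be a StringType;
# declaring it as IntegerType is why the rows came back null.
correctedSchema = StructType([
    StructField("id", StringType(), True),
    StructField("company", StringType(), True),
    StructField("date", TimestampType(), True),
    StructField("price", DoubleType(), True)])

df = spark.read.format("csv") \
    .schema(correctedSchema) \
    .option("sep", ",") \
    .load(file_location)

display(df)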

Meanwhile, if the value of the date field is just a date, you can use DateType instead of TimestampType, with the result shown in the figure below.

[screenshot: DataFrame loads with the date column typed as DateType]
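A sketch of that variant, again assuming the same sample file; only the date field changes from TimestampType to DateType.

from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

# Same as above, but date is a DateType because the sample values
# ("2019-09-30") carry no time component.
dateSchema = StructType([
    StructField("id", StringType(), True),
    StructField("company", StringType(), True),
    StructField("date", DateType(), True),
    StructField("price", DoubleType(), True)])

df = spark.read.format("csv") \
    .schema(dateSchema) \
    .option("sep", ",") \
    .load("/FileStore/tables/sales.csv")

display(df)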

By the way, there are two other solutions for your needs.

  1. Use the spark.read.csv function with the schema parameter to read a headerless csv file, as in the code and figure below.

     df = spark.read.csv(file_location, schema='id INT, company STRING, date DATE, price DOUBLE')
     display(df)

    [screenshot: DataFrame loaded via spark.read.csv with the DDL schema string]

  2. Use the pandas package to read the csv file from the dbfs file path on Azure Databricks first, and then create a Spark DataFrame from the pandas dataframe, as in the code and figure below.

     import pandas as pd

     df_pandas = pd.read_csv('/dbfs/FileStore/tables/sales.csv', header=None,
                             names=['id', 'company', 'date', 'price'])
     df = spark.createDataFrame(df_pandas)
     display(df)

    [screenshot: Spark DataFrame created from the pandas DataFrame]
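     One note on this route: pandas reads the date column as plain strings by default, so the resulting Spark column is a string as well. Below is a small sketch, assuming the same file and column names, that asks pandas to parse the dates so the Spark DataFrame ends up with a timestamp column.

     import pandas as pd

     # parse_dates turns the date column into datetime64 values, which
     # spark.createDataFrame maps to a Spark timestamp column.
     df_pandas = pd.read_csv('/dbfs/FileStore/tables/sales.csv',
                             header=None,
                             names=['id', 'company', 'date', 'price'],
                             parse_dates=['date'])
     df = spark.createDataFrame(df_pandas)
     display(df)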
