Creating a Pandas DataFrame from an HDFS file in .csv format

I'm trying to create a Spark workflow that fetches .csv data from a Hadoop cluster and puts it into a Pandas DataFrame. I'm able to pull the data from HDFS into an RDD, but I'm unable to turn it into a Pandas DataFrame. The following is my code:

import pandas as pd
import numpy as nm

A = sc.textFile("hdfs://localhost:9000/sales_ord_univ.csv")  # this creates the RDD
B = pd.DataFrame(A)  # this raises pandas.core.common.PandasError: DataFrame constructor not properly called!

I'm pretty sure this error occurs because the RDD is one big single list. Hence I tried splitting the data by ';' (i.e. each new row is a different string), but that didn't seem to help either.

My overall goal is to use Pandas to convert the CSV into JSON and output it into MongoDB. I have already done this project using DictReader and PySpark SQL, but wanted to check whether it is possible using Pandas.

Any help would be appreciated. Thanks!

I would recommend loading the CSV into a Spark DataFrame and then converting it to a Pandas DataFrame.

csvDf = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("mode", "DROPMALFORMED").load("hdfs://localhost:9000/sales_ord_univ.csv")
B = csvDf.toPandas()

If you are still using a Spark version < 2.0, you have to use read.format("com.databricks.spark.csv") and include the com.databricks.spark.csv package (e.g. with the --packages parameter when using the pyspark shell).
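
If adding the package is not an option, another route is to keep the RDD from the question and parse its lines on the driver. An RDD is a distributed collection rather than a list, which is why the DataFrame constructor rejects it, while A.collect() returns a plain Python list of line strings. The following is a minimal sketch, assuming the file is small enough to fit in driver memory and that no quoted field spans multiple lines. The helper name lines_to_dataframe, the ';' delimiter, and the sample rows are made up for illustration.

```python
import io

import pandas as pd


def lines_to_dataframe(lines, sep=","):
    """Parse CSV text lines (header line first) into a pandas DataFrame."""
    # Re-join the lines so that pandas' own CSV parser handles
    # quoting and type inference instead of a manual split().
    return pd.read_csv(io.StringIO("\n".join(lines)), sep=sep)


# Hypothetical stand-in for A.collect(); column names and values are invented.
collected_lines = ["order_id;amount", "1;9.5", "2;12.0"]
B = lines_to_dataframe(collected_lines, sep=";")
print(B)
```

With a live SparkContext, collected_lines would be replaced by A.collect(). For large files this pulls everything onto one machine, so the toPandas() route above has the same memory limit.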

You need the hdfs Python package (version 2.0.16):

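import pandas
from hdfs import Config

# 'zzod' is a client alias defined in the hdfs config file; refer to the docs to set up the config
zzodClient = Config().get_client('zzod')
with zzodClient.read(q2Path) as r2Reader:  # q2Path is the HDFS path of the .csv file
    r2 = pandas.read_csv(r2Reader)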
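
This works because pandas.read_csv accepts any file-like object, not only a path. The sketch below illustrates that with an in-memory binary stream standing in for the HDFS reader, and then converts the frame into JSON-style records, which is the shape a MongoDB insert would expect. The function name csv_stream_to_records and the sample data are hypothetical, and the MongoDB write itself is not shown.

```python
import io
import json

import pandas as pd


def csv_stream_to_records(stream, sep=","):
    """Read CSV from a file-like object and return a list of JSON-style dicts."""
    frame = pd.read_csv(stream, sep=sep)
    # Round-trip through to_json so values come back as plain Python
    # types (int, float, str, None) rather than NumPy scalars.
    return json.loads(frame.to_json(orient="records"))


# Hypothetical stand-in for the HDFS reader: an in-memory binary stream.
fake_reader = io.BytesIO(b"order_id,customer\n1,alice\n2,bob\n")
records = csv_stream_to_records(fake_reader)
print(records)
```

Each dict in records could then be handed to a MongoDB driver, for example pymongo's insert_many on a collection.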
