Reading data from HDFS on a cluster

Question

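I am trying to read data from HDFS on an AWS EC2 cluster using a Jupyter Notebook. The cluster has 7 nodes. I am using HDP 2.4, and my code is below. The table has millions of rows, but the code does not return any rows. "ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com" is the server (the Ambari server).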

from pyspark.sql import HiveContext   # HiveContext is the class used below
# sc is the SparkContext already provided by the notebook
sqlContext = HiveContext(sc)
demography = sqlContext.read.load("hdfs://ec2-xx-xx-xxx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv", format="com.databricks.spark.csv", header="true", inferSchema="true")
demography.printSchema()
demography.cache()
print(demography.count())

With sc.textFile, however, I get the correct row count:

data = sc.textFile("hdfs://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv")
schema = data.map(lambda x: x.split(",")).first()  # get schema (column names)
header = data.first()                              # extract header
data = data.filter(lambda x: x != header)          # filter out header
data = data.map(lambda x: x.split(","))
data.count()
3641865
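
One caveat about the workaround above: split(",") also splits on commas inside quoted fields, so the row count stays correct but the columns can shift. A minimal sketch of a quote-aware alternative follows, assuming every record fits on one line. The helper name parse_csv_line and the sample line are made up for illustration and are not taken from the FAERS file.

```python
import csv

def parse_csv_line(line):
    """Split one CSV line into fields, honouring double-quoted fields."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    # Caveat: on Python 2 the csv module works on byte strings, so lines with
    # non-ASCII characters may need to be encoded first.
    return next(csv.reader([line]))

sample = '1001,"Smith, John",US'       # hypothetical record, for illustration
fields = parse_csv_line(sample)        # quoted comma stays inside one field
naive_fields = sample.split(",")       # quoted comma produces an extra field
```

In the snippet from the question, this would be used as data.map(parse_csv_line) in place of data.map(lambda x: x.split(",")).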

Answer 1

The answer Indrajit gave here solved my problem. The problem was with the spark-csv jar.
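
The answer does not say what exactly was wrong with the jar. One possibility is that the notebook kernel never loaded spark-csv, or loaded a build that did not match the cluster's Scala version. The sketch below shows one way to request the package from inside a notebook. It assumes a Scala 2.10 build of Spark, and the version 1.4.0 is only an example. The helper build_submit_args is hypothetical.

```python
import os

# Assumption: the cluster's Spark is built against Scala 2.10, hence the
# "_2.10" artifact; the version number 1.4.0 is only an example.
SPARK_CSV_PACKAGE = "com.databricks:spark-csv_2.10:1.4.0"

def build_submit_args(package):
    """Return a PYSPARK_SUBMIT_ARGS value that pulls in one extra package."""
    # The trailing "pyspark-shell" token is expected when PySpark is started
    # from a notebook rather than through spark-submit.
    return "--packages {0} pyspark-shell".format(package)

# This must be set before the SparkContext is created; it has no effect on
# a context that is already running.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(SPARK_CSV_PACKAGE)
```

If the kernel creates sc at startup, the variable has to be set in the kernel's environment instead, for example in its kernel.json.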

Accepted 2016-08-08 18:59:32
