Accessing a CSV file placed in HDFS using Spark
I have placed a CSV file into the HDFS filesystem using the `hadoop fs -put` command. I now need to access the CSV file using pyspark. Its format is something like
`plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')`
I am a newbie to HDFS. How do I find the address to be placed in `hdfs://x.x.x.x`?
Here's the output when I entered:
hduser@remus:~$ hdfs dfs -ls /input
Found 1 items
-rw-r--r-- 1 hduser supergroup 158 2015-06-12 14:13 /input/test.csv
Any help is appreciated.
You need to provide the full path of your files in HDFS, and the URL is defined in your Hadoop configuration, in core-site.xml or hdfs-site.xml. Check your core-site.xml and hdfs-site.xml to get the details about the URL.

An easy way to find the URL is to access your HDFS from the browser and take the path from there.

If you are referring to a file on the local file system, use `file:///<your path>` instead.
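For reference, the URL comes from the `fs.defaultFS` property (called `fs.default.name` in older Hadoop versions) in core-site.xml. A typical entry looks like this; the hostname and port below are examples, not values from the question:

```xml
<!-- core-site.xml: the NameNode URL that clients use as the default filesystem -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```

You can also print it from the command line with `hdfs getconf -confKey fs.defaultFS`.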
Try specifying the absolute path without `hdfs://`:

plaintext_rdd = sc.textFile('/input/test.csv')

When Spark runs on the same cluster as HDFS, it uses hdfs:// as the default FS.
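Putting the two answers together: the full URL is just the `fs.defaultFS` value from core-site.xml joined with the absolute path shown by `hdfs dfs -ls`. A minimal sketch, assuming a hypothetical helper name and an example NameNode address:

```python
def hdfs_url(default_fs: str, path: str) -> str:
    """Join the NameNode URL (fs.defaultFS from core-site.xml)
    with an absolute HDFS path, e.g. the one shown by `hdfs dfs -ls`."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute, e.g. /input/test.csv")
    return default_fs.rstrip("/") + path

# Example NameNode address; replace with your cluster's fs.defaultFS value.
url = hdfs_url("hdfs://localhost:9000", "/input/test.csv")
print(url)  # hdfs://localhost:9000/input/test.csv
# Then: plaintext_rdd = sc.textFile(url)
```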
Start the spark shell or spark-submit by pointing to a package which can read CSV files, like below:
spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
And in the Spark code, you can read the CSV file as below:
val data_df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.schema(<pass schema if required>)
.load(<location in HDFS/S3>)
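Since the question is about pyspark, the Scala snippet above translates roughly as follows. This is a sketch, not tested against a cluster: it assumes the same spark-csv package was loaded via `--packages`, and the URL is an example placeholder to replace with your own NameNode address and path:

```
# PySpark equivalent of the Scala snippet above; requires a running
# Spark cluster and the com.databricks:spark-csv package on the classpath.
data_df = (sqlContext.read
           .format("com.databricks.spark.csv")
           .option("header", "true")
           .load("hdfs://localhost:9000/input/test.csv"))  # example URL
```

On Spark 2.x and later, `spark.read.csv(...)` is built in and no external package is needed.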