
Accessing a CSV file placed in HDFS using Spark

I have placed a CSV file into HDFS using the `hadoop fs -put` command. I now need to read that CSV file from PySpark. The call looks something like:

`plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')`

I am a newbie to HDFS. How do I find the address to put in `hdfs://x.x.x.x`?

Here's the output when I enter:

hduser@remus:~$ hdfs dfs -ls /input

Found 1 items
-rw-r--r--   1 hduser supergroup        158 2015-06-12 14:13 /input/test.csv

Any help is appreciated.

You need to provide the full path of your files in HDFS; the URL is defined in your Hadoop configuration, in core-site.xml or hdfs-site.xml.

Check your core-site.xml and hdfs-site.xml to get the details of the URL.

An easy way to find the URL is to open the HDFS web UI in your browser and copy the path from there.

If the file is on the local file system rather than HDFS, use file:///<your path>
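For reference, the URL in question comes from the fs.defaultFS property (fs.default.name on older Hadoop versions) in core-site.xml. The hostname and port below are placeholders, not values from this question:

```xml
<!-- core-site.xml: the scheme/host/port Spark needs for hdfs:// URIs -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>
</property>
```

With that value, the file from the question would be read as `sc.textFile('hdfs://namenode-host:8020/input/test.csv')`. Running `hdfs getconf -confKey fs.defaultFS` on the cluster prints the same value without opening the config file.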

Try specifying the absolute path without hdfs://:

plaintext_rdd = sc.textFile('/input/test.csv')

When Spark runs on the same cluster as HDFS, it uses hdfs:// as the default FS, so the scheme and authority can be omitted.
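To make the path handling concrete, here is a small hypothetical helper (not from the original answers) that joins an fs.defaultFS value with the absolute HDFS path from the question; the hostname and port are assumptions:

```python
def full_hdfs_uri(default_fs, path):
    """Join the fs.defaultFS value with an absolute HDFS path."""
    return default_fs.rstrip('/') + path

# The hostname/port are placeholders; use your cluster's fs.defaultFS value.
uri = full_hdfs_uri('hdfs://namenode-host:8020', '/input/test.csv')
print(uri)  # hdfs://namenode-host:8020/input/test.csv
```

Either the full URI or the bare `/input/test.csv` works when Spark's default FS is that HDFS.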

Start the spark-shell or spark-submit pointing to a package that can read CSV files, like below:

spark-shell  --packages com.databricks:spark-csv_2.11:1.2.0

And in the Spark code, you can read the CSV file as below:

val data_df = sqlContext.read.format("com.databricks.spark.csv")
              .option("header", "true")
              .schema(<pass schema if required>)
              .load(<location in HDFS/S3>)
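Since the question uses PySpark, the same idea can be sketched with `sc.textFile` plus Python's standard csv module for the per-line parsing. This is an assumption-laden sketch, not the original answer: the RDD lines are commented out because they need a live SparkContext, and the hostname/port in the path are placeholders:

```python
import csv
from io import StringIO

def parse_csv_line(line):
    # Parse one CSV line into a list of fields, handling quoted commas.
    return next(csv.reader(StringIO(line)))

# With a SparkContext `sc` (path host/port are placeholders for your cluster):
# plaintext_rdd = sc.textFile('hdfs://namenode-host:8020/input/test.csv')
# rows = plaintext_rdd.map(parse_csv_line)

print(parse_csv_line('id,"last, first",score'))  # ['id', 'last, first', 'score']
```

Parsing line-by-line this way breaks on records that span multiple lines, which is one reason the spark-csv package above is the more robust route.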
