Accessing a CSV file placed in HDFS using Spark
I have placed a CSV file into the HDFS filesystem using the `hadoop fs -put` command. I now need to access the CSV file using pyspark. Its format is something like
`plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')`
I am a newbie to HDFS. How do I find the address to be placed in `hdfs://x.x.x.x`?
Here's the output when I entered:
hduser@remus:~$ hdfs dfs -ls /input
Found 1 items
-rw-r--r-- 1 hduser supergroup 158 2015-06-12 14:13 /input/test.csv
Any help is appreciated.
You need to provide the full path of your files in HDFS, and the URL is defined in your Hadoop configuration, in core-site.xml or hdfs-site.xml. Check your core-site.xml and hdfs-site.xml to get the details about the URL.

An easy way to find the URL is to access your HDFS from the browser and take the path from there.

If you are referring to a file on the local file system, use `file:///<your path>` instead.
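For reference, the URL comes from the `fs.defaultFS` property (called `fs.default.name` in older Hadoop versions) in core-site.xml. A typical entry looks like this; the hostname and port below are examples, not values from the question:

```xml
<!-- core-site.xml: the NameNode URL that clients use as the default filesystem -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```

You can also print it from the command line with `hdfs getconf -confKey fs.defaultFS`.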
Try specifying the absolute path without `hdfs://`:

plaintext_rdd = sc.textFile('/input/test.csv')

When Spark runs on the same cluster as HDFS, it uses hdfs:// as the default FS.
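Putting the two answers together: the full URL is just the `fs.defaultFS` value from core-site.xml joined with the absolute path shown by `hdfs dfs -ls`. A minimal sketch, assuming a hypothetical helper name and an example NameNode address:

```python
def hdfs_url(default_fs: str, path: str) -> str:
    """Join the NameNode URL (fs.defaultFS from core-site.xml)
    with an absolute HDFS path, e.g. the one shown by `hdfs dfs -ls`."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute, e.g. /input/test.csv")
    return default_fs.rstrip("/") + path

# Example NameNode address; replace with your cluster's fs.defaultFS value.
url = hdfs_url("hdfs://localhost:9000", "/input/test.csv")
print(url)  # hdfs://localhost:9000/input/test.csv
# Then: plaintext_rdd = sc.textFile(url)
```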
Start the spark shell or spark-submit by pointing to a package which can read CSV files, like below:
spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
And in the Spark code, you can read the CSV file as below:
val data_df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.schema(<pass schema if required>)
.load(<location in HDFS/S3>)
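Since the question is about pyspark, the Scala snippet above translates roughly as follows. This is a sketch, not tested against a cluster: it assumes the same spark-csv package was loaded via `--packages`, and the URL is an example placeholder to replace with your own NameNode address and path:

```
# PySpark equivalent of the Scala snippet above; requires a running
# Spark cluster and the com.databricks:spark-csv package on the classpath.
data_df = (sqlContext.read
           .format("com.databricks.spark.csv")
           .option("header", "true")
           .load("hdfs://localhost:9000/input/test.csv"))  # example URL
```

On Spark 2.x and later, `spark.read.csv(...)` is built in and no external package is needed.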