
How to load local file in sc.textFile, instead of HDFS

I'm following the great Spark tutorial.

So at 46m:00s I'm trying to load README.md, but it fails. What I'm doing is this:

$ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash
bash-4.1# cd /usr/local/spark-1.1.0-bin-hadoop2.4
bash-4.1# ls README.md
README.md
bash-4.1# ./bin/spark-shell
scala> val f = sc.textFile("README.md")
14/12/04 12:11:14 INFO storage.MemoryStore: ensureFreeSpace(164073) called with curMem=0, maxMem=278302556
14/12/04 12:11:14 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 160.2 KB, free 265.3 MB)
f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12
scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox:9000/user/root/README.md
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)

How can I load that README.md?

Try explicitly specifying sc.textFile("file:///path to the file/"). The error occurs when a Hadoop environment is set.

SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.getDefaultUri if the scheme is absent. This method reads the "fs.defaultFS" parameter of the Hadoop conf. If you set the HADOOP_CONF_DIR environment variable, the parameter is usually set to "hdfs://..."; otherwise "file://".
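A minimal sketch of the difference (the local path is the one used further down in this thread, and the HDFS path is the one from the error above):

 // Explicit file:// scheme: always read from the local filesystem, regardless of fs.defaultFS
 val local = sc.textFile("file:///usr/local/spark/README.md")
 // Explicit hdfs:// scheme: always read from HDFS
 val remote = sc.textFile("hdfs://sandbox:9000/user/root/README.md")
 // No scheme: resolved against fs.defaultFS, which becomes hdfs://... once HADOOP_CONF_DIR is set
 val ambiguous = sc.textFile("README.md")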

gonbe's answer is excellent. But I still want to mention that file:/// refers to the filesystem root (~/../../), not $SPARK_HOME. Hope this saves some time for newbies like me.
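In other words, to load the README.md that sits under $SPARK_HOME you still need its full absolute path (using the install directory from the question above):

 val f = sc.textFile("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md")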

While Spark supports loading files from the local filesystem, it requires that the files be available at the same path on all nodes in your cluster.

Some network filesystems, like NFS, AFS, and MapR's NFS layer, are exposed to the user as a regular filesystem.

If your data is already in one of these systems, then you can use it as an input by just specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on each node. Every node needs to have the same path.

 rdd = sc.textFile("file:///path/to/file")

If your file isn't already on all nodes in the cluster, you can load it locally on the driver without going through Spark and then call parallelize to distribute the contents to the workers.
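A minimal sketch of that approach in the spark-shell (the path is a placeholder):

 import scala.io.Source
 // Read the file on the driver only, then distribute its lines as an RDD.
 val lines = Source.fromFile("/path/to/file").getLines().toList
 val rdd = sc.parallelize(lines)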

Take care to put file:// in front, and use "/" or "\\" according to your OS.

You just need to specify the path of the file as "file:///directory/file".

Example:

val textFile = sc.textFile("file:///usr/local/spark/README.md")

Attention:

Make sure that you run Spark in local mode when you load data from a local path (sc.textFile("file:///path to the file/")), or you will get an error like Caused by: java.io.FileNotFoundException: File file:/data/sparkjob/config2.properties does not exist. This is because executors running on different workers will not find this file on their local path.
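For illustration, using the path from that error message, the intended flow is roughly:

 $ spark-shell --master local
 scala> val props = sc.textFile("file:///data/sparkjob/config2.properties")
 scala> props.count()   // works because the driver and executors share the same machine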

I have a file called NewsArticle.txt on my Desktop.

In Spark, I typed:

val textFile = sc.textFile("file:///C:/Users/582767/Desktop/NewsArticle.txt")

I needed to change all the \\ characters to / in the filepath.

To test if it worked, I typed:

textFile.foreach(println)

I'm running Windows 7 and I don't have Hadoop installed.

This has been discussed on the Spark mailing list; please refer to this mail.

You should use hadoop fs -put <localsrc> ... <dst> to copy the file into HDFS:

${HADOOP_COMMON_HOME}/bin/hadoop fs -put /path/to/README.md README.md
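After the copy, the call from the question should work unchanged, because the relative path now resolves against HDFS (hdfs://sandbox:9000/user/root/README.md in the setup above):

 scala> val f = sc.textFile("README.md")
 scala> f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()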

If the file is located on your Spark master node (e.g., when using AWS EMR), then launch the spark-shell in local mode first.

$ spark-shell --master=local
scala> val df = spark.read.json("file:///usr/lib/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

Alternatively, you can first copy the file from the local file system to HDFS and then launch Spark in its default mode (e.g., YARN in the case of AWS EMR) to read the file directly.

$ hdfs dfs -mkdir -p /hdfs/spark/examples
$ hadoop fs -put /usr/lib/spark/examples/src/main/resources/people.json /hdfs/spark/examples
$ hadoop fs -ls /hdfs/spark/examples
Found 1 items
-rw-r--r--   1 hadoop hadoop         73 2017-05-01 00:49 /hdfs/spark/examples/people.json

$ spark-shell
scala> val df = spark.read.json("/hdfs/spark/examples/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

This happened to me with Spark 2.3, with Hadoop also installed under the common "hadoop" user home directory. Since both Spark and Hadoop were installed under the same common directory, Spark by default considers the scheme to be hdfs and starts looking for the input files under HDFS, as specified by fs.defaultFS in Hadoop's core-site.xml. In such cases, we need to explicitly specify the scheme as file:///<absolute path to file>.
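A hedged example for that setup (the path below is just a placeholder for an absolute local path):

 // Spark 2.x: the same rule applies to the DataFrame/Dataset readers
 val ds = spark.read.textFile("file:///home/hadoop/data.txt")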

This is the solution for the error I was getting on a Spark cluster hosted in Azure on a Windows cluster:

Load the raw HVAC.csv file and parse it using the function:

data = sc.textFile("wasb:///HdiSamples/SensorSampleData/hvac/HVAC.csv")

We use (wasb:///) to allow Hadoop to access the Azure blob storage file, and the three slashes are a relative reference to the running node's container folder.

For example, if the path of your file in File Explorer in the Spark cluster dashboard is:

sflcc1\\sflccspark1\\HdiSamples\\SensorSampleData\\hvac

The path breaks down as follows: sflcc1 is the name of the storage account, and sflccspark is the cluster node name.

So we refer to the current cluster node name with the relative three slashes.
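For reference, the fully-qualified form of a wasb URI spells out the container and storage account explicitly; the names in angle brackets below are placeholders, not values from this cluster:

 data = sc.textFile("wasb://<container>@<storage-account>.blob.core.windows.net/HdiSamples/SensorSampleData/hvac/HVAC.csv")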

Hope this helps.

If you're trying to read a file from HDFS, try setting the path in SparkConf:

 val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSFileReader")
 conf.set("fs.defaultFS", "hdfs://hostname:9000")
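Note that, depending on the Spark version, a plain fs.defaultFS entry on SparkConf may not be propagated to the underlying Hadoop configuration; a sketch that sets it on the context's Hadoop configuration instead (the hostname is a placeholder, and the HDFS path is the one from the question):

 val sc = new SparkContext(conf)
 sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://hostname:9000")
 val rdd = sc.textFile("/user/root/README.md")   // resolved against HDFS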

You do not have to use sc.textFile(...) to convert local files into DataFrames. One option is to read a local file line by line and then transform it into a Spark Dataset. Here is an example for a Windows machine in Java:

import static org.apache.spark.sql.types.DataTypes.StringType;
import static org.apache.spark.sql.types.DataTypes.createStructField;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Define the schema of the resulting DataFrame.
StructType schemata = DataTypes.createStructType(
        new StructField[]{
                createStructField("COL1", StringType, false),
                createStructField("COL2", StringType, false),
                ...
        }
);

String separator = ";";
String filePath = "C:\\work\\myProj\\myFile.csv";
SparkContext sparkContext = new SparkContext(new SparkConf().setAppName("MyApp").setMaster("local"));
JavaSparkContext jsc = new JavaSparkContext(sparkContext);
SQLContext sqlContext = SQLContext.getOrCreate(sparkContext);

// Read the local file line by line on the driver.
List<String[]> result = new ArrayList<>();
try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] vals = line.split(separator);
        result.add(vals);
    }
} catch (Exception ex) {
    System.out.println(ex.getMessage());
    throw new RuntimeException(ex);
}

// Distribute the parsed lines to the cluster and build a DataFrame from them.
JavaRDD<String[]> jRdd = jsc.parallelize(result);
JavaRDD<Row> jRowRdd = jRdd.map(RowFactory::create);
Dataset<Row> data = sqlContext.createDataFrame(jRowRdd, schemata);

Now you can use the DataFrame data in your code.
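For a quick sanity check of the resulting DataFrame (same Java snippet as above):

 data.show();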

I tried the following and it worked from my local file system. Basically, Spark can read from local, HDFS, and AWS S3 paths.

listrdd=sc.textFile("file:////home/cloudera/Downloads/master-data/retail_db/products")

Try:

val f = sc.textFile("./README.md")
