How to process multiple files separately after SparkContext.wholeTextFiles?

I'm trying to use wholeTextFiles to read all the file names in a folder and process each file separately (for example, I'm trying to get the SVD vector of each data set, and there are 100 sets in total). The data are saved in .txt files, with values separated by spaces and arranged on separate lines (like a matrix).

The problem I came across is that after I use wholeTextFiles("path with all the text files"), it's really difficult to read and parse the data, and I can't use the same approach I used when reading only one file. That approach works fine when I read just one file and gives me the correct output. Could someone please let me know how to fix it here? Thanks!

public static void main (String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("whole text files").setMaster("local[2]").set("spark.executor.memory", "1g");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    JavaPairRDD<String, String> fileNameContentsRDD = jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/");

    JavaRDD<String[]> lineCounts = fileNameContentsRDD.map(new Function<Tuple2<String, String>, String[]>() {
        @Override
        public String[] call(Tuple2<String, String> fileNameContent) throws Exception {
            // Parse the whole file content (space-separated values) into doubles
            String content = fileNameContent._2();
            String[] sarray = content.split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++) {
                values[i] = Double.parseDouble(sarray[i]);
            }

            // This is where it breaks down: pd is never defined, an RDD cannot be created
            // or used inside a map function running on the executors, and call() never
            // returns the declared String[]
            pd.cache();
            RowMatrix mat = new RowMatrix(pd.rdd());

            SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
            Vector s = svd.s();
        }
    });
}

Quoting the scaladoc of SparkContext.wholeTextFiles:

wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file.
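To make that key-value shape concrete, here is a minimal Scala sketch that reuses the question's directory path and parses each file's content into rows of doubles; the line-by-line, whitespace-separated parsing is an assumption based on how the question describes the data:

import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("whole text files").setMaster("local[2]"))

    // One (path, content) pair per file: the entire file body arrives as a single String
    val files = sc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/")

    // Parse each file's content into rows of doubles, one row per non-empty line
    val parsed = files.mapValues { content =>
      content.split("\n").map(_.trim).filter(_.nonEmpty)
        .map(_.split("\\s+").filter(_.nonEmpty).map(_.toDouble))
    }

    parsed.take(1).foreach { case (path, rows) =>
      println(s"$path: ${rows.length} rows x ${rows.headOption.map(_.length).getOrElse(0)} columns")
    }

    sc.stop()
  }
}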

In other words, wholeTextFiles might not simply be what you want.

Since by design "Small files are preferred" (see the scaladoc), you could mapPartitions or collect (with filter) to grab a subset of the files to apply the parsing to.
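As a sketch of the collect-with-filter idea (reusing sc from the snippet above; the .txt filter is just one way to narrow the set down): since the files are small by design, a filtered subset of (path, content) pairs can be brought back to the driver and parsed there:

val subset: Array[(String, Array[Array[Double]])] =
  sc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/")
    .filter { case (path, _) => path.endsWith(".txt") }  // keep only the files you want
    .collect()                                           // small files, so collecting them is feasible
    .map { case (path, content) =>
      val rows = content.split("\n").map(_.trim).filter(_.nonEmpty)
        .map(_.split("\\s+").filter(_.nonEmpty).map(_.toDouble))
      (path, rows)
    }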

Once you have the files per partition in your hands, you could use Scala's Parallel Collection API and schedule Spark jobs to execute in parallel (see the sketch after the quote below):

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By "job", in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark's scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

By default, Spark's scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don't need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
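Putting this together for the question's use case (one SVD per file), here is a minimal sketch. It reuses sc and the parsed subset from the snippets above, uses Spark MLlib's RowMatrix, borrows k = 84 from the question's code, and assumes a Scala version (2.12 or earlier) where .par is available without an extra module:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each element of the parallel collection submits its own Spark job (parallelize + computeSVD);
// because these driver-side threads run concurrently, Spark can schedule the jobs concurrently.
val svdPerFile = subset.par.map { case (path, rows) =>
  val mat = new RowMatrix(sc.parallelize(rows.map(r => Vectors.dense(r))))
  val svd = mat.computeSVD(84, computeU = true, rCond = 1.0E-9)
  (path, svd.s)  // singular values for this data set
}.seq.toMap

Each computeSVD call triggers its own jobs, so every file becomes a separate job submitted from a separate thread, which is exactly the multi-threaded submission the scheduling documentation above describes.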
