How do I determine an offset in Apache Spark?
I'm searching through some data files (~20GB). I'd like to find some specific terms in that data and mark the offset for the matches. Is there a way to have Spark identify the offset for the chunk of data I'm operating on?
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class Grep {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setMaster("spark://ourip:7077");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    JavaRDD<String> data = jsc.textFile("hdfs://ourip/test/testdata.txt"); // load the data from HDFS
    JavaRDD<String> filterData = data.filter(new Function<String, Boolean>() {
      // I'd like to do something here to get the offset in the original file of the string "babe ruth"
      public Boolean call(String s) { return s.toLowerCase().contains("babe ruth"); } // case-insensitive matching
    });
    long matches = filterData.count(); // executes the filter and counts the hits
    System.out.println("Lines with search terms: " + matches);
  } // end main
} // end class Grep
I'd like to do something in the "filter" operation to compute the offset of "babe ruth" in the original file. I can get the offset of "babe ruth" in the current line, but what's the process or function that tells me the offset of the line within the file?
In Spark, the common Hadoop input formats can be used. To read the byte offset from the file you can use the class TextInputFormat from Hadoop (org.apache.hadoop.mapreduce.lib.input). It is already bundled with Spark.
It will read the file as key (byte offset) and value (text line):
An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.
In Spark it can be used by calling newAPIHadoopFile():
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import scala.Tuple2;

SparkConf conf = new SparkConf().setMaster("");
JavaSparkContext jsc = new JavaSparkContext(conf);

// read the content of the file using the Hadoop input format
JavaPairRDD<LongWritable, Text> data = jsc.newAPIHadoopFile(
        "file_path",           // input path
        TextInputFormat.class, // input format class to use
        LongWritable.class,    // class of the key
        Text.class,            // class of the value
        new Configuration());

JavaRDD<String> mapped = data.map(new Function<Tuple2<LongWritable, Text>, String>() {
    @Override
    public String call(Tuple2<LongWritable, Text> tuple) throws Exception {
        // each record arrives as a tuple (offset, text)
        long pos = tuple._1().get();         // extract the byte offset
        String line = tuple._2().toString(); // extract the line text
        return pos + " " + line;
    }
});
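To make the key/value pairing concrete, here's a small non-Spark sketch (class name and sample data are made up) that computes the same (offset, line) records for in-memory text. It assumes single-byte characters and '\n' line endings; the real TextInputFormat counts bytes and also handles '\r\n':

```java
import java.util.ArrayList;
import java.util.List;

public class TextOffsets {
    // Simulates the (key, value) records TextInputFormat emits:
    // key = offset of the line start, value = the line text.
    // Assumes single-byte characters and '\n' delimiters.
    static List<String> recordsOf(String content) {
        List<String> records = new ArrayList<>();
        long pos = 0;
        for (String line : content.split("\n", -1)) {
            records.add(pos + "\t" + line);
            pos += line.length() + 1; // +1 for the stripped '\n' delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        for (String r : recordsOf("hank aaron\nbabe ruth\nty cobb")) {
            System.out.println(r); // prints "0  hank aaron", "11  babe ruth", "21  ty cobb"
        }
    }
}
```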
You could use the wholeTextFiles(String path, int minPartitions) method from JavaSparkContext to return a JavaPairRDD<String,String> where the key is the filename and the value is a string containing the entire content of the file (thus, each record in this RDD represents a file). From here, simply run a map() that calls indexOf(String searchString) on each value. This should return the first index at which the string in question occurs in each file.
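Outside Spark, the per-file step boils down to String.indexOf; a minimal sketch of what the map() would do per (filename, content) pair (class and method names are hypothetical, and lower-casing both sides mirrors the case-insensitive match in the question):

```java
public class FileSearch {
    // Returns the first index of searchString in content, or -1 if absent,
    // matching case-insensitively like the question's filter does.
    static int firstIndexOf(String content, String searchString) {
        return content.toLowerCase().indexOf(searchString.toLowerCase());
    }

    public static void main(String[] args) {
        String content = "Hank Aaron\nBabe Ruth\nTy Cobb";
        System.out.println(firstIndexOf(content, "babe ruth")); // prints 11
    }
}
```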
(EDIT:)
So finding the offset in a distributed fashion for one file (per your use case below in the comments) is possible. Below is an example that works in Scala.
val searchString = *search string*
val rdd1 = sc.textFile(*input file*, *num partitions*)
// Zip RDD lines with their indices
val zrdd1 = rdd1.zipWithIndex()
// Find the first RDD line that contains the string in question
val firstFind = zrdd1.filter { case (line, index) => line.contains(searchString) }.first()
// Grab all lines before the line containing the search string and sum up all of their lengths (and then add the inline offset)
val filterLines = zrdd1.filter { case (line, index) => index < firstFind._2 }
val offset = filterLines.map { case (line, index) => line.length }.reduce(_ + _) + firstFind._1.indexOf(searchString)
Note that you would additionally need to add any new line characters manually on top of this since they are not accounted for (the input format uses new lines as demarcations between records). The number of new lines is simply the number of lines before the line containing the search string, so this is trivial to add.
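The arithmetic above (sum of preceding line lengths + one newline per preceding line + the in-line index) can be checked locally; here is a small sketch with made-up sample data, assuming '\n' line endings:

```java
public class LineOffset {
    // Computes the file offset of the first occurrence of searchString:
    // for each non-matching line, add its length plus one for the '\n'
    // that textFile strips; on the first match, add the in-line index.
    static long offsetOf(String[] lines, String searchString) {
        long offset = 0;
        for (String line : lines) {
            int idx = line.indexOf(searchString);
            if (idx >= 0) {
                return offset + idx;
            }
            offset += line.length() + 1; // +1 accounts for the stripped '\n'
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        String[] lines = { "hank aaron", "babe ruth", "ty cobb" };
        System.out.println(offsetOf(lines, "ruth")); // prints 16
    }
}
```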
I'm not entirely familiar with the Java API unfortunately, and it's not exactly easy to test, so I'm not sure if the code below works, but have at it. (Also, I used Java 1.7, but 1.8 compresses a lot of this code with lambda expressions.):
final String searchString = *search string*;

JavaRDD<String> data = jsc.textFile("hdfs://ourip/test/testdata.txt");
JavaPairRDD<String, Long> zrdd1 = data.zipWithIndex();

// Find the first RDD line that contains the string in question
final Tuple2<String, Long> firstFind = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
    public Boolean call(Tuple2<String, Long> input) { return input._1().contains(searchString); }
}).first();

// Grab all lines before the line containing the search string
JavaPairRDD<String, Long> filterLines = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
    public Boolean call(Tuple2<String, Long> input) { return input._2() < firstFind._2(); }
});

// Sum their lengths, then add the in-line offset of the match
long offset = filterLines.map(new Function<Tuple2<String, Long>, Integer>() {
    public Integer call(Tuple2<String, Long> input) { return input._1().length(); }
}).reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
}) + firstFind._1().indexOf(searchString);
This can only be done when your input is one file (since otherwise, zipWithIndex() wouldn't guarantee offsets within a file), but this method works for an RDD with any number of partitions, so feel free to partition your file into any number of chunks.