
How do I determine an offset in Apache Spark?

I'm searching through some data files (~20GB). I'd like to find some specific terms in that data and mark the offset for the matches. Is there a way to have Spark identify the offset for the chunk of data I'm operating on?

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

import java.util.regex.*;

public class Grep {
        public static void main( String args[] ) {
            SparkConf        conf       = new SparkConf().setMaster( "spark://ourip:7077" );
            JavaSparkContext jsc        = new JavaSparkContext( conf );
            JavaRDD<String>  data       = jsc.textFile( "hdfs://ourip/test/testdata.txt" ); // load the data from HDFS
            JavaRDD<String>  filterData = data.filter( new Function<String, Boolean>() {
                    // I'd like to do something here to get the offset in the original file of the string "babe ruth"
                    public Boolean call( String s ) { return s.toLowerCase().contains( "babe ruth" ); } // case-insensitive matching

            });

            long matches = filterData.count();  // count the hits (this triggers execution of the filter)

            System.out.println( "Lines with search terms: " + matches );
        } //  end main
} // end class Grep

I'd like to do something in the "filter" operation to compute the offset of "babe ruth" in the original file. I can get the offset of "babe ruth" in the current line, but what's the process or function that tells me the offset of the line within the file?

In Spark, the common Hadoop input formats can be used. To read the byte offset from the file you can use the class TextInputFormat from Hadoop (org.apache.hadoop.mapreduce.lib.input). It is already bundled with Spark.

It will read the file as key (byte offset) and value (text line):

An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.

In Spark it can be used by calling newAPIHadoopFile():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

SparkConf conf = new SparkConf().setMaster("");
JavaSparkContext jsc = new JavaSparkContext(conf);

// read the content of the file using the Hadoop input format
JavaPairRDD<LongWritable, Text> data = jsc.newAPIHadoopFile(
        "file_path",           // input path
        TextInputFormat.class, // input format class
        LongWritable.class,    // class of the key (the byte offset)
        Text.class,            // class of the value (the text line)
        new Configuration());

JavaRDD<String> mapped = data.map(new Function<Tuple2<LongWritable, Text>, String>() {
    @Override
    public String call(Tuple2<LongWritable, Text> tuple) throws Exception {
        // each record arrives as a tuple (offset, text)
        long pos = tuple._1().get(); // extract offset
        String line = tuple._2().toString(); // extract text

        return pos + " " + line;
    }
});
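
Tying this back to the original question, a minimal sketch (assuming the data pair RDD from above and the same "babe ruth" term) could filter the (offset, line) pairs and add the in-line match position to the line's byte offset; note this is only exact for single-byte encodings such as ASCII:

JavaRDD<Long> matchOffsets = data
    .filter(new Function<Tuple2<LongWritable, Text>, Boolean>() {
        public Boolean call(Tuple2<LongWritable, Text> tuple) {
            return tuple._2().toString().toLowerCase().contains("babe ruth");
        }
    })
    .map(new Function<Tuple2<LongWritable, Text>, Long>() {
        public Long call(Tuple2<LongWritable, Text> tuple) {
            long lineOffset = tuple._1().get();                                       // byte offset where the line starts
            int posInLine = tuple._2().toString().toLowerCase().indexOf("babe ruth"); // match position within the line
            return lineOffset + posInLine;
        }
    });

for (Long offset : matchOffsets.collect()) {
    System.out.println("Match at offset " + offset);
}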

You could use the wholeTextFiles(String path, int minPartitions) method from JavaSparkContext to return a JavaPairRDD<String,String> where the key is the filename and the value is a string containing the entire content of a file (thus, each record in this RDD represents a file). From here, simply run a map() that will call indexOf(String searchString) on each value. This should return the first index in each file at which the string in question occurs.
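
A minimal sketch of that idea (assuming Java 8 lambdas, a hypothetical input directory, and mapValues() to keep the filename as the key; a result of -1 means the term was not found in that file):

JavaPairRDD<String, String> files = jsc.wholeTextFiles("hdfs://ourip/test/", 4);
JavaPairRDD<String, Integer> firstOffsets =
    files.mapValues(content -> content.toLowerCase().indexOf("babe ruth"));

for (Tuple2<String, Integer> t : firstOffsets.collect()) {
    System.out.println(t._1() + " -> " + t._2());
}

Keep in mind that wholeTextFiles() loads each file entirely into memory, so it suits many small files better than a single ~20GB file, and the index it yields is a character offset rather than a byte offset.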

(EDIT:)

So finding the offset in a distributed fashion for one file (per your use case in the comments below) is possible. Below is an example that works in Scala.

val searchString = *search string*
val rdd1 = sc.textFile(*input file*, *num partitions*)

// Zip RDD lines with their indices
val zrdd1 = rdd1.zipWithIndex()

// Find the first RDD line that contains the string in question
val firstFind = zrdd1.filter { case (line, index) => line.contains(searchString) }.first()

// Grab all lines before the line containing the search string and sum up all of their lengths (and then add the inline offset)
val filterLines = zrdd1.filter { case (line, index) => index < firstFind._2 }
val offset = filterLines.map { case (line, index) => line.length }.reduce(_ + _) + firstFind._1.indexOf(searchString)

Note that you would additionally need to add any newline characters manually on top of this since they are not accounted for (the input format uses newlines as demarcations between records). The number of newlines is simply the number of lines before the line containing the search string, so this is trivial to add.

Unfortunately I'm not entirely familiar with the Java API and it's not exactly easy to test, so I'm not sure if the code below works, but have at it (also, I used Java 1.7, but 1.8 compresses a lot of this code with lambda expressions):

final String searchString = *search string*;
JavaRDD<String> data = jsc.textFile("hdfs://ourip/test/testdata.txt");

// zipWithIndex() pairs each line with its line number
JavaPairRDD<String, Long> zrdd1 = data.zipWithIndex();

// find the first (line, index) pair whose line contains the search string
final Tuple2<String, Long> firstFind = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
      public Boolean call(Tuple2<String, Long> input) { return input._1().contains(searchString); }
  }).first();

// keep only the lines that come before the matching line
JavaPairRDD<String, Long> filterLines = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
      public Boolean call(Tuple2<String, Long> input) { return input._2() < firstFind._2(); }
  });

// sum the lengths of all preceding lines, then add the in-line offset of the match
long offset = filterLines.map(new Function<Tuple2<String, Long>, Integer>() {
      public Integer call(Tuple2<String, Long> input) { return input._1().length(); }
  }).reduce(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) { return a + b; }
  }) + firstFind._1().indexOf(searchString);
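
The newline adjustment mentioned above would also apply here; a sketch, assuming one newline character per preceding line (their count is just filterLines.count()):

// each preceding line also contributed one newline character that line.length() did not count
long newlineCount = filterLines.count();
long adjustedOffset = offset + newlineCount;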

This can only be done when your input is one file (since otherwise zipWithIndex() wouldn't guarantee offsets within a file), but this method works for an RDD with any number of partitions, so feel free to partition your file into any number of chunks.
