
How to convert a binary file into an RDD or DataFrame?

http://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds

The link shows how to load a text file into an RDD and then convert it to a DataFrame.

How should a binary file be handled?

An example would be much appreciated. Thank you very much.

There is a similar question without an answer here: reading binary data into (py) spark DataFrame

To be more specific, I don't know how to parse the binary file. For example, I can parse a text file into lines or words like this:

JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
  new Function<String, Person>() {
    public Person call(String line) throws Exception {
      String[] parts = line.split(",");

      Person person = new Person();
      person.setName(parts[0]);
      person.setAge(Integer.parseInt(parts[1].trim()));

      return person;
    }
  });

It seems that I just need an API that can parse the binary file or binary stream in the same way:

 JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.bin").map(
      new Function<String, Person>() {
        public Person call(/*stream or binary file*/) throws Exception {
          /*code to construct every row*/
          return person;
        }
      });

EDIT: The binary file contains structured data (a table from a relational database; the database is self-made), and I know the metadata describing that structure. I plan to convert the structured data into an RDD[Row].

I can change anything about the binary file's layout, because I write the binary stream into HDFS myself using the FileSystem API ( http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html ). The binary file is also splittable. I have no idea how to parse the binary file in the style of the example code above, so I have not been able to try anything so far.

There is a binary record reader already available in Spark (I believe it is available as of 1.3.1, at least in the Scala API).

sc.binaryRecords(path: String, recordLength: Int, conf)

It reads fixed-length records and hands each one back as a byte array. It is up to you, though, to convert those bytes into a format suitable for processing.
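As a rough illustration of that conversion step, here is a minimal sketch of decoding one fixed-length record with java.nio.ByteBuffer. The record layout (a 4-byte big-endian age followed by a 16-byte zero-padded UTF-8 name), the class name PersonRecord, and the encode helper are all assumptions made up for this example; the real layout would come from your own table metadata.

```java
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical record layout, for illustration only:
//   bytes 0..3   age, 4-byte big-endian int
//   bytes 4..19  name, UTF-8, padded on the right with zero bytes
final class PersonRecord implements Serializable {
    static final int NAME_BYTES = 16;
    static final int RECORD_LENGTH = 4 + NAME_BYTES;

    final String name;
    final int age;

    PersonRecord(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Turns one fixed-length record (one byte[] element of the RDD) into an object.
    static PersonRecord decode(byte[] record) {
        if (record.length != RECORD_LENGTH) {
            throw new IllegalArgumentException("unexpected record length: " + record.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.BIG_ENDIAN);
        int age = buf.getInt();
        byte[] nameBytes = new byte[NAME_BYTES];
        buf.get(nameBytes);
        // The name ends at the first zero byte, or fills the whole field.
        int end = 0;
        while (end < NAME_BYTES && nameBytes[end] != 0) {
            end++;
        }
        return new PersonRecord(new String(nameBytes, 0, end, StandardCharsets.UTF_8), age);
    }

    // Writer-side counterpart, so that every record has exactly RECORD_LENGTH bytes.
    // Names longer than NAME_BYTES are truncated at the byte level.
    static byte[] encode(String name, int age) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(RECORD_LENGTH).order(ByteOrder.BIG_ENDIAN);
        buf.putInt(age);
        buf.put(nameBytes, 0, Math.min(nameBytes.length, NAME_BYTES));
        return buf.array(); // unused trailing bytes stay zero
    }
}
```

With the Java API, something along the lines of sc.binaryRecords(path, PersonRecord.RECORD_LENGTH).map(PersonRecord::decode) should then yield a JavaRDD<PersonRecord>, assuming a JavaSparkContext named sc and Java 8 method references. From there a DataFrame could be built as in the linked guide, which may require turning the class into a JavaBean with getters and setters.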
