
How to convert a binary file into an RDD or DataFrame?


The link shows how to convert a text file into an RDD and then into a DataFrame.

So how should I deal with a binary file?

Could someone give an example? Thank you very much.

There is a similar question without an answer here: reading binary data into (py) spark DataFrame

To be more specific, I don't know how to parse the binary file. For example, I can parse a text file into lines or words like this:

JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
  new Function<String, Person>() {
    public Person call(String line) throws Exception {
      // Each line is one comma-separated record.
      String[] parts = line.split(",");

      Person person = new Person();
      // set the fields of person from parts here

      return person;
    }
  });

It seems that I just need an API that can parse a binary file or binary stream in a similar way:

JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.bin").map(
  new Function<String, Person>() {
    public Person call(/*stream or binary file*/) throws Exception {
      /*code to construct every row*/
      return person;
    }
  });

EDIT: The binary file contains structured data (a table from a relational database; the database is self-made), and I know the metadata of the structured data. I plan to convert the structured data into an RDD[Row].

I can change anything about the binary file's format, because I write the binary stream into HDFS myself using the FileSystem API ( http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html ). The binary file is splittable. I have no idea how to parse the binary file the way the example code above parses text, so I haven't been able to try anything so far.

There is a binary record reader already available in Spark (I believe it is available in 1.3.1, at least in the Scala API).

sc.binaryRecords(path: String, recordLength: Int, conf)

It reads a file of fixed-length records and returns each record as a byte array. It's up to you, though, to convert those byte arrays into a format suitable for processing.
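As a rough illustration of that conversion step, here is a minimal sketch in plain Java that decodes one fixed-length record into an object. The record layout (a 4-byte big-endian age followed by a 16-byte zero-padded UTF-8 name) and the names BinaryRecordParser, Person, parse and encode are all assumptions made up for this example; the real layout has to come from your own database's metadata.

```java
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical fixed-length layout (an assumption, not taken from the question):
//   bytes 0..3   age, 32-bit integer, big-endian
//   bytes 4..19  name, UTF-8, padded on the right with zero bytes
class BinaryRecordParser {
    static final int NAME_LENGTH = 16;
    static final int RECORD_LENGTH = 4 + NAME_LENGTH;

    // Serializable so that instances could be shipped between Spark executors.
    static class Person implements Serializable {
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Decode a single record of exactly RECORD_LENGTH bytes.
    static Person parse(byte[] record) {
        if (record.length != RECORD_LENGTH) {
            throw new IllegalArgumentException("unexpected record length: " + record.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.BIG_ENDIAN);
        int age = buf.getInt();
        byte[] nameBytes = new byte[NAME_LENGTH];
        buf.get(nameBytes);
        // The name ends at the first zero byte, or fills the whole field.
        int len = 0;
        while (len < NAME_LENGTH && nameBytes[len] != 0) {
            len++;
        }
        String name = new String(nameBytes, 0, len, StandardCharsets.UTF_8);
        return new Person(name, age);
    }

    // Encode a record in the same layout; only used here to build sample data.
    static byte[] encode(String name, int age) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        if (nameBytes.length > NAME_LENGTH) {
            throw new IllegalArgumentException("name too long for the fixed-width field");
        }
        ByteBuffer buf = ByteBuffer.allocate(RECORD_LENGTH).order(ByteOrder.BIG_ENDIAN);
        buf.putInt(age);
        buf.put(nameBytes); // the remaining bytes stay zero
        return buf.array();
    }

    public static void main(String[] args) {
        Person p = parse(encode("Michael", 29));
        System.out.println(p.name + " is " + p.age);
    }
}
```

With a JavaSparkContext named sc, something along the lines of sc.binaryRecords(path, BinaryRecordParser.RECORD_LENGTH).map(BinaryRecordParser::parse) should then give a JavaRDD of Person objects, assuming every record in the file really has that fixed length. To build a DataFrame from it, Person would probably also need JavaBean-style getters, or the parse step could build Row objects directly.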
