
Extracting content from the Hadoop Text object

I am working with a large text stored in a Text object from the Hadoop (0.20.203.0) Java library. I need to extract XML content from it without converting the whole object to a Java String (that is, without calling .toString()).

Could someone please give an example of how to do this?

Reading the documentation ( http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html ), I assume that I will need to use one of the .decode() functions.

Text t = "....<content>secret</content>...."
int start = t.find("<content>");
int end = t.find("</content>", start);
t.decode(String.getBytes(), start+7, end);

I don't understand how to use the first parameter of the function, though.

Your code looks mostly correct. The first parameter of decode is the byte array you want to create a String from.

From the docs:

public static String decode(byte[] utf8, int start, int length) 

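The parameter is called utf8 only to indicate that the byte buffer is expected to be UTF-8 encoded (the encoding Text uses internally). Since decode is a static function, you call it on the class and pass in the bytes of your Text object. Note also that the third argument is a length in bytes rather than an end position, and that "<content>" is 9 characters long, not 7. So your code would be:

Text.decode(t.getBytes(), start+9, end-start-9);

Also, looking at the source for Text, this should not increase your memory footprint, because getBytes() returns a reference to the underlying byte array that the Text object holds. Only the first getLength() bytes of that array are valid data.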
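
The argument order of decode is easy to get wrong, so here is a minimal sketch of the (start, length) convention. It deliberately avoids the Hadoop dependency: the decode helper below is a hypothetical stand-in built on the JDK's String(byte[], int, int, Charset) constructor, which takes the same kind of offset and length arguments. The class name and the sample string are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class DecodeArgsDemo {
    // Hypothetical stand-in for Text.decode(byte[] utf8, int start, int length):
    // decodes `length` bytes of UTF-8 beginning at byte offset `start`.
    static String decode(byte[] utf8, int start, int length) {
        return new String(utf8, start, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sample = "....<content>secret</content>....";
        byte[] utf8 = sample.getBytes(StandardCharsets.UTF_8);

        // The sample is pure ASCII, so character indexes equal byte offsets here.
        int start = sample.indexOf("<content>");          // 4
        int end = sample.indexOf("</content>", start);    // 19
        int advance = "<content>".length();               // 9

        // Correct: skip the whole opening tag, then pass a length.
        System.out.println(decode(utf8, start + advance, end - start - advance));
        // prints: secret

        // The call from the question: skips only 7 bytes and passes `end` as the length.
        System.out.println(decode(utf8, start + 7, end));
        // prints: t>secret</content>.
    }
}
```

With a longer tail after the closing tag the second call would simply return more trailing text; with a shorter one it would fail because the requested range runs past the end of the array.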

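By the way, I found a solution to the specific problem of extracting the content between two XML tags:

int start = t.find("<content>", 0);        // byte offset of the opening tag, or -1 if absent
int end = t.find("</content>", start);     // byte offset of the closing tag, or -1 if absent
int advance = "<content>".length();        // the tag is pure ASCII, so chars == bytes

String content = null;
try {
  // The third argument is the number of bytes to decode, not an end offset.
  content = Text.decode(t.getBytes(), start+advance, end-start-advance);
} catch (IOException e) {
  System.out.println("IOException was " + e.getMessage());
}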

The last parameter is the length of the content to extract, not its final position (which was the mistake in the initial post).
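
For completeness, here is a rough sketch of the same offset arithmetic wrapped in a helper that also handles a missing tag. It is not Hadoop code: find is a naive, hypothetical stand-in for Text.find, the limit parameter plays the role of Text.getLength(), and the JDK String constructor stands in for Text.decode, so the sketch runs with the JDK alone. Tag lengths are measured in bytes rather than characters, since the positions being searched are byte offsets in the UTF-8 buffer.

```java
import java.nio.charset.StandardCharsets;

final class TagContent {
    // Naive byte search (hypothetical stand-in for Text.find): returns the byte
    // offset of the first match at or after `from` within the first `limit`
    // bytes, or -1 if there is none.
    static int find(byte[] haystack, int limit, byte[] needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length <= limit; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    // Returns the text between <tag> and </tag>, or null if either tag is missing.
    // Only the byte range between the tags is decoded into a String.
    static String between(byte[] utf8, int limit, String tag) {
        byte[] open = ("<" + tag + ">").getBytes(StandardCharsets.UTF_8);
        byte[] close = ("</" + tag + ">").getBytes(StandardCharsets.UTF_8);

        int start = find(utf8, limit, open, 0);
        if (start < 0) {
            return null;
        }
        int contentStart = start + open.length;   // advance in bytes, not chars
        int end = find(utf8, limit, close, contentStart);
        if (end < 0) {
            return null;
        }
        // Offset and length, as with decode.
        return new String(utf8, contentStart, end - contentStart, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = "....<content>secret</content>....".getBytes(StandardCharsets.UTF_8);
        System.out.println(between(utf8, utf8.length, "content")); // prints: secret
    }
}
```

With a real Text object, the equivalent inputs would be t.getBytes() and t.getLength(); passing the array's own length instead could include stale bytes beyond the valid data.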
