Control/set the UTF encoding Apache Spark uses when writing with saveAsTextFile
So how do I tell Spark which UTF encoding to use when calling saveAsTextFile(path)? If all the strings are known to be UTF-8, that would save roughly 2x the disk space! (assuming the default is UTF-16, as in Java's in-memory representation)
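The rough 2x figure for ASCII-heavy data is easy to check in plain Java (a sketch; note that Java's "UTF-16" charset also prepends a 2-byte byte-order mark):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    public static void main(String[] args) {
        String s = "hello world"; // 11 ASCII characters
        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16);
        // UTF-8: one byte per ASCII char; UTF-16: two bytes per char + 2-byte BOM
        System.out.println(utf8.length);  // 11
        System.out.println(utf16.length); // 24
    }
}
```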
saveAsTextFile actually uses Hadoop's Text, and Text encodes as UTF-8:
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
}
From Text.java:
public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {

  static final int SHORT_STRING_MAX = 1024 * 1024;

  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
      new ThreadLocal<CharsetEncoder>() {
        protected CharsetEncoder initialValue() {
          return Charset.forName("UTF-8").newEncoder().
              onMalformedInput(CodingErrorAction.REPORT).
              onUnmappableCharacter(CodingErrorAction.REPORT);
        }
      };

  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
      new ThreadLocal<CharsetDecoder>() {
        protected CharsetDecoder initialValue() {
          return Charset.forName("UTF-8").newDecoder().
              onMalformedInput(CodingErrorAction.REPORT).
              onUnmappableCharacter(CodingErrorAction.REPORT);
        }
      };
If you want to save as UTF-16, I think you can use saveAsHadoopFile with org.apache.hadoop.io.BytesWritable and pass in the bytes of the Java String (which, as you say, will be UTF-16). Something like this:
saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]](path)
You can get the bytes from "...".getBytes("UTF-16").
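One caveat worth knowing before taking that route (a hedged note, not from the original answer): Java's "UTF-16" charset prepends a 2-byte byte-order mark to every encoded string, so encoding record by record inflates the output; the UTF_16BE / UTF_16LE charsets omit the BOM:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Bytes {
    public static void main(String[] args) {
        // "UTF-16" adds a BOM per call to getBytes; UTF_16BE does not.
        byte[] withBom = "ab".getBytes(StandardCharsets.UTF_16);
        byte[] noBom   = "ab".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(withBom.length); // 6  (2-byte BOM + 2 chars * 2 bytes)
        System.out.println(noBom.length);   // 4
    }
}
```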