
Control/configure the Apache Spark UTF encoding used when writing with saveAsTextFile

So, how do I tell Spark which UTF encoding to use when calling saveAsTextFile(path)? Obviously, if all the strings are known to be UTF-8, it would save 2x the space on disk! (assuming the default encoding is UTF-16, as in Java)
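As a quick check of that 2x figure (a minimal sketch using only the JDK string APIs): for ASCII-only text, UTF-16 spends two bytes per character where UTF-8 spends one.

// Plain Scala/Java string APIs, no Spark required.
val s = "hello"
println(s.getBytes("UTF-8").length)    // 5  -- one byte per ASCII character
println(s.getBytes("UTF-16LE").length) // 10 -- two bytes per character, no BOM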

saveAsTextFile actually uses Text from Hadoop, and Text is encoded as UTF-8.

def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
}

From Text.java:

public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {

  static final int SHORT_STRING_MAX = 1024 * 1024;

  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
    new ThreadLocal<CharsetEncoder>() {
      protected CharsetEncoder initialValue() {
        return Charset.forName("UTF-8").newEncoder().
               onMalformedInput(CodingErrorAction.REPORT).
               onUnmappableCharacter(CodingErrorAction.REPORT);
      }
    };

  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
    new ThreadLocal<CharsetDecoder>() {
      protected CharsetDecoder initialValue() {
        return Charset.forName("UTF-8").newDecoder().
               onMalformedInput(CodingErrorAction.REPORT).
               onUnmappableCharacter(CodingErrorAction.REPORT);
      }
    };
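A quick way to observe that UTF-8 encoder in action (assuming hadoop-common is on the classpath):

import org.apache.hadoop.io.Text

// Text.getLength returns the size of the underlying UTF-8 byte array:
// "é" is a single Java char, but it occupies two bytes inside Text.
val t = new Text("é")
println(t.getLength)                   // 2
println("é".getBytes("UTF-16").length) // 4 -- 2-byte BOM plus one 2-byte code unit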

If you want to save as UTF-16, I think you can use saveAsHadoopFile with org.apache.hadoop.io.BytesWritable and write out the bytes of the Java String (which, as you say, will be UTF-16). Something like this:
saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]](path)
You can get the bytes from "...".getBytes("UTF-16").
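Putting both pieces together, a minimal sketch of that workaround (the sample data and output path are made up, and sc is an existing SparkContext):

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapred.SequenceFileOutputFormat

// Encode each string to UTF-16 yourself and store the raw bytes in a SequenceFile.
val lines = sc.parallelize(Seq("héllo", "wörld"))
lines.map(s => (NullWritable.get(), new BytesWritable(s.getBytes("UTF-16"))))
  .saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]]("/tmp/utf16-out")

One caveat: Java's "UTF-16" charset prepends a byte-order mark to every encoded string, so "UTF-16BE" or "UTF-16LE" is cheaper if you control both the writer and the reader.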
