Iterator of Strings to InputStream of bytes

Question

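I want to convert an iterator of Strings into an InputStream of bytes. Usually, I can do this by appending all the strings to a StringBuilder and then doing: InputStream is = new ByteArrayInputStream(sb.toString().getBytes());

But I want to do it lazily, because my iterator is provided by Spark and could be very large. I found this example that does it in Scala: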

def rowsToInputStream(rows: Iterator[String], delimiter: String): InputStream = {
  val bytes: Iterator[Byte] = rows.map { row =>
    (row + "\n").getBytes
  }.flatten

  new InputStream {
    override def read(): Int = if (bytes.hasNext) {
      bytes.next & 0xff // bitwise AND - make the signed byte an unsigned int from 0-255
    } else {
      -1
    }
  }
}

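But I could not find an easy way to convert this to Java. I converted the iterator into a stream using Spliterators.spliteratorUnknownSize, but then getBytes returns an array that cannot be flattened easily. Overall, it became pretty messy.

Is there an elegant way to do this in Java?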

Answer 1

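If you want an InputStream that supports fast bulk operations, you should implement the
int read(byte[] b, int off, int len) method, which not only can be called directly by the code reading the InputStream, but is also the backend for the inherited methods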

int read(byte b[])
long skip(long n)
byte[] readAllBytes() (JDK 9)
int readNBytes(byte[] b, int off, int len) (JDK 9)
long transferTo(OutputStream out) (JDK 9)
byte[] readNBytes(int len) (JDK 11)
void skipNBytes(long n) (JDK 14)

All of them work more efficiently when that method has an efficient implementation.
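The delegation described above can be observed with a small counting stream. This is only an illustrative sketch: BulkReadDemo, CountingStream, and readThroughInheritedMethod are hypothetical names, not part of the answer, and the example only exercises the inherited read(byte[]) method.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: counts which read variant the inherited method ends up calling.
class BulkReadDemo {
    static class CountingStream extends InputStream {
        private final byte[] data;
        private int pos;
        int singleReads; // number of calls to read()
        int bulkReads;   // number of calls to read(byte[], int, int)

        CountingStream(byte[] data) { this.data = data; }

        @Override
        public int read() {
            singleReads++;
            return pos < data.length ? data[pos++] & 0xff : -1;
        }

        @Override
        public int read(byte[] b, int off, int len) {
            bulkReads++;
            if (len == 0) return 0;
            if (pos >= data.length) return -1;
            int n = Math.min(len, data.length - pos);
            System.arraycopy(data, pos, b, off, n);
            pos += n;
            return n;
        }
    }

    // Drains the stream through the inherited read(byte[]) using a 4-byte chunk.
    // Returns {total bytes read, calls to read(), calls to read(byte[], int, int)}.
    static int[] readThroughInheritedMethod(byte[] data) {
        CountingStream in = new CountingStream(data);
        byte[] chunk = new byte[4];
        int total = 0;
        try {
            for (int n; (n = in.read(chunk)) != -1; ) total += n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new int[] { total, in.singleReads, in.bulkReads };
    }

    public static void main(String[] args) {
        int[] r = readThroughInheritedMethod("hello world".getBytes(StandardCharsets.US_ASCII));
        // prints: bytes=11 read()=0 bulk=4
        System.out.println("bytes=" + r[0] + " read()=" + r[1] + " bulk=" + r[2]);
    }
}
```

The single-byte read() is never touched here, which is why an efficient three-argument read pays off for every caller that uses the inherited bulk methods.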

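import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.util.Iterator;
import java.util.Objects;

public class StringIteratorInputStream extends InputStream {
    private final CharsetEncoder encoder;
    private final Iterator<String> strings;
    private CharBuffer current; // wraps the string currently being encoded
    private ByteBuffer pending; // only allocated when single bytes are read

    public StringIteratorInputStream(Iterator<String> it) {
        this(it, Charset.defaultCharset());
    }
    public StringIteratorInputStream(Iterator<String> it, Charset cs) {
        // replace invalid input instead of reporting it, like String.getBytes does
        encoder = cs.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        strings = Objects.requireNonNull(it);
    }

    @Override
    public int read() throws IOException {
        for(;;) {
            if(pending != null && pending.hasRemaining())
                return pending.get() & 0xff;
            if(!ensureCurrent()) return -1;
            if(pending == null) pending = ByteBuffer.allocate(4096);
            else pending.compact();
            encoder.encode(current, pending, !strings.hasNext());
            pending.flip();
        }
    }

    // advances to the next non-empty string; assumes that no string
    // ends in the middle of a surrogate pair
    private boolean ensureCurrent() {
        while(current == null || !current.hasRemaining()) {
            if(!strings.hasNext()) return false;
            current = CharBuffer.wrap(strings.next());
        }
        return true;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        // Objects.checkFromIndexSize(off, len, b.length); // JDK 9
        if(len == 0) return 0;
        int transferred = 0;
        // serve bytes left over from single-byte reads first
        if(pending != null && pending.hasRemaining()) {
            boolean serveByBuffer = pending.remaining() >= len;
            pending.get(b, off, transferred = Math.min(pending.remaining(), len));
            if(serveByBuffer) return transferred;
            len -= transferred;
            off += transferred;
        }
        // encode directly into the caller's array
        ByteBuffer bb = ByteBuffer.wrap(b, off, len);
        while(bb.hasRemaining() && ensureCurrent()) {
            int r = bb.remaining();
            CoderResult cr = encoder.encode(current, bb, !strings.hasNext());
            transferred += r - bb.remaining();
            if(cr.isOverflow()) {
                // the next character does not fit into the remaining space
                if(transferred > 0) break;
                // nothing delivered yet: go through the pending buffer for one byte
                int next = read();
                if(next < 0) return -1;
                b[off] = (byte)next;
                return 1;
            }
        }
        return transferred == 0? -1: transferred;
    }
}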

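A ByteBuffer is basically the combination of the byte buf[];, int pos;, and int count; variables of your solution (Answer 2). However, the pending buffer is only initialized if the caller really uses the int read() method to read single bytes. Otherwise, the code creates a ByteBuffer that wraps the caller-provided target buffer, to encode the strings directly into it.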
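To make the comparison with buf, pos, and count concrete, here is a small walk-through of how a ByteBuffer's position and limit change across the put, flip, get, and compact calls that read() relies on. ByteBufferStateDemo and its methods are hypothetical names used only for this illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class: traces position/limit through the calls used for 'pending'.
class ByteBufferStateDemo {
    static String describe(ByteBuffer bb) {
        return "pos=" + bb.position() + " lim=" + bb.limit() + " rem=" + bb.remaining();
    }

    // Returns the buffer state after each step of a write/flip/read/compact cycle.
    static String[] walkThrough() {
        String[] states = new String[4];
        ByteBuffer pending = ByteBuffer.allocate(8);
        states[0] = describe(pending);   // fresh buffer, ready for writing

        pending.put((byte) 'a').put((byte) 'b').put((byte) 'c');
        pending.flip();                  // switch from writing to reading
        states[1] = describe(pending);   // three bytes available to read

        pending.get();                   // consume one byte, like read() does
        states[2] = describe(pending);

        pending.compact();               // keep the unread bytes, switch back to writing
        states[3] = describe(pending);
        return states;
    }

    public static void main(String[] args) {
        // prints:
        // pos=0 lim=8 rem=8
        // pos=0 lim=3 rem=3
        // pos=1 lim=3 rem=2
        // pos=2 lim=8 rem=6
        for (String s : walkThrough()) System.out.println(s);
    }
}
```

In read mode, position plays the role of pos and limit the role of count; the backing array corresponds to buf.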

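The CharBuffer follows the same concept, just for char sequences. In this code, it will always be a wrapper around one of the strings rather than a buffer with storage of its own. So in the best case, this InputStream implementation encodes all iterator-provided strings into the caller-provided buffer(s), without intermediate storage.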
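The direct encoding step can be tried in isolation. The following is a minimal sketch assuming UTF-8 and a single input string; DirectEncodeDemo and encodeInto are hypothetical names, and the sample text and array size are chosen only to show a partial write.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: encodes a wrapped string straight into a caller-provided array.
class DirectEncodeDemo {
    // Encodes as much of 'text' as fits into target[off .. off+len) and
    // returns the number of bytes written. 'text' keeps track of what is left.
    static int encodeInto(CharBuffer text, CharsetEncoder encoder,
                          byte[] target, int off, int len) {
        ByteBuffer bb = ByteBuffer.wrap(target, off, len); // no copy, just a view
        encoder.encode(text, bb, true);                    // single string, so end of input
        return len - bb.remaining();
    }

    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer text = CharBuffer.wrap("h\u00e9llo");   // wraps the string, no char[] copy
        byte[] target = new byte[4];

        int first = encodeInto(text, enc, target, 0, target.length);
        // 'h' takes 1 byte, U+00E9 takes 2, 'l' takes 1: the array is full
        System.out.println(first + " bytes written, " + text.remaining() + " chars left");

        int second = encodeInto(text, enc, target, 0, target.length);
        System.out.println(second + " bytes written, " + text.remaining() + " chars left");
    }
}
```

The CharBuffer remembers how far the string has been consumed, so a later call simply continues where the previous one stopped.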

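This concept already implies lazy processing: since there is no intermediate storage, only as much as fits into the caller-provided buffer, in other words, only as much as the caller requested, is fetched from the iterator.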

Answer 2

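Following @Kayaman's advice, I took a page from ByteArrayInputStream and handled the switching of byte arrays manually using the Iterator<String>. This performs much better than the streams approach: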

import java.io.InputStream;
import java.util.Iterator;

public class StringIteratorInputStream extends InputStream {
    protected byte buf[];
    protected int pos;
    protected int count;
    private Iterator<String> rows;

    public StringIteratorInputStream(Iterator<String> rows) {
        this.rows = rows;
        this.count = -1;
    }

    private void init(byte[] buf) {
        this.buf = buf;
        this.pos = 0;
        this.count = buf.length;
    }

    @Override
    public int read() {
        // loop so that empty strings are skipped instead of
        // indexing into an empty array
        while (pos >= count) {
            if (!rows.hasNext()) {
                return -1;
            }
            init(rows.next().getBytes());
        }
        return (buf[pos++] & 0xff);
    }

}

I did not extend ByteArrayInputStream because its read is synchronized, and I don't need that.
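One way to check that this approach is lazy is to count how many strings have been pulled from the iterator after reading a given number of bytes. The sketch below uses a condensed stand-in for the class above (RowStream) plus a counting iterator; all names in it are hypothetical, and UTF-8 is assumed instead of the platform default charset.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical demo class: shows that strings are only fetched when bytes are requested.
class LazyReadDemo {
    // Iterator wrapper that counts how many elements have been requested so far.
    static class CountingIterator implements Iterator<String> {
        private final Iterator<String> source;
        int pulled;
        CountingIterator(Iterator<String> source) { this.source = source; }
        @Override public boolean hasNext() { return source.hasNext(); }
        @Override public String next() { pulled++; return source.next(); }
    }

    // Condensed stand-in for the stream above: converts one string at a time.
    static class RowStream extends InputStream {
        private final Iterator<String> rows;
        private byte[] buf = new byte[0];
        private int pos;
        RowStream(Iterator<String> rows) { this.rows = rows; }
        @Override public int read() {
            while (pos >= buf.length) {          // current array exhausted (or empty)
                if (!rows.hasNext()) return -1;
                buf = rows.next().getBytes(StandardCharsets.UTF_8);
                pos = 0;
            }
            return buf[pos++] & 0xff;
        }
    }

    // Reads n bytes from a stream over "ab", "cd", "ef" and
    // returns how many strings were pulled from the iterator.
    static int pulledAfterReading(int n) {
        CountingIterator it = new CountingIterator(Arrays.asList("ab", "cd", "ef").iterator());
        RowStream in = new RowStream(it);
        for (int i = 0; i < n; i++) in.read();
        return it.pulled;
    }

    public static void main(String[] args) {
        for (int n : new int[] { 0, 1, 2, 3, 6 })
            System.out.println(n + " bytes read -> " + pulledAfterReading(n) + " strings pulled");
    }
}
```

With two-byte strings, reading the third byte is what triggers the second call to next(), so at most one string's bytes are held in memory at a time.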
