How to determine the number of bytes of each line of a file in Java?

Question

I have a very large text file. I want to determine the number of bytes in each line and save the result to another file.

Answer 1

With java.io.BufferedReader, you can easily read each line as a separate String. The number of bytes a line occupies depends on the encoding used. For a single-byte encoding such as ASCII, you can simply use the length of the String, since each character takes up one byte. For multi-byte encodings like UTF-8, you need a more involved approach.
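To illustrate the encoding dependence, here is a minimal sketch that encodes the same String with different charsets and compares the byte counts. The class name LineByteLength and the helper byteLength are made up for this example. Note that BufferedReader.readLine() drops the line terminator, so the separator bytes are not part of the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class LineByteLength {

    /** Number of bytes the text occupies when encoded with the given charset. */
    static int byteLength(String line, Charset charset) {
        return line.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII:
        // \u00e9 is e-acute, \u20ac is the euro sign.
        String line = "h\u00e9llo \u20ac";

        System.out.println("chars:    " + line.length());
        System.out.println("UTF-8:    " + byteLength(line, StandardCharsets.UTF_8));
        System.out.println("UTF-16LE: " + byteLength(line, StandardCharsets.UTF_16LE));
    }
}
```

For pure ASCII text the char count and the UTF-8 byte count are equal; as soon as non-ASCII characters appear, they diverge.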

Answer 2

The following code excerpt prints the UTF-8 byte length of each line:

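        // Requires java.io.* and java.nio.charset.StandardCharsets.
        // readLine() strips the line terminator, so the count excludes \n or \r\n.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path + "/" + filePath), StandardCharsets.UTF_8)))
        {
            String eachLine;
            while ((eachLine = in.readLine()) != null)
            {
                byte[] chunks = eachLine.getBytes(StandardCharsets.UTF_8);
                System.out.println(chunks.length);
            }
        }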

Answer 3

Create a loop that:

Reads one line at a time.
Counts the bytes in that line.
Saves the count to another file.
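The three steps above could be sketched roughly as follows. This assumes the input is text in a known charset (UTF-8 in main), and that one decimal count per output line is an acceptable format; the class name LineLengthWriter and the method writeLineByteCounts are invented for the example. Because the file is read line by line, it never has to fit into memory as a whole.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original answer.
class LineLengthWriter {

    /** Writes the encoded byte length of every input line to the output file, one number per line. */
    static void writeLineByteCounts(Path input, Path output, Charset charset) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(input, charset);
             BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.US_ASCII)) {
            String line;
            // Step 1: read one line at a time (terminator is stripped by readLine).
            while ((line = in.readLine()) != null) {
                // Step 2: count the bytes of the line in the chosen charset.
                int count = line.getBytes(charset).length;
                // Step 3: save the count to the other file.
                out.write(Integer.toString(count));
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: LineLengthWriter <input> <output>");
            return;
        }
        writeLineByteCounts(Paths.get(args[0]), Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```

One caveat: the reader returned by Files.newBufferedReader throws an exception on bytes that are not valid in the given charset, so this only suits files whose encoding is actually known.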

Answer 4

If you have some definition of what constitutes a "line" in your big file, you can simply iterate over the file byte by byte and, at each occurrence of a line end or line start, record the current index.

For example, if you have a Unix text file (i.e. \n as the line delimiter), this may look like this:

/**
 * a simple class encapsulating information about a line in a file.
 */
public static class LineInfo {
    LineInfo(long number, long start, long end) {
       this.lineNumber = number;
       this.startPos = start;
       this.endPos = end;
       this.length = endPos - startPos;
    }
    /** the line number of the line. */
    public final long lineNumber;
    /** the index of the first byte of this line. */
    public final long startPos;
    /** the index after the last byte of this line. */
    public final long endPos;
    /** the length of this line (not including the line separators surrounding it). */
    public final long length;
}

/**
 * creates an index of a file by lines.
 * A "line" is defined by a group of bytes between '\n'
 * bytes (or start/end of file).
 *
 * For each line, a LineInfo element is created and put into the List.
 * The list is sorted by line number, start positions and end positions.
 */
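public static List<LineInfo> indexFileByLines(File f)
    throws IOException
{
    // requires java.io.* plus java.util.ArrayList and java.util.List
    List<LineInfo> infos = new ArrayList<LineInfo>();

    // try-with-resources closes the stream even if read() throws
    try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
        long index = 0, lastStart = 0, lineNumber = 0;
        int b;
        while ((b = in.read()) >= 0) {
            if (b == '\n') {
                infos.add(new LineInfo(lineNumber, lastStart, index));
                lastStart = index + 1;
                lineNumber++;
            }
            index++;
        }
        // record a final line that is not terminated by '\n'
        if (index > lastStart) {
            infos.add(new LineInfo(lineNumber, lastStart, index));
        }
    }
    return infos;
}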

This avoids any conversion of bytes to chars, and thus any encoding issues. It still depends on the line separator being \n, but the separator could be passed to the method as a parameter.

(For DOS/Windows files with \r\n as the separator, the condition is a bit more complicated, as we would either have to store the previous byte or look ahead to the next one.)
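A possible sketch of the "store the previous byte" variant is shown below. It only collects line lengths rather than full position records, and the names CrlfLineLengths and lineLengths are invented for the example. It treats \n and \r\n as terminators; a bare \r (old Mac style) is counted as ordinary content, which is an assumption of this sketch.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original answer.
class CrlfLineLengths {

    /** Byte length of each line; lines end at "\n" or "\r\n", terminators are not counted. */
    static List<Long> lineLengths(InputStream raw) throws IOException {
        List<Long> lengths = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(raw)) {
            long length = 0; // bytes seen in the current line so far
            int prev = -1;   // previous byte, -1 before the first read
            int b;
            while ((b = in.read()) >= 0) {
                if (b == '\n') {
                    // a '\r' directly before '\n' belongs to the separator
                    lengths.add(prev == '\r' ? length - 1 : length);
                    length = 0;
                } else {
                    length++;
                }
                prev = b;
            }
            // final line without a terminator
            if (length > 0) {
                lengths.add(length);
            }
        }
        return lengths;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "ab\r\ncd\n\r\nxyz".getBytes(StandardCharsets.US_ASCII);
        System.out.println(lineLengths(new ByteArrayInputStream(data)));
    }
}
```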

For easier use, a pair (or triple) of SortedMap<Long, LineInfo> instances, for example keyed by line number, start position, and end position, could be better than a list.
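As a rough illustration of why a sorted map can be handier than a list, the sketch below keys lines by their start position and uses TreeMap.floorEntry to find the line that contains a given byte offset. TreeMap is one standard SortedMap implementation; the classes LineLookup and Line are hypothetical stand-ins written for this example, with Line mirroring the fields of LineInfo.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the original answer.
class LineLookup {

    /** Minimal stand-in for LineInfo, for this sketch only. */
    static final class Line {
        final long lineNumber;
        final long startPos; // index of the first byte of the line
        final long endPos;   // index after the last byte of the line

        Line(long lineNumber, long startPos, long endPos) {
            this.lineNumber = lineNumber;
            this.startPos = startPos;
            this.endPos = endPos;
        }
    }

    /** Lines keyed by the index of their first byte. */
    private final TreeMap<Long, Line> byStart = new TreeMap<>();

    void add(Line line) {
        byStart.put(line.startPos, line);
    }

    /** Returns the line containing the byte offset, or null if it falls on a separator or outside the file. */
    Line lineAt(long offset) {
        // floorEntry finds the line with the greatest start position <= offset
        Map.Entry<Long, Line> entry = byStart.floorEntry(offset);
        if (entry == null || offset >= entry.getValue().endPos) {
            return null;
        }
        return entry.getValue();
    }
}
```

A second map keyed by line number would cover lookups in the other direction, which is presumably what the "pair" of maps refers to.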

4 answers

Answer 1
2 2011-02-28 14:35:08

Answer 2
2 2011-02-28 15:03:46

Answer 3
1 2011-02-28 14:31:56

Answer 4
0 2011-02-28 20:39:24
