
Python code for number of lines returning much higher number than Linux `wc -l`

When I do wc -l on a file in Linux (a CSV file of a couple million rows), it reports a number of lines that is lower than what this Python code shows (simply iterating over the lines in the file) by over a thousand. What would be the reason for that?

with open(csv) as csv_lines:
    num_lines = 0
    for line in csv_lines:
        num_lines += 1
    print(num_lines)

I've had cases where wc reports one less than the above, which makes sense when the file has no terminating newline character: wc seems to count only complete lines (those ending in a newline), while this code counts every line, terminated or not. But what would explain a difference of over a thousand lines?

I don't know much about line endings, so I may have misunderstood how wc and this Python code count lines; perhaps someone could clarify. The question "linux lines counting not working with python code" says that wc works by counting the number of \n characters in the file. But then what exactly is this Python code doing?

Is there a way to reconcile the difference in numbers to figure out exactly what is causing it? Like a way to calculate number of lines from Python that counts in the same way that wc does.

The file may have been generated on a platform other than Linux; I'm not sure whether that is related.

Since you are using print(num_lines) I'm assuming you are using Python 3.x, and I've used Python 3.4.2 as an example.

The reason for the different line counts is that a file opened with open(<name>) in text mode treats \r, \n, and the \r\n combination each as a line ending (see the universal newlines part of the docs). This leads to the following:

>>> with open('test', 'w') as f:
        f.write('\r\r\r\r')

>>> with open('test') as f:
        print(sum(1 for _ in f))
4

whilst wc -l gives:

$ wc -l test
0 test

The \r character was used as the newline on, for example, old Macintosh systems (classic Mac OS).
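To check whether bare \r characters account for the gap in a real file, a breakdown of the line terminators can help. The following is a minimal sketch, not a definitive tool: the function name newline_breakdown is made up for illustration, and it relies on the fact that iterating over a file opened in binary mode splits on b'\n' only.

```python
def newline_breakdown(path):
    """Count \\r\\n pairs, bare \\r, and bare \\n in a file (hypothetical helper)."""
    crlf = lone_cr = lone_lf = 0
    with open(path, 'rb') as f:
        for raw in f:  # binary mode: lines end at b'\n' only
            cr_total = raw.count(b'\r')
            if raw.endswith(b'\r\n'):
                crlf += 1
                lone_cr += cr_total - 1  # every other \r on this line is bare
            else:
                lone_cr += cr_total
                if raw.endswith(b'\n'):
                    lone_lf += 1
    return {'crlf': crlf, 'lone_cr': lone_cr, 'lone_lf': lone_lf}
```

wc -l should report crlf + lone_lf, while the text-mode loop additionally counts each bare \r (plus one more if the file ends with unterminated content), so a lone_cr value of about a thousand would explain the difference.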

If you would like to split only on \n characters, use the newline keyword argument to open:

>>> with open('test', 'w') as f:
        f.write('\r\r\r\r')

>>> with open('test', newline='\n') as f:
        print(sum(1 for _ in f))
1

The 1 comes from the point you've already mentioned: there is not a single \n character in the file, so wc -l returns 0, while Python counts the unterminated content as a single line.
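If the goal is a count from Python that matches wc -l exactly, one option is to skip text mode entirely and count the \n bytes. This is a rough sketch; count_newlines and the chunk_size default are illustrative names and values, not part of any library.

```python
def count_newlines(path, chunk_size=1 << 20):
    """Count b'\\n' bytes in a file, which is what wc -l reports."""
    total = 0
    with open(path, 'rb') as f:  # binary mode: no newline translation
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += chunk.count(b'\n')
    return total
```

Reading in fixed-size chunks keeps memory use flat even for a file with no \n at all, where line-by-line iteration would load everything as one line.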

Try taking part of the file and repeating the line count on it, to narrow down where the discrepancy comes from. For example:

# take first 10000 lines
head -10000 file.csv > file_head.csv

# take last 10000 lines
tail -10000 file.csv > file_tail.csv

# take first 100MB
dd if=file.csv of=file_100M.csv bs=1M count=100
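Once a smaller piece shows the same mismatch, it may help to locate the offending lines directly. The sketch below assumes the cause is a \r that is not part of a \r\n pair; lines_with_bare_cr and its limit parameter are hypothetical names chosen for this example.

```python
def lines_with_bare_cr(path, limit=10):
    """Return up to `limit` 1-based line numbers (as wc would number them)
    whose content contains a \\r that is not part of a trailing \\r\\n."""
    hits = []
    with open(path, 'rb') as f:
        for lineno, raw in enumerate(f, start=1):
            # Ignore the \r belonging to a Windows-style \r\n terminator.
            body = raw[:-2] if raw.endswith(b'\r\n') else raw
            if b'\r' in body:
                hits.append(lineno)
                if len(hits) >= limit:
                    break
    return hits
```

The returned numbers can be passed to something like sed -n '<N>p' file.csv to inspect the rows in question.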
