
Python code for number of lines returning much higher number than Linux `wc -l`

When I run wc -l on a file in Linux (a CSV file of a couple million rows), it reports a line count that is lower, by over a thousand, than what this Python code shows (simply iterating over the lines in the file). What would be the reason for that?

with open(csv) as csv_lines:
    num_lines = 0
    for line in csv_lines:
        num_lines += 1
    print(num_lines)

I've had cases where wc reports one less than the above, which makes sense when the file has no terminating newline character, since wc seems to count only complete lines (including the terminating newline) while this code counts every line, terminated or not. But what would explain a difference of over a thousand lines?

I don't know much about line endings and things like that, so maybe I've misunderstood how wc and this Python code count lines, and someone could clarify. The question "linux lines counting not working with python code" says that wc works by counting the number of \n characters in the file. But then what is this Python code doing exactly?

Is there a way to reconcile the difference in numbers to figure out exactly what is causing it? For example, a way to calculate the number of lines from Python that counts in the same way that wc does.

The file was possibly generated on a different platform than Linux; I'm not sure if that might be related.

Since you are using print(num_lines) I'm assuming you are using Python 3.x, and I've used Python 3.4.2 as an example.

The reason for the different line counts is that a file opened with open(<name>) treats \r, \n and the \r\n combination each as a line terminator (see the universal newlines part of the docs). This leads to the following:

>>> with open('test', 'w') as f:
        f.write('\r\r\r\r')

>>> with open('test') as f:
        print(sum(1 for _ in f))
4

whilst wc -l gives:

$ wc -l test
0 test

The \r character is used as a newline in, e.g., old Macintosh systems.
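To see which terminators a file actually contains, something like the following rough sketch could help. It tallies \r\n, lone \r and lone \n in raw bytes; the function name count_terminators and the sample data are made up for illustration, and for a real file you would pass in the bytes read with open(path, 'rb'), assuming the file fits in memory.

```python
import re
from collections import Counter

def count_terminators(data: bytes) -> Counter:
    """Tally each kind of line terminator found in raw bytes."""
    # The pattern tries CRLF first so that a CRLF pair is not
    # counted as a separate CR and LF.
    return Counter(m.group() for m in re.finditer(rb"\r\n|\r|\n", data))

# Hypothetical sample mixing all three styles.
sample = b"a\r\nb\rc\nd\r"
counts = count_terminators(sample)
print(counts[b"\r\n"], counts[b"\r"], counts[b"\n"])  # 1 2 1
```

A non-zero count for lone \r would account for lines that Python sees and wc -l does not.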

If you would like to split only on \n characters, use the newline keyword argument to open:

>>> with open('test', 'w') as f:
        f.write('\r\r\r\r')

>>> with open('test', newline='\n') as f:
        print(sum(1 for _ in f))
1

The 1 comes from the fact you've already mentioned. There is not a single \n character in the file, so wc -l returns 0, and Python counts the unterminated content as a single line.
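If you want a number from Python that should agree with wc -l, one option is to open the file in binary mode and count the \n bytes directly. This is only a sketch: the helper names count_newlines and count_universal, the chunk size, and the temporary demo file are my own choices for illustration.

```python
import os
import tempfile

def count_newlines(path, chunk_size=1 << 20):
    """Count LF bytes in a file, reading it in binary chunks."""
    total = 0
    with open(path, "rb") as f:  # binary mode: no newline translation
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total

def count_universal(path):
    """Count lines the way the code in the question does (text mode)."""
    with open(path) as f:
        return sum(1 for _ in f)

# Demo on a small throwaway file with mixed line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\r\nb\rc\nd")
demo_path = tmp.name

wc_style = count_newlines(demo_path)
python_style = count_universal(demo_path)
os.remove(demo_path)

print(wc_style, python_style)  # 2 4
```

The difference between the two numbers is the count of lone \r terminators plus one if the file does not end with a newline.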

Try taking a part of the file and repeating the line count. For example:

# take first 10000 lines
head -n 10000 file.csv > file_head.csv

# take last 10000 lines
tail -n 10000 file.csv > file_tail.csv

# take first 100MB
dd if=file.csv of=file_100M.csv bs=1M count=100
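The same narrowing-down can be done from Python. The sketch below walks a binary stream, which splits on \n only, and reports the lines that still contain a \r after the trailing \r\n or \n is dropped. The name lines_with_bare_cr, the limit parameter and the sample data are hypothetical; for the real file you would pass open('file.csv', 'rb') instead of the BytesIO object.

```python
import io

def lines_with_bare_cr(stream, limit=10):
    """Yield (line_number, raw_line) for LF-delimited lines holding a lone CR."""
    found = 0
    for number, raw in enumerate(stream, start=1):
        # Drop a trailing LF or CRLF, then look for any CR left over.
        body = raw.rstrip(b"\n")
        if body.endswith(b"\r"):
            body = body[:-1]
        if b"\r" in body:
            yield number, raw
            found += 1
            if found >= limit:
                return

# Hypothetical CSV content with one stray CR inside a field.
stream = io.BytesIO(b"id,note\r\n1,ok\r\n2,bad\rnote\r\n3,ok\r\n")
hits = list(lines_with_bare_cr(stream))
print(hits)  # [(3, b'2,bad\rnote\r\n')]
```

The reported line numbers are the ones wc -l would use, so they can be checked directly with head and tail.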
