
Getting number of lines in a text file without readlines

Let's say I have a program that uses a .txt file to store the data it needs to operate. Because there's a very large amount of data (just go with it) in the text file, I want to use a generator rather than an iterator to go through the data so that my program takes up as little space as possible. Let's just say (I know this isn't secure) that it's a list of usernames. So my code would look like this (using Python 3.3):

for x in range(LenOfFile):
    id = file.readline()  # read one line at a time instead of the whole file
    if username == id.strip():
        validusername = True
        # ask for a password
if validusername and validpassword:
    pass
else:
    print("Invalid Username")

Assume that validpassword is set to True or False where I ask for a password. My question is: since I don't want to take up all of the RAM, I don't want to use readlines() to get the whole thing, and with the code here I only use a very small amount of RAM at any given time. However, I am not sure how I would get the number of lines in the file (assume I cannot find the number of lines and add to it as new users arrive). Is there a way Python can do this without reading the entire file and storing it at once? I already tried len(), which apparently doesn't work on text files but was worth a try. The one way I have thought of to do this is not too great: it involves using readlines one line at a time over a range so big the text file must be smaller than it, and then continuing when I get an error. I would prefer not to use this way, so any suggestions would be appreciated.

You can just iterate over the file handle directly, which will then iterate over it line-by-line:

for line in file:
    if username == line.strip():
       validusername = True
       break

Other than that, you can't really tell how many lines a file has without looking at it completely. You do know how big a file is, and you could make some assumptions on the character count for example (UTF-8 ruins that though :P); but you don't know how long each line is without seeing it, so you don't know where the line breaks are and as such can't tell how many lines there are in total. You still would have to look at every character one-by-one to see if a new line begins or not.
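In other words (a tiny sketch of my own to make this concrete; the file name is hypothetical): the size in bytes is cheap to get, but the newline count inside those bytes is not.

from os import stat

print(stat("usernames.txt").st_size)  # total bytes: known instantly, without reading the file
# ...but how many of those bytes are '\n' is unknown until you scan every one of them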

So instead of that, we just iterate over the file and stop whenever we read a whole line (that is when the loop body executes), and then we continue looking from that position in the file for the next line break, and so on.
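If you do need a total line count up front, the same streaming idea can do the counting. A minimal sketch (my addition, not part of the original answer); only one line is held in memory at a time:

def count_lines(path):
    # sum() consumes the line generator one line at a time,
    # so the whole file is never held in memory at once.
    with open(path) as f:
        return sum(1 for _ in f)

print(count_lines("usernames.txt"))  # hypothetical file name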

Yes, the good news is you can find the number of lines in a text file without readlines, for line in file, etc. More specifically, in Python you can use byte functions, random access, parallel operation, and regular expressions instead of slow sequential text line processing. A parallel line counter for a text file such as a CSV is particularly suitable for SSD devices, which have fast random access, when combined with many processor cores. I used a 16-core system with an SSD to store the Higgs Boson dataset as a standard file, which you can go download to test on. Even more specifically, here are fragments from working code to get you started. You are welcome to freely copy and use it, but if you do then please cite my work, thank you:

import re
from argparse import ArgumentParser
from multiprocessing import Pool
from itertools import repeat
from os import stat

unitTest = 0
fileName = None
balanceFactor = 2
numProcesses = 1

if __name__ == '__main__':
    argparser = ArgumentParser(description='Parallel text file like CSV file line counter is particularly suitable for SSD which have fast random access')
    argparser.add_argument('--unitTest', default=unitTest, type=int, required=False, help='0:False  1:True.')
    argparser.add_argument('--fileName', default=fileName, required=False, help='')
    argparser.add_argument('--balanceFactor', default=balanceFactor, type=int, required=False, help='integer: 1 or 2 or 3 are typical')
    argparser.add_argument('--numProcesses', default=numProcesses, type=int, required=False, help='integer: 1 or more. Best when matched to number of physical CPU cores.')
    cmd = vars(argparser.parse_args())
    unitTest=cmd['unitTest']
    fileName=cmd['fileName']
    balanceFactor=cmd['balanceFactor']
    numProcesses=cmd['numProcesses']

    # Do arithmetic to divide the file into (startByte, endByte) strips among workers (two lists of int).
    # The best number of strips to use is 2x to 3x the number of workers, for workload balancing.
    # import numpy as np  # long heavy import, but I love numpy syntax

    def PartitionDataToWorkers(workers, items, balanceFactor=2):
        strips = balanceFactor * workers
        step = int(round(float(items)/strips))
        startPos = list(range(1, items+1, step))
        if len(startPos) > strips:
            startPos = startPos[:-1]
        endPos = [x + step - 1 for x in startPos]
        endPos[-1] = items
        return startPos, endPos

    def ReadFileSegment(startByte, endByte, fileName, searchChar=b'\n'):  # counts occurrences of searchChar in the given byte range
        with open(fileName, 'rb') as f:  # binary mode, so seek/read work in bytes rather than decoded characters
            f.seek(startByte - 1)  # seek() sets an absolute position; byte positions here are 1-based, so seek(5) points at the 6th byte
            data = f.read(endByte - startByte + 1)
            cnt = len(re.findall(searchChar, data))  # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times
        return cnt

    if 0 == unitTest:
        # Run app, not unit tests.
        fileBytes = stat(fileName).st_size  # Read quickly from OS how many bytes are in a text file
        startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
        p = Pool(numProcesses)
        partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName)))  # startByte and endByte are already lists; repeat(fileName) hands the same file name to every worker
        globalSum = sum(partialSum)
        print(globalSum)
    else: 
        print("Running unit tests") # Bash commands like: head --bytes 96 beer.csv  are how I found the correct values.
        fileName='beer.csv' # byte 98 is a newline
        assert(8==ReadFileSegment(1, 288, fileName))
        assert(1==ReadFileSegment(1, 100, fileName))
        assert(0==ReadFileSegment(1,  97, fileName))
        assert(1==ReadFileSegment(97, 98, fileName))
        assert(1==ReadFileSegment(98, 99, fileName))
        assert(0==ReadFileSegment(99, 99, fileName))
        assert(1==ReadFileSegment(98, 98, fileName))
        assert(0==ReadFileSegment(97, 97, fileName))
        print("OK")

The bash wc program is slightly faster, but you wanted pure python, and so did I. Below are some performance testing results. That said, if you change some of this code to use cython or something, you might get even more speed.

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000

real    0m2.257s
user    0m12.088s
sys 0m20.512s

HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv

real    0m1.820s
user    0m0.364s
sys 0m1.456s


HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000

real    0m2.256s
user    0m10.696s
sys 0m19.952s

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000

real    0m17.380s
user    0m11.124s
sys 0m6.272s

Conclusion: The speed is good for a pure python program compared to a C program. However, it's not good enough that you'd use the pure python program over the C program.

I wondered if compiling the regex just one time and passing it to all workers would improve speed. Answer: Regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers dominates.
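If my reading of how CPython pickles compiled patterns is right, that result makes sense: a compiled pattern pickles as just its pattern string and flags, and unpickling re-compiles it, so each worker process pays the compile cost anyway. A small sketch of that round trip (my own illustration, not from the original answer):

import pickle
import re

pat = re.compile(rb'\n')
clone = pickle.loads(pickle.dumps(pat))  # unpickling rebuilds the pattern from (pattern, flags)
print(clone.pattern)  # b'\n'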

One more thing. Does parallel CSV file reading even help, I wondered? Is the disk the bottleneck, or is it the CPU? Oh yes, yes it does. Parallel file reading works quite well. Well there you go!

Data science is a typical use case for pure python. I like to use python (jupyter) notebooks, and I like to keep all code in the notebook rather than use bash scripts when possible. Finding the number of examples in a dataset is a common need in machine learning, where you generally need to partition a dataset into training, dev, and testing examples.
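As a small illustration of that last point (my addition; the 80/10/10 ratios and the helper name are made up for the example), once you know the line count the split sizes are simple arithmetic:

def split_sizes(num_lines, train=0.8, dev=0.1):
    # Hypothetical helper: derive train/dev/test sizes from a total line count.
    n_train = int(num_lines * train)
    n_dev = int(num_lines * dev)
    n_test = num_lines - n_train - n_dev  # test takes the remainder
    return n_train, n_dev, n_test

print(split_sizes(11000000))  # (8800000, 1100000, 1100000) for the HIGGS file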

Higgs Boson dataset: https://archive.ics.uci.edu/ml/datasets/HIGGS

If you want the number of lines in a file so badly, why don't you use len:

with open("filename") as f:
    num = len(f.readlines())
