
Getting number of lines in a text file without readlines

Let's say I have a program that uses a .txt file to store the data it needs to operate. Because there is a very large amount of data (just go with it) in the text file, I want to use a generator rather than an iterator to go through the data so that my program uses as little memory as possible. Let's just say (I know this isn't secure) that it's a list of usernames. So my code would look like this (using Python 3.3).

for x in range(LenOfFile):
    id = file.readlines(x)
    if username == id:
        validusername = True
        #ask for a password
if validusername == True and validpassword == True:
    pass
else:
    print("Invalid Username")

Assume that validpassword is set to True or False where I ask for a password. My question is: since I don't want to take up all of the RAM, I don't want to use readlines() to get the whole thing, and with the code here I only take up a very small amount of RAM at any given time. However, I am not sure how I would get the number of lines in the file (assume I cannot find the number of lines and add to it as new users arrive). Is there a way Python can do this without reading the entire file and storing it at once? I already tried len(), which apparently doesn't work on text files but was worth a try. The one way I have thought of to do this is not great: it involves just using readlines one line at a time over a range so big the text file must be smaller, and then continuing when I get an error. I would prefer not to use that way, so any suggestions would be appreciated.

You can just iterate over the file handle directly, which goes through it line by line:

for line in file:
    if username == line.strip():
        validusername = True
        break

Other than that, you can't really tell how many lines a file has without looking at it completely. You do know how big the file is, and you could make some assumptions based on the character count, for example (UTF-8 ruins that, though :P); but you don't know how long each line is without seeing it, so you don't know where the line breaks are, and as such can't tell how many lines there are in total. You would still have to look at every character one by one to see whether a new line begins or not.

So instead of that, we just iterate over the file and stop whenever we have read a whole line (that's when the loop body executes), and then we continue looking from that position in the file for the next line break, and so on.
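And if what you ultimately need is the total number of lines rather than a lookup, you can count them the same way while iterating, never holding more than one line in memory. A minimal sketch (the file name is just an example):

# Count lines one at a time; only the current line is ever held in memory.
with open("usernames.txt") as f:  # hypothetical file name
    num_lines = sum(1 for _ in f)
print(num_lines)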

Yes, the good news is that you can find the number of lines in a text file without readlines, for line in file, etc. More specifically, in Python you can use byte operations, random access, parallel processing, and regular expressions instead of slow, sequential, line-by-line text processing. A parallel line counter for text files such as CSVs is particularly well suited to SSD devices, which have fast random access, combined with many processor cores. I used a 16-core system with an SSD storing the Higgs Boson dataset as a plain file, which you can download to test on. Even more specifically, here are fragments of working code to get you started. You are welcome to copy and use them freely, but if you do, please cite my work. Thank you:

import re
from argparse import ArgumentParser
from multiprocessing import Pool
from itertools import repeat
from os import stat

unitTest = 0
fileName = None
balanceFactor = 2
numProcesses = 1

if __name__ == '__main__':
    argparser = ArgumentParser(description='Parallel line counter for text files such as CSVs; particularly suitable for SSDs, which have fast random access.')
    argparser.add_argument('--unitTest', default=unitTest, type=int, required=False, help='0:False  1:True.')
    argparser.add_argument('--fileName', default=fileName, required=False, help='')
    argparser.add_argument('--balanceFactor', default=balanceFactor, type=int, required=False, help='integer: 1 or 2 or 3 are typical')
    argparser.add_argument('--numProcesses', default=numProcesses, type=int, required=False, help='integer: 1 or more. Best when matched to number of physical CPU cores.')
    cmd = vars(argparser.parse_args())
    unitTest=cmd['unitTest']
    fileName=cmd['fileName']
    balanceFactor=cmd['balanceFactor']
    numProcesses=cmd['numProcesses']

    # Do arithmetic to divide partitions into (startByte, endByte) strips among workers (2 lists of int).
    # Best number of strips to use is 2x to 3x the number of workers, for workload balancing.
    # import numpy as np  # long, heavy import, but I love numpy syntax

    def PartitionDataToWorkers(workers, items, balanceFactor=2):
        strips = balanceFactor * workers
        step = int(round(float(items)/strips))
        startPos = list(range(1, items+1, step))
        if len(startPos) > strips:
            startPos = startPos[:-1]
        endPos = [x + step - 1 for x in startPos]
        endPos[-1] = items
        return startPos, endPos

    def ReadFileSegment(startByte, endByte, fileName, searchChar=b'\n'):  # counts occurrences of searchChar in the byte range [startByte, endByte] (1-based, inclusive)
        with open(fileName, 'rb') as f:  # binary mode, so byte offsets and counts are exact
            f.seek(startByte-1)  # seek() is 0-based, so seek(5) points at the 6th byte
            data = f.read(endByte - startByte + 1)
            cnt = len(re.findall(searchChar, data))  # findall with implicit compiling runs about as fast here as re.compile once + re.finditer many times
        return cnt

    if 0 == unitTest:
        # Run app, not unit tests.
        fileBytes = stat(fileName).st_size  # Read quickly from OS how many bytes are in a text file
        startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
        p = Pool(numProcesses)
        partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName)))  # startByte is already a list. fileName is repeated into a same-length sequence of duplicate values.
        globalSum = sum(partialSum)
        print(globalSum)
    else: 
        print("Running unit tests") # Bash commands like: head --bytes 96 beer.csv  are how I found the correct values.
        fileName='beer.csv' # byte 98 is a newline
        assert(8==ReadFileSegment(1, 288, fileName))
        assert(1==ReadFileSegment(1, 100, fileName))
        assert(0==ReadFileSegment(1,  97, fileName))
        assert(1==ReadFileSegment(97, 98, fileName))
        assert(1==ReadFileSegment(98, 99, fileName))
        assert(0==ReadFileSegment(99, 99, fileName))
        assert(1==ReadFileSegment(98, 98, fileName))
        assert(0==ReadFileSegment(97, 97, fileName))
        print("OK")

The bash wc program is slightly faster, but you wanted pure Python, and so did I. Below are some performance test results. That said, if you changed some of this code to use Cython or similar, you might get even more speed.

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000

real    0m2.257s
user    0m12.088s
sys 0m20.512s

HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv

real    0m1.820s
user    0m0.364s
sys 0m1.456s


HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000

real    0m2.256s
user    0m10.696s
sys 0m19.952s

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000

real    0m17.380s
user    0m11.124s
sys 0m6.272s

Conclusion: The speed is good for a pure Python program compared to a C program, but it's not good enough to prefer the pure Python program over the C program.

I wondered whether compiling the regex just one time and passing it to all workers would improve speed. Answer: regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers dominates.
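For reference, that variant looks roughly like the sketch below (hypothetical names, reusing the setup from the fragments above; not the exact code that was benchmarked). Compiled patterns can be pickled, so they can be passed through starmap like any other argument:

newlinePattern = re.compile(b'\n')  # compiled once in the parent process

def ReadFileSegmentCompiled(startByte, endByte, fileName, pattern):
    with open(fileName, 'rb') as f:
        f.seek(startByte - 1)
        data = f.read(endByte - startByte + 1)
    return len(pattern.findall(data))  # count newlines with the pre-compiled pattern

partialSum = p.starmap(ReadFileSegmentCompiled, zip(startByte, endByte, repeat(fileName), repeat(newlinePattern)))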

One more thing I wondered: does parallel CSV file reading even help? Is the disk the bottleneck, or the CPU? Oh yes, yes it does; parallel file reading works quite well. Well, there you go!

Data science is a typical use case for pure Python. I like to use Python (Jupyter) notebooks, and I like to keep all code in the notebook rather than use bash scripts when possible. Finding the number of examples in a dataset is a common need in machine learning, where you generally need to partition a dataset into training, dev, and test examples.
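As a small illustration of that use case, once you have the line count you can work out split sizes without touching the data again. A minimal sketch with a hypothetical 80/10/10 split:

numExamples = 11000000               # e.g. the count printed above for HIGGS.csv
numTrain = int(numExamples * 0.8)    # hypothetical 80/10/10 split ratios
numDev = int(numExamples * 0.1)
numTest = numExamples - numTrain - numDev
print(numTrain, numDev, numTest)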

Higgs Boson dataset: https://archive.ics.uci.edu/ml/datasets/HIGGS
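And if pure Python is not a hard requirement, wc itself can be invoked from a notebook cell instead of a separate bash script. A sketch assuming a POSIX system with wc on the PATH (and Python 3.7+ for capture_output):

import subprocess

def wcLineCount(fileName):
    # Run `wc -l <fileName>` and parse the leading count from its output.
    out = subprocess.run(['wc', '-l', fileName], capture_output=True, text=True, check=True)
    return int(out.stdout.split()[0])

print(wcLineCount('HIGGS.csv'))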

If you want the number of lines in a file so badly, why don't you just use len?

with open("filename") as f:
    num = len(f.readlines())
