
Python twice as fast as PyPy when reading/iterating a file

Usually, for regular simple Python code, using PyPy will be faster. But how come, when I am reading a file and just splitting strings and doing very simple logic, it is much slower than regular Python?

Take the following code for reference.

First, let's create a fake file using the code below:

NUM_ROWS = 10000000
FILENAME = "testing.txt"

def create_file():
    data = []
    for x in range(NUM_ROWS):
        data.append("AA BB CC DD EE FF GG HH II JJ KK LL MM NN OO\n")

    with open(FILENAME, "w") as f:
        for d in data:
            f.write(d)  # the with-block closes the file automatically

create_file()

This just creates a file called testing.txt.

Then we have sample.py:

import datetime
FILENAME = "testing.txt"

start = datetime.datetime.now()
with open(FILENAME) as f:
    for i, line in enumerate(f):
        data = line.split(" ")
        if data[0] != "AA":
            print(i, line)
print(datetime.datetime.now() - start)

Running C:\pypy3.6-v7.3.1-win32\pypy3.exe sample.py takes 42 secs, while python sample.py takes only 18 secs.

I am using Python 3.7 on a Windows 10 machine. Is there a way to speed up a simple script like the one above using PyPy? Am I using it wrong?

----------Update:

Apparently it's the reading/iterating through the file that is slow in PyPy.

With sample.py as:

import datetime
FILENAME = "testing.txt"

start = datetime.datetime.now()
with open(FILENAME, "r") as f:
    for line in f:
        pass
print(datetime.datetime.now() - start)

I tried with the latest pypy3 build as of 2020-08-18. Here are my findings for the simple code above (just plain iteration, line by line, over a file with 10 million lines). On Windows, regular Python (3.8) takes 2.3 secs to execute the above code, while pypy3 takes an awfully slow 30 secs. On Ubuntu, regular Python takes 1.2 secs and pypy3 takes 3.4 secs. The Linux version is definitely more acceptable, but the Windows version definitely needs some work.

Is there a way to speed up reading/iterating the file with PyPy on Windows?

The question was updated, so I am updating my answer:

As with every optimisation, you need to profile each part. Of course, you should focus on the statements inside the loop.
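For example, a quick way to see where the time goes is the standard-library cProfile module. This is just a minimal sketch wrapping the code from the question, not a finished benchmark:

import cProfile

FILENAME = "testing.txt"

def scan():
    with open(FILENAME) as f:
        for i, line in enumerate(f):
            data = line.split(" ")
            if data[0] != "AA":
                print(i, line)

# prints per-function call counts and cumulative times, so you can see
# whether the time is spent in the iteration, in split() or in the comparison
cProfile.run("scan()", sort="cumulative")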

My solution (without profiling) that passes the same tests would be:

import datetime
FILENAME = "testing.txt"

start = datetime.datetime.now()
with open(FILENAME) as f:
    i = 0
    data = f.readline()
    while data:
        if not data.startswith('AA '):
            print(i, data)
        i += 1
        data = f.readline()

print(datetime.datetime.now() - start)


However, that was not the solution @user1179317 expected; in the updated question, @user1179317 is now aware that reading/iterating the file itself is the issue.

You can try to read the data in chunks, using yield:

def read_in_chunks(file_object, chunk_size=1024):
    """generator to read file in chunks"""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('big_file.data') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
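Note that with raw chunks a line can be cut in the middle of a chunk; if you still need line-oriented processing, you have to carry the partial line over to the next chunk. A rough sketch of that, reusing read_in_chunks and the same process_data placeholder as above:

with open('big_file.data') as f:
    leftover = ''
    for piece in read_in_chunks(f):
        piece = leftover + piece
        lines = piece.split('\n')
        leftover = lines.pop()      # the last element may be an incomplete line
        for line in lines:
            process_data(line)
    if leftover:
        process_data(leftover)      # whatever remains after the final chunk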

Another option would be to use iter and a helper function:

f = open('big_file.dat')

def read_chunk(chunk_size=1024):
    return f.read(chunk_size)

# iter(callable, sentinel) keeps calling read_chunk until it returns '' (end of file)
for piece in iter(read_chunk, ''):
    process_data(piece)

f.close()
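The helper function is not strictly necessary; the same iter(callable, sentinel) pattern can be written with functools.partial (just an equivalent variant, not part of the original answer):

import functools

with open('big_file.dat') as f:
    # functools.partial(f.read, 1024) is a zero-argument callable,
    # so iter() calls it repeatedly until it returns '' (end of file)
    for piece in iter(functools.partial(f.read, 1024), ''):
        process_data(piece)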

Again, this is not a ready-to-use copy-and-paste answer; you need to profile and test, since the results will depend on file size, available RAM, maybe the block size of the hard disk, maybe IP packet sizes, etc.

Since that operation is I/O-bound, a multi-threaded approach might be good: you might try to read the next chunk of the file in a separate thread.
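A minimal sketch of that idea with the standard threading and queue modules; the chunk size and queue depth here are arbitrary and would need profiling, and process_data is the same placeholder as above:

import threading
import queue

def reader(f, q, chunk_size=1024 * 1024):
    # read chunks in a background thread and hand them to the consumer
    while True:
        data = f.read(chunk_size)
        q.put(data)
        if not data:                    # empty string marks end of file
            break

with open('big_file.data') as f:
    q = queue.Queue(maxsize=2)          # keep at most two chunks in flight
    t = threading.Thread(target=reader, args=(f, q))
    t.start()
    while True:
        piece = q.get()
        if not piece:                   # end-of-file sentinel from the reader
            break
        process_data(piece)
    t.join()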

So, you need to profile with different chunk sizes.
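For instance, a rough way to compare chunk sizes, in the same timing style as the question (just a sketch; the best size will differ per machine and file system):

import datetime

FILENAME = "testing.txt"

for chunk_size in (4 * 1024, 64 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    start = datetime.datetime.now()
    with open(FILENAME) as f:
        while f.read(chunk_size):
            pass
    print(chunk_size, datetime.datetime.now() - start)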
