Python: Why use open(filename) twice in this code?

Here's a piece of code from Machine Learning in Action, Chapter 2. The goal is to convert a file into a matrix. What I don't understand is why I should call fr = open(filename) twice.

When I delete the second open(filename), the code just returns a blank matrix, and I can't figure out why.

Thanks a lot for taking time!

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())        
    returnMat = zeros((numberOfLines,3))       
    classLabelVector = []                       
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector

It reads the file twice:

  1. Firstly it reads all lines, then counts the lines and initializes the matrix:

     fr = open(filename)
     numberOfLines = len(fr.readlines())
     returnMat = zeros((numberOfLines,3))

  2. Secondly it reads the file again to fill the matrix:

     fr = open(filename)
     index = 0
     for line in fr.readlines():
         line = line.strip()
         ...

It has to open the file a second time so that reading starts from the beginning again.

This is not efficient code. Since fr.readlines() reads the whole file, there is no need to read it again; instead, the result (a list of lines) should be stored in a variable and reused when filling the matrix.
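A minimal sketch of the "read once, reuse the list" idea. To keep it self-contained, plain nested lists stand in for the NumPy matrix, and the function name file2rows and the sample data are made up for illustration:

```python
import os
import tempfile


def file2rows(filename):
    """Read the file once and reuse the stored list of lines."""
    with open(filename) as fr:
        lines = fr.readlines()          # single read; keep the result
    number_of_lines = len(lines)        # the count comes from the stored list
    # Assumption: nested lists replace zeros((numberOfLines, 3)) here.
    features = [[0.0] * 3 for _ in range(number_of_lines)]
    labels = []
    for index, line in enumerate(lines):
        fields = line.strip().split('\t')
        features[index] = [float(value) for value in fields[0:3]]
        labels.append(int(fields[-1]))
    return features, labels


# Hypothetical sample file: three features and an integer label per line.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("1.0\t2.0\t3.0\t1\n4.0\t5.0\t6.0\t2\n")
    sample_path = tmp.name

features, labels = file2rows(sample_path)
os.remove(sample_path)
```

The file is opened exactly once, and both the line count and the row data come from the same list.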

Also, close() should be called once you have finished with the file.

When you use the readlines function, it reads all the lines into memory and by the end of it the file pointer is at the very end of the file.

So if you call readlines again after having used it already, the file pointer is still at the end, so it reads from the end to the end and returns an empty list. The loop body never runs, hence the blank matrix.

They reopened the file so that the file pointer is back at the beginning. Another way of doing that is filevariable.seek(0), which moves the file pointer back to the start so that you can use readlines again.
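A small demonstration of the file pointer behaviour described here, using a throwaway temporary file (the file contents are made up for illustration):

```python
import os
import tempfile

# Hypothetical three-line sample file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("a\nb\nc\n")
    sample_path = tmp.name

with open(sample_path) as fr:
    first = fr.readlines()    # reads everything; the pointer is now at the end
    second = fr.readlines()   # nothing left to read, so this is an empty list
    fr.seek(0)                # move the pointer back to the start
    third = fr.readlines()    # the full contents again

os.remove(sample_path)
```

Here second is empty, while third matches first, which is why either reopening the file or calling seek(0) makes the second pass work.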

One thing to note is that readlines reads the whole file into memory. If you have a massive file, you should loop over it and read one line at a time, for example with readline or by iterating over the file object directly.
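As a rough sketch of line-by-line reading, the loop below iterates over the file object itself, which yields one line at a time instead of building the full list. The function name read_labels and the sample data are hypothetical:

```python
import os
import tempfile


def read_labels(filename):
    """Collect the last column of each line without loading the whole file."""
    labels = []
    with open(filename) as fr:
        for line in fr:                     # yields one line at a time
            fields = line.strip().split('\t')
            labels.append(int(fields[-1]))
    return labels


# Hypothetical sample file: three features and an integer label per line.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("1.0\t2.0\t3.0\t1\n4.0\t5.0\t6.0\t3\n")
    sample_path = tmp.name

labels = read_labels(sample_path)
os.remove(sample_path)
```

Because the number of lines is not known in advance with this approach, rows would have to be appended as they are read rather than written into a pre-sized matrix.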

It is now recommended to always use a context manager (the with statement) when working with files, since it closes the file automatically. Try the version below; it should be pretty close to what you are looking for.

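from numpy import zeros

def file2matrix(filename):
    # Read the file once; the with block closes it automatically.
    with open(filename, "r") as fr:
        lines = fr.readlines()
    # One row per line, three feature columns.
    returnMat = zeros((len(lines), 3))
    classLabelVector = []
    for index, line in enumerate(lines):
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        # The last column holds the integer class label.
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat,classLabelVector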


 