
Understanding file iteration in Python

I have been trying to do some text manipulation in Python and am running into a lot of issues, mainly due to a fundamental misunderstanding of how file manipulation works in Python, so I am hoping to clear that up.

So let's say I'm iterating through a text file called "my.txt" and it has the following contents:

3 10 7 8     
2 9 8 3  
4 1 4 2

The code I'm using to iterate through the file is:

file = open ("my.txt", 'r')
for line in file:
    print line

I copied and pasted the above code from a tutorial. I know what it does, but I don't know why it works, and it's bothering me. I am trying to understand exactly what the variable "line" represents in the file. Is it a data type (a string?) or something else? My instinct tells me that each line represents a string which could then be manipulated (which is what I want), but I also understand that strings are immutable in Python.

What role does memory play in all this? If my file is too big to fit into memory, will it still work? Will line[3] allow me to access the fourth element in each line? If I only want to work on the second line, can I do:

if line == 2: 

within the for loop?

It might be worth noting that I am pretty new to Python and am coming from a C/C++ background (not used to immutable strings). I know I squeezed quite a few questions into one, but any clarification on the general topic would really be helpful :)

line is a line of text, represented as a string. Strings are immutable, but that's not an issue for manipulating them: all variables in Python are references, and assigning to a variable points the reference to a new object. (In C++, you can't change where a reference points.) Iterating over a file iterates over the lines, so on each iteration, line refers to a new string representing the next line of the input file.
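As a small illustration of the point about immutability and rebinding, here is a sketch (the sample string and the names original and error_message are made up for the example). It shows that in-place modification fails, while building a new string and reassigning the name works fine:

```python
line = "3 10 7 8\n"
original = line              # a second name for the same string object

# Strings cannot be changed in place; item assignment raises TypeError.
try:
    line[0] = "9"
except TypeError as exc:
    error_message = str(exc)

# "Manipulating" a string means building a new one and rebinding the name.
line = "9" + line[1:]

print(repr(original))        # the old object is untouched
print(repr(line))            # 'line' now refers to a different string
```

This is why immutability rarely gets in the way in practice: you never edit the string, you just point the variable at a new one.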

If you're familiar with range-based for loops or other languages' for-each constructs, that's how Python's for works. The loop variable is not a counter; you can't do

if line == 2:

because line isn't the index of the line; it's the line itself. You could do

for i, line in enumerate(f):  # f is the open file object
    if i == 1:  # enumerate counts from 0, so 1 is the second line
        do_stuff_with(line)
        break  # No need to load the rest of the file

Note that file is the name of a builtin (in Python 2), so it's a bad idea to use that name for your own variables.

In Python, you can iterate straight over a file. The best way of doing this is with a with statement, as in:

with open("myfile.txt") as f:
    for line in f:
        pass  # do stuff to each line in the file

The lines are strings representing each line (separated by newlines) in the file. If you only want to operate on the second line, you could do something like this:

with open("myfile.txt") as f:
    list_of_file = list(f)
    second_line = list_of_file[1]  # indices start at 0, so [1] is the second line

If you then want to access part of the second line you can split it by spaces into another list as so:

second_number_in_second_line = second_line.split()[1]
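To make the difference between indexing a line and splitting it concrete, here is a minimal sketch using the first line of the sample file as a literal string (the variable names are only illustrative). Indexing the string gives single characters, whereas split() gives whitespace-separated fields, which are still strings until converted:

```python
line = "3 10 7 8\n"

# Indexing a string yields one character, not one number.
fourth_char = line[3]                  # '0', the second digit of "10"

# split() with no arguments splits on runs of whitespace and drops the newline.
fields = line.split()                  # ['3', '10', '7', '8']
fourth_field = fields[3]               # '8'

# The fields are still strings; convert them if you need numbers.
numbers = [int(x) for x in fields]     # [3, 10, 7, 8]

print(fourth_char)
print(fourth_field)
print(numbers)
```

So line[3] is probably not what the question is after; line.split()[3] is.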

With regard to memory, iterating through the file directly does not read it all into memory; turning it into a list does. If you want to access individual lines without doing so, use itertools.islice.
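A sketch of the itertools.islice approach might look like the following. It writes the sample data to a temporary file first so that it is self-contained; the path and the variable names are assumptions for the example, not part of the original question:

```python
import itertools
import os
import tempfile

# Create a small sample file so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), "my.txt")
with open(path, "w") as f:
    f.write("3 10 7 8\n2 9 8 3\n4 1 4 2\n")

with open(path) as f:
    # islice(f, 1, 2) skips one line and stops before index 2,
    # so it yields only the second line. The default of None
    # covers files with fewer than two lines.
    second_line = next(itertools.islice(f, 1, 2), None)

print(repr(second_line))
```

Only the lines up to the one requested are read, and none of them are kept in a list.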

In each iteration, the line variable holds the next line read from the file (with its trailing newline still attached). Ignoring that newline, you'll have:

"3 10 7 8" in first iteration
"2 9 8 3" in second iteration
etc.

To get the numbers separately, use the split method.

So comparing line with 2 doesn't make sense. If you want to identify line numbers, you can try:

lineNumber = 0
for line in file:
  print line
  if lineNumber == 1:  # counting starts at 0, so 1 is the second line
    print "that was the second line!"
  lineNumber += 1

As suggested in the comment, you can simplify this by using enumerate:

for lineNumber, line in enumerate(file):
  print line
  if lineNumber == 1:  # enumerate counts from 0 by default
    print "that was the second line!"

Suppose you have your same file:

3 10 7 8\n     
2 9 8 3\n  
4 1 4 2\n

There are many file methods that operate on a file object.

In Python, you can read a file character by character, C style:

with open('/tmp/test.txt', 'r') as fin:     # fin is a 'file object' 
    while True:
        ch=fin.read(1)
        if not ch:
            break
        print ch,                           # trailing comma suppresses the newline

You can read the whole file as a single string:

with open('/tmp/test.txt', 'r') as fin:
    data=fin.read()
    print data    

As enumerated lines:

with open('/tmp/test.txt', 'r') as fin:
    for i, line in enumerate(fin):
        print i, line    

As a list of strings:

with open('/tmp/test.txt', 'r') as fin:
    data=fin.readlines()  

The idiom of looping over a file object:

for line in fin:                 # 'fin' is a file object result of open
    print line

yields the same lines as the following, except that readlines() reads the whole file into a list first:

for line in fin.readlines():
    print line

and similar to:

for line in 'line 1\nline 2\nline 3'.splitlines():
    print line

Once you get used to the Python style loops (or Perl, or Obj C, or Java range style loops) that loop over the elements of something -- you use them without thinking about it much.

If you want the index of each item -- use enumerate.

You can iterate over a file of any size, with the code you have shown, and it should not consume any significant amount of memory beyond the size of the longest single line.
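As a rough sketch of what that looks like in practice, the loop below computes a running total over the sample data while holding only one line at a time. The temporary file and the names total and line_count are assumptions made for the example; the same pattern applies to a file far larger than memory:

```python
import os
import tempfile

# Create a small sample file so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), "my.txt")
with open(path, "w") as f:
    f.write("3 10 7 8\n2 9 8 3\n4 1 4 2\n")

total = 0
line_count = 0
with open(path) as f:
    for line in f:                       # only the current line is kept around
        total += sum(int(x) for x in line.split())
        line_count += 1

print(line_count)
print(total)
```

Nothing here accumulates the lines themselves, so memory use does not grow with the number of lines.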

As for how it works, under the hood, you could dive into the source code for Python itself to learn the gory details. At a higher level just consider that the implementor of file objects, in Python, chose to implement line-by-line iteration as a feature of their class.

Many of the collection data types and I/O interfaces in Python implement some form of iteration, which is why the for construct is the most common type of loop in Python. You can iterate over lists, tuples, and sets (by item), strings (by character), and dictionaries (by key). Many classes, including those in the standard library as well as those from third parties, implement the "iterator protocol" to facilitate such usage.
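To give a feel for what the iterator protocol means, here is an approximate hand-written equivalent of a for loop over a file. It is only a sketch: io.StringIO stands in for a real open file so the example is self-contained, and the names iterator and lines are illustrative:

```python
import io

# An in-memory text stream that behaves like an open text file.
fin = io.StringIO(u"3 10 7 8\n2 9 8 3\n4 1 4 2\n")

lines = []
iterator = iter(fin)              # 'for' first asks the object for an iterator
while True:
    try:
        line = next(iterator)     # then requests one item per pass
    except StopIteration:         # raised when there are no more lines
        break
    lines.append(line)            # the body of the loop

print(lines)
```

Any object that supports iter() and next() in this way can be used in a for loop, which is how file objects provide line-by-line iteration.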

