Open and read a space-delimited txt file

I have a space-separated txt file like the following:

2004          Temperature for KATHMANDU AIRPORT       
        Tmax  Tmin
     1  18.8   2.4 
     2  19.0   1.1 
     3  18.3   1.7 
     4  18.3   1.0 
     5  17.8   1.3 

I want to calculate the mean of Tmax and Tmin separately, but I am having a hard time reading the txt file. Following an approach I found in another post, I tried:

import re
list_b = []
list_d = []

with open('TA103019.95.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r"[\d.\d+']+", line)
        list_b.append(float(list_line[1])) #appends second column
        list_d.append(float(list_line[3])) #appends fourth column

print list_b
print list_d

But it gives me this error: IndexError: list index out of range. What is wrong here?

A simple way to solve this is to use the split() function. Of course, you need to skip the first two lines:

import io

with io.open("path/to/file.txt", mode="r", encoding="utf-8") as f:
    next(f)  # skip the title line
    next(f)  # skip the column-header line
    for line in f:
        print(line.split())

You get:

['1', '18.8', '2.4']
['2', '19.0', '1.1']
['3', '18.3', '1.7']
['4', '18.3', '1.0']
['5', '17.8', '1.3']

Quoting the documentation:

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
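To make the quoted behaviour concrete, here is a small sketch comparing split() with no argument against split(" ") on one data line copied from the question. The variable names are only illustrative.

```python
# One data line from the question, with its leading and trailing spaces kept.
line = "     1  18.8   2.4 "

# No argument: runs of whitespace act as a single separator, and leading or
# trailing whitespace produces no empty strings.
tokens_default = line.split()

# Explicit single space: every space is a boundary, so empty strings appear
# wherever two spaces are adjacent (and at the padded ends).
tokens_space = line.split(" ")

print(tokens_default)
print(tokens_space)
```

With the explicit separator you would have to filter out the empty strings yourself, which is why the no-argument form suits this kind of padded file.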

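re.findall returns a list of all matches of your regular expression in the string. The expression you defined, [\d.\d+']+, is a single character class, so it matches any run of digits, dots, plus signs and apostrophes rather than decimal numbers. On the first line it finds only one match (2004) and on the second line none, which leads to the error when you try to access list_line[1].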

  • the expression you want for that file is r"\d+\.\d+", which matches any decimal number with at least one digit before the decimal point, the decimal point itself, and at least one digit after it
  • even this expression will not match anything in the first two lines, so you will want to check for empty lists
  • the result knows nothing about columns, only about matches of the pattern, and there will be two matches for each data line, so you will want indices 0 and 1

So:

import re
list_b = []
list_d = []

with open('TA103019.95.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r'\d+\.\d+', line)
        if len(list_line) == 2:
            list_b.append(float(list_line[0]))  # first decimal match: Tmax
            list_d.append(float(list_line[1]))  # second decimal match: Tmin

print(list_b)
print(list_d)

import re
list_b = []
list_d = []

with open('TA103019.95.txt', 'r') as f:
    for line in f:
        # regex is corrected to match the decimal values only
        list_line = re.findall(r"\d+\.\d+", line)

        # skip lines where fewer than two decimal values are found (the headers)
        if len(list_line) < 2:
            continue

        # indexes are corrected below
        list_b.append(float(list_line[0]))  # first decimal match: Tmax
        list_d.append(float(list_line[1]))  # second decimal match: Tmin

print(list_b)
print(list_d)

I have added my answer with some comments in the code itself.

You were getting the IndexError (list index out of range) because list_line had only a single element (2004, from the first line of the file) while you were trying to access indices 1 and 3 of list_line.
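To see this for yourself, the sketch below runs both the pattern from the question and the corrected one over three lines copied from the sample file. The names sample_lines, original and corrected are just for illustration.

```python
import re

# Three lines copied from the sample file in the question.
sample_lines = [
    "2004          Temperature for KATHMANDU AIRPORT       ",
    "        Tmax  Tmin",
    "     1  18.8   2.4 ",
]

original = re.compile(r"[\d.\d+']+")  # pattern from the question
corrected = re.compile(r"\d+\.\d+")   # decimal-number pattern from the answers

original_matches = [original.findall(line) for line in sample_lines]
corrected_matches = [corrected.findall(line) for line in sample_lines]

# Print each line next to what each pattern finds in it.
for line, before, after in zip(sample_lines, original_matches, corrected_matches):
    print(repr(line.strip()), before, after)
```

On this sample the original pattern never produces four matches, so list_line[3] would fail even on a data line. The corrected pattern gives either no matches (header lines) or exactly two (data lines).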

Full Solution

def readit(file_name, start_line=2):
    # start_line is the 0-based index of the first data line
    # (2 means the third line, which skips the two header lines)
    with open(file_name, 'r') as f:
        data = f.read().split('\n')
    for line in data[start_line:]:
        row = line.split()  # split on runs of whitespace, no empty strings
        if len(row) < 3:
            continue  # skip blank or incomplete lines, e.g. after a trailing newline
        yield int(row[0]), float(row[1]), float(row[2])


iterator = readit('TA103019.95.txt')


index, tmax, tmin = zip(*iterator)


mean_Tmax = sum(tmax)/len(tmax)
mean_Tmin = sum(tmin)/len(tmin)
print('Mean Tmax: ',mean_Tmax)
print('Mean Tmin: ',mean_Tmin)

>>> ('Mean Tmax: ', 18.439999999999998)
>>> ('Mean Tmin: ', 1.5)

Thanks to Dan D. for a more elegant solution.

Simplify your life and avoid 're' for this problem.

Perhaps you are reading the header row mistakenly? If the format of the file is fixed, I usually "burn" the header row with a line read before starting the loop like:

with open(file_name, 'r') as f:
    f.readline()  # burn the title row
    f.readline()  # burn the column-header row (this file has two header lines)
    for line in f:
        tokens = line.split()   # tokenize the row on runs of whitespace

Then you have a list of tokens, which will be strings that you'll need to convert to int or float or whatever and go from there!
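As a rough sketch of the "go from there" part, the snippet below converts the tokens and computes the two means. It assumes the layout shown in the question (two header lines, then day, Tmax, Tmin) and uses an in-memory io.StringIO as a stand-in for the real file so that it runs on its own. With a real file you would use open(...) instead.

```python
import io

# Stand-in for the real file: the sample from the question, held in memory.
f = io.StringIO(
    "2004          Temperature for KATHMANDU AIRPORT       \n"
    "        Tmax  Tmin\n"
    "     1  18.8   2.4 \n"
    "     2  19.0   1.1 \n"
    "     3  18.3   1.7 \n"
    "     4  18.3   1.0 \n"
    "     5  17.8   1.3 \n"
)

tmax_values = []
tmin_values = []

next(f)  # burn the title row
next(f)  # burn the column-header row
for line in f:
    tokens = line.split()
    if len(tokens) != 3:
        continue  # ignore blank or malformed lines
    tmax_values.append(float(tokens[1]))  # second token: Tmax
    tmin_values.append(float(tokens[2]))  # third token: Tmin

mean_tmax = sum(tmax_values) / len(tmax_values)
mean_tmin = sum(tmin_values) / len(tmin_values)
print("Mean Tmax:", round(mean_tmax, 2))
print("Mean Tmin:", round(mean_tmin, 2))
```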

Put in a couple print statements to see what you are picking up...
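One way to do that is to print repr(line), which shows tabs as \t and makes the padding visible. In the sketch below, tab_line is a made-up tab-delimited variant for comparison and describe() is a hypothetical helper. Neither comes from the question.

```python
# A data line as it appears in the question, plus a hypothetical
# tab-delimited variant of the same row for comparison.
space_line = "     1  18.8   2.4 \n"
tab_line = "1\t18.8\t2.4\n"


def describe(line):
    """Return which separator characters (tab and/or space) the line contains."""
    kinds = []
    if "\t" in line:
        kinds.append("tab")
    if " " in line:
        kinds.append("space")
    return kinds


for line in (space_line, tab_line):
    # repr() makes tabs, spaces and the trailing newline visible.
    print(repr(line), describe(line))
```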

Is it possible that your file is tab delimited?

For Tab Delimited:

with open('TA103019.95.txt', 'r') as f:
    for idx, line in enumerate(f):
        if idx > 1:  # skip the two header lines
            cols = line.split('\t')  # for space delimited, use line.split() instead
            tmax = float(cols[1])
            tmin = float(cols[2])
            # calc mean
            mean = (tmax + tmin) / 2
            # not sure what you want to do with the result
