Open and read a txt file that is space delimited

I have a space-separated txt file like the following:

2004          Temperature for KATHMANDU AIRPORT       
        Tmax  Tmin
     1  18.8   2.4 
     2  19.0   1.1 
     3  18.3   1.7 
     4  18.3   1.0 
     5  17.8   1.3 

I want to calculate the mean of Tmax and of Tmin separately, but I am having a hard time reading the txt file. Based on a linked example, I tried this:

import re
list_b = []
list_d = []

with open('TA103019.95.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r"[\d.\d+']+", line)
        list_b.append(float(list_line[1])) #appends second column
        list_d.append(float(list_line[3])) #appends fourth column

print list_b
print list_d

But it gives me an error: IndexError: list index out of range. What is wrong here?

A simple way to solve this is to use the split() function. Of course, you need to skip the first two lines:

import io

with io.open("path/to/file.txt", mode="r", encoding="utf-8") as f:
    next(f)  # skip the title line
    next(f)  # skip the column-header line
    for line in f:
        print(line.split())

You get:

['1', '18.8', '2.4']
['2', '19.0', '1.1']
['3', '18.3', '1.7']
['4', '18.3', '1.0']
['5', '17.8', '1.3']

Quoting the documentation:

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
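Building on that, here is a minimal sketch of the whole calculation using split(). It assumes the layout shown in the question (two header lines, then day, Tmax, Tmin on each row). The function name column_means and the skip_lines parameter are made up for this illustration, and the sample text stands in for the real file.

```python
import io


def column_means(lines, skip_lines=2):
    """Return (mean Tmax, mean Tmin) from an iterable of text lines.

    Assumes each data line is "day tmax tmin" separated by any amount of
    whitespace, preceded by `skip_lines` header lines (hypothetical helper).
    """
    tmax_values = []
    tmin_values = []
    for number, line in enumerate(lines):
        if number < skip_lines:
            continue  # header lines
        fields = line.split()  # no argument: split on runs of whitespace
        if len(fields) < 3:
            continue  # blank or malformed line
        tmax_values.append(float(fields[1]))
        tmin_values.append(float(fields[2]))
    if not tmax_values:
        raise ValueError("no data lines found")
    return (sum(tmax_values) / len(tmax_values),
            sum(tmin_values) / len(tmin_values))


# Sample data copied from the question; an open file object works the same way.
sample = io.StringIO(
    u"2004          Temperature for KATHMANDU AIRPORT\n"
    u"        Tmax  Tmin\n"
    u"     1  18.8   2.4\n"
    u"     2  19.0   1.1\n"
    u"     3  18.3   1.7\n"
    u"     4  18.3   1.0\n"
    u"     5  17.8   1.3\n"
)
mean_tmax, mean_tmin = column_means(sample)
print(round(mean_tmax, 2), round(mean_tmin, 2))
```

To use it on a real file, pass the open file object instead of the sample, for example column_means(f) inside the with block.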

As stated here, re.findall lists all matches of your regular expression. The expression you defined does not match what you expect in your file: on the header lines it returns an empty or too-short list, which leads to the error when you try to access list_line[1].

  • the expression you want to match, based on that file, would be r"\d+\.\d+", which matches any decimal number with at least one digit before the decimal point, the decimal point itself, and at least one digit after it
  • even this expression will not match anything in the first two lines, so you will want to check for empty lists
  • the result knows nothing about columns, only about matches of the pattern, and there will be two matches for each data line, so you will want indices 0 and 1

So:

import re
list_b = []
list_d = []

with open('TA103019.95.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r'\d+\.\d+', line)
        if len(list_line) == 2 :
            list_b.append(float(list_line[0]))  # first decimal on the line: Tmax
            list_d.append(float(list_line[1]))  # second decimal on the line: Tmin

print(list_b)
print(list_d)

import re
list_b = []
list_d = []

with open('TA103019.95.txt', 'r') as f:
    for line in f:
        # regex is corrected to match the decimal values only
        list_line = re.findall(r"\d+\.\d+", line) 

        # error condition handled where the values are not found 
        if len(list_line) < 2: 
            continue

        # indexes are corrected below
        list_b.append(float(list_line[0]))  # first decimal on the line: Tmax
        list_d.append(float(list_line[1]))  # second decimal on the line: Tmin

print(list_b)
print(list_d)

I have added my answer with some comments in the code itself.

You were getting the "list index out of range" error because list_line had only a single element (i.e. 2004, from the first line of the file) and you were trying to access indices 1 and 3 of list_line.
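To see this concretely, the sketch below runs both patterns over a few lines copied from the question and prints what re.findall returns for each. The variable names are only for illustration. Note that the original pattern yields at most three matches even on a data line, so index 3 would also fail there.

```python
import re

original_pattern = r"[\d.\d+']+"  # the character class used in the question
corrected_pattern = r"\d+\.\d+"   # digits, a literal dot, then digits

sample_lines = [
    "2004          Temperature for KATHMANDU AIRPORT",
    "        Tmax  Tmin",
    "     1  18.8   2.4",
]

original_results = []
corrected_results = []
for line in sample_lines:
    original_results.append(re.findall(original_pattern, line))
    corrected_results.append(re.findall(corrected_pattern, line))

for line, old, new in zip(sample_lines, original_results, corrected_results):
    # Show the number of matches so too-short lists are easy to spot.
    print(repr(line), len(old), old, len(new), new)
```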

Full Solution

def readit(file_name, start_line=2):
    # start_line: zero-based index of the first data line (2 means the third line)
    with open(file_name, 'r') as f:
        lines = f.read().split('\n')
    for line in lines[start_line:]:
        row = line.split()  # split on any run of whitespace
        if not row:
            continue  # skip blank lines, e.g. after a trailing newline
        yield int(row[0]), float(row[1]), float(row[2])


iterator = readit('TA103019.95.txt')


index, tmax, tmin = zip(*iterator)


mean_Tmax = sum(tmax)/len(tmax)
mean_Tmin = sum(tmin)/len(tmin)
print('Mean Tmax: ',mean_Tmax)
print('Mean Tmin: ',mean_Tmin)

>>> ('Mean Tmax: ', 18.439999999999998)
>>> ('Mean Tmin: ', 1.5)

Thanks to Dan D. for a more elegant solution.

Simplify your life and avoid 're' for this problem.

Perhaps you are reading the header row by mistake? If the format of the file is fixed, I usually "burn" the header row with a line read before starting the loop, like:

with open(file_name, 'r') as f:
    f.readline()  # burn the header row
    for line in f:
        tokens = line.split()   # tokenize the row on runs of whitespace

Then you have a list of tokens, which will be strings that you'll need to convert to int or float or whatever, and go from there!
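A small sketch of that conversion step, assuming rows shaped like the sample data in the question. It also shows why splitting on a single space character can leave empty strings among the tokens when the columns are padded with several spaces, whereas split() with no argument does not.

```python
line = "     1  18.8   2.4 \n"  # one data row copied from the question

# Splitting on a single space keeps an empty string for every extra space.
space_tokens = line.strip().split(' ')
print(space_tokens)

# Splitting with no argument collapses runs of whitespace.
tokens = line.split()
print(tokens)

# Convert the string tokens to numbers.
day = int(tokens[0])
tmax = float(tokens[1])
tmin = float(tokens[2])
print(day, tmax, tmin)
```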

Put in a couple of print statements to see what you are picking up...

Is it possible that your file is tab delimited?

For tab delimited:

with open('TA103019.95.txt', 'r') as f:
    for idx, line in enumerate(f):
        if idx > 1:                    
            cols = line.split('\t')  # for space-delimited data use line.split() instead
            tmax = float(cols[1])
            tmin = float(cols[2])
            #calc mean

            mean = (tmax + tmin) / 2
            #not sure what you want to do with the result
