Open and read a txt file that is space delimited
I have a space-separated txt file like the following:
2004 Temperature for KATHMANDU AIRPORT
Tmax Tmin
1 18.8 2.4
2 19.0 1.1
3 18.3 1.7
4 18.3 1.0
5 17.8 1.3
I want to calculate the mean of Tmax and Tmin separately, but I am having a hard time reading the txt file. I tried the approach from this link:
import re
list_b = []
list_d = []
with open('TA103019.95.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r"[\d.\d+']+", line)
        list_b.append(float(list_line[1])) #appends second column
        list_d.append(float(list_line[3])) #appends fourth column
print list_b
print list_d
But it gives me this error:
IndexError: list index out of range
What is wrong here?
A simple way to solve that is to use the split() function. Of course, you need to skip the first two lines:
import io

with io.open("path/to/file.txt", mode="r", encoding="utf-8") as f:
    next(f)  # skip the title line
    next(f)  # skip the column header line
    for line in f:
        print(line.split())
You get:
['1', '18.8', '2.4']
['2', '19.0', '1.1']
['3', '18.3', '1.7']
['4', '18.3', '1.0']
['5', '17.8', '1.3']
Quoting the documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
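To make the quoted behavior concrete, here is a small sketch comparing split() with split(' '). The sample string is made up for illustration (it is not taken from the file) and is padded with extra spaces on purpose.

```python
# Hypothetical sample line with leading, repeated, and trailing spaces
line = "  1   18.8  2.4 \n"

# No argument: runs of whitespace count as a single separator,
# and no empty strings appear at the start or end
tokens_default = line.split()

# Explicit single space: every space is a separator,
# so empty strings show up between consecutive spaces
tokens_space = line.split(' ')

print(tokens_default)
print(tokens_space)
```

If the file is guaranteed to have exactly one space between columns, both calls give the same tokens apart from the trailing newline; split() with no argument is simply the more forgiving choice.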
As stated here, re.findall lists all matches of your regular expression. The expression you defined does not produce the matches you expect: on the first line of the file it finds only 2004, so you get the error when you try to access list_line[1].
Use r"\d+\.\d+" instead, which matches any decimal number with at least one digit before the decimal point, the decimal point itself, and at least one digit after it. With that pattern, the indices to use are 0 and 1,
so:
import re
list_b = []
list_d = []
with open('TA103019.95.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r'\d+\.\d+', line)
        if len(list_line) == 2:
            list_b.append(float(list_line[0]))  # first decimal on the line (Tmax)
            list_d.append(float(list_line[1]))  # second decimal on the line (Tmin)
print(list_b)
print(list_d)
import re
list_b = []
list_d = []
with open('TA103019.95.txt', 'r') as f:
    for line in f:
        # regex is corrected to match the decimal values only
        list_line = re.findall(r"\d+\.\d+", line)
        # skip lines (such as the headers) where two values are not found
        if len(list_line) < 2:
            continue
        # indexes are corrected below
        list_b.append(float(list_line[0]))  # first decimal on the line (Tmax)
        list_d.append(float(list_line[1]))  # second decimal on the line (Tmin)
print(list_b)
print(list_d)
I have added my answer with some comments in the code itself.
You were getting the "IndexError: list index out of range" because your list_line had only a single element (i.e. 2004, from the first line of the file) and you were trying to access indices 1 and 3 of list_line.
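As a quick check of that explanation, the sketch below runs both patterns on the title line and on one data row copied from the question. The variable names are chosen here for illustration only.

```python
import re

# Lines copied from the sample file in the question
header = "2004 Temperature for KATHMANDU AIRPORT"
data_row = "1 18.8 2.4"

original_pattern = r"[\d.\d+']+"  # pattern used in the question
corrected_pattern = r"\d+\.\d+"   # pattern used in the answers

# The original pattern finds one token on the title line,
# so index 1 does not exist there
header_original = re.findall(original_pattern, header)
# On a data row it finds three tokens, so index 3 does not exist either
row_original = re.findall(original_pattern, data_row)

# The corrected pattern ignores integers and keeps only the decimals
header_corrected = re.findall(corrected_pattern, header)
row_corrected = re.findall(corrected_pattern, data_row)

print(header_original, row_original)
print(header_corrected, row_corrected)
```

Note that the corrected pattern would miss a temperature written without a decimal point (for example 19), which is one reason the split()-based answers are more robust for this file.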
Full Solution
def readit(file_name, start_line=2):
    # start_line is the 0-based index of the first data line (2 means the 3rd line)
    with open(file_name, 'r') as f:
        data = f.read().split('\n')
    for line in data[start_line:]:
        row = line.split()  # split on any run of whitespace
        if not row:
            continue  # skip blank lines, e.g. the one after a trailing newline
        yield int(row[0]), float(row[1]), float(row[2])

iterator = readit('TA103019.95.txt')
index, tmax, tmin = zip(*iterator)
mean_Tmax = sum(tmax)/len(tmax)
mean_Tmin = sum(tmin)/len(tmin)
print('Mean Tmax: ',mean_Tmax)
print('Mean Tmin: ',mean_Tmin)
>>> ('Mean Tmax: ', 18.439999999999998)
>>> ('Mean Tmin: ', 1.5)
Thanks to Dan D. for a more elegant solution.
Simplify your life and avoid 're' for this problem.
Perhaps you are reading the header row by mistake? If the format of the file is fixed, I usually "burn" the header row with a line read before starting the loop, like:
with open(file_name, 'r') as f:
    f.readline()  # burn the header row
    for line in f:
        tokens = line.strip().split(' ')  # tokenize the row based on spaces
Then you have a list of tokens, which are strings that you'll need to convert to int or float as appropriate, and go from there!
Put in a couple of print statements to see what you are picking up...
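A minimal sketch of the "convert the tokens and go from there" step might look like the following. The helper name column_means and its header_lines parameter are made up for this example, and the sample text is copied from the question and wrapped in io.StringIO so the snippet runs without the real file; with the actual file you would pass the object returned by open() instead.

```python
import io

# Sample data copied from the question; stands in for the real file
sample = u"""2004 Temperature for KATHMANDU AIRPORT
Tmax Tmin
1 18.8 2.4
2 19.0 1.1
3 18.3 1.7
4 18.3 1.0
5 17.8 1.3
"""

def column_means(f, header_lines=2):
    """Return (mean Tmax, mean Tmin) read from an open text file object.

    Hypothetical helper: assumes the day number is in the first column
    and Tmax, Tmin are in the second and third columns.
    """
    for _ in range(header_lines):
        next(f)  # burn the header rows
    tmax, tmin = [], []
    for line in f:
        tokens = line.split()
        if len(tokens) < 3:
            continue  # skip blank or incomplete lines
        tmax.append(float(tokens[1]))
        tmin.append(float(tokens[2]))
    return sum(tmax) / len(tmax), sum(tmin) / len(tmin)

mean_tmax, mean_tmin = column_means(io.StringIO(sample))
print(mean_tmax, mean_tmin)
```

The sketch skips two header lines because the sample file has both a title line and a column-name line; it does not guard against a file with no data rows, which would raise ZeroDivisionError.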
Is it possible that your file is tab delimited?
For tab delimited:
with open('TA103019.95.txt', 'r') as f:
    for idx, line in enumerate(f):
        if idx > 1:  # skip the two header lines
            cols = line.split('\t')  # for space delimited, change '\t' to ' '
            tmax = float(cols[1])
            tmin = float(cols[2])
            # calc mean of this row's Tmax and Tmin
            mean = (tmax + tmin) / 2
            # not sure what you want to do with the result
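One way to find out which delimiter a file actually uses is to print the repr() of a line, which shows tabs as \t. The two rows below are hypothetical stand-ins for a line read from the file, not real file contents.

```python
# Two hypothetical rows: one tab-delimited, one space-delimited
tab_row = "1\t18.8\t2.4\n"
space_row = "1 18.8 2.4\n"

for row in (tab_row, space_row):
    # repr() makes invisible characters such as tabs and newlines visible
    print(repr(row))
    # split() with no argument treats tabs and spaces alike
    print(row.split())

# A simple membership test also tells the two cases apart
uses_tabs = "\t" in tab_row
```

Because split() with no argument splits on any whitespace, it yields the same tokens for both rows, so the parsing code does not need to know the delimiter in advance.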