Efficient way to check the last term of each line in a file in Python

Question

I'm writing a Python script that takes in a (potentially large) file. Here is an example of how that input file could be formatted:

class1 1:v1 2:v2 3:v3 4:v4 5:v5
class2 1:v6 4:v7 5:v8 6:v9
class1 3:v10 4:v11 5:v12 6:v13 8:v14
class2 1:v15 2:v16 3:v17 5:v18 7:v19

Here class1 and class2 are numbers, e.g. 1 and -1. (A curious reader may notice that this is a LIBSVM-related file, but knowing the software isn't necessary in this case.) The values v1, v2, ..., v19 represent any integer or float value. My real files are much larger than this, both in total lines and in length per line, which is why I'm concerned about efficiency here.

I am trying to find the greatest value to the left of a colon. In LIBSVM these are called "features", and they are always integers here. For instance, in the example above, line 1 has 5 as its largest feature, line 2 has 6, line 3 has 8, and line 4 has 7. Since 8 is the largest of these values, that is my desired value. I'm looking at a file with possibly thousands of features per line and many hundreds of thousands of lines.

The file satisfies the following properties:

The features must be strictly increasing, i.e. "3:v1 4:v2" is allowed, but "3:v1 3:v2" is not.
The features are not necessarily consecutive, so some can be skipped. In the first example I gave, the first line has its features in consecutive order (1, 2, 3, 4, 5) and skips features 6, 7, and 8. The other three lines do not have their features in consecutive order. That's okay, as long as the features are strictly increasing.
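
As a rough sketch of these two properties (the helper names `features_of` and `is_strictly_increasing` and the sample values are made up for illustration), the features of a line can be extracted and checked for strict increase. When the check holds, the last feature on a line is also the largest one on that line:

```python
def features_of(line):
    """Return the integer feature indices of one line, in file order."""
    terms = line.split()[1:]          # drop the leading class label
    return [int(term.split(':')[0]) for term in terms]

def is_strictly_increasing(features):
    """True if every feature is larger than the one before it."""
    return all(a < b for a, b in zip(features, features[1:]))

ok_line = "1 1:0.5 4:2 5:7.25 6:3"    # skipped features are fine
bad_line = "-1 3:0.1 3:0.2"           # repeated feature 3 is not allowed

print(is_strictly_increasing(features_of(ok_line)))
print(is_strictly_increasing(features_of(bad_line)))
# For a valid line, the last feature equals the per-line maximum.
print(features_of(ok_line)[-1] == max(features_of(ok_line)))
```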

Right now, my approach is to go through each line, split it on spaces, split the final term on the colon, and take the feature number. I then keep track of the maximum such featureNum.

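file1 = open(...)
largest = 0  # renamed from max to avoid shadowing the built-in
for line in file1:
    linesplit = line.rstrip('\n').split(' ')
    val = linesplit[-1]  # last "feature:value" term on the line
    valsplit = val.split(':')
    featureNum = int(valsplit[0])  # compare as integers, not strings
    if featureNum > largest:
        largest = featureNum
print(largest)
file1.close()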

But I'm hoping there is a better or more efficient way of doing this, e.g. some way of analyzing the file by only getting the terms that directly precede a newline character (maybe to avoid reading all the lines?). I'm new to Python, so it wouldn't surprise me if I missed something obvious.

Possible reference: http://docs.python.org/library/stdtypes.html

Answer 1

Since you don't care about all the features in a line but just the last one, you don't need to split the whole line. I don't know if this is actually faster, though; you need to time it and see. It definitely isn't as Pythonic as splitting the entire line.

def last_feature(line):
    start = line.rfind(' ') + 1
    end = line.rfind(':')
    return int(line[start:end])

with open(...) as file1:
    largest = max(last_feature(line) for line in file1)
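
To try this without a real file, here is a small sketch that runs `last_feature` on the four sample lines from the question (with made-up numeric values) and times it against a split-based variant. `last_feature_split`, `SAMPLE`, and the synthetic 2000-feature line are assumptions for illustration only; actual timings will depend on your data and Python version.

```python
import io
import timeit

def last_feature(line):
    # Same approach as above: look only at the text between the last
    # space and the last colon.
    start = line.rfind(' ') + 1
    end = line.rfind(':')
    return int(line[start:end])

def last_feature_split(line):
    # Hypothetical baseline: split the whole line, keep the last term.
    return int(line.split()[-1].split(':')[0])

SAMPLE = (
    "1 1:0.1 2:0.2 3:0.3 4:0.4 5:0.5\n"
    "-1 1:0.6 4:0.7 5:0.8 6:0.9\n"
    "1 3:1.0 4:1.1 5:1.2 6:1.3 8:1.4\n"
    "-1 1:1.5 2:1.6 3:1.7 5:1.8 7:1.9\n"
)

# io.StringIO stands in for the opened file here.
largest = max(last_feature(line) for line in io.StringIO(SAMPLE))
print(largest)  # 8

# A synthetic line with 2000 features, to mimic a long real line.
long_line = "1 " + " ".join("%d:0.5" % i for i in range(1, 2001)) + "\n"
for func in (last_feature, last_feature_split):
    seconds = timeit.timeit(lambda: func(long_line), number=1000)
    print(func.__name__, seconds)
```

Note that `last_feature` assumes each line ends with a "feature:value" term followed directly by the newline; a trailing space or an empty line would make the `int(...)` call fail.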

Answer 1: accepted, 2012-07-10 16:45:44
