Using interpolation search to find the beginning of a list in a large text file - Python

Question

I need to find the last timestamp in a very large log file, where an unknown number of lines may sit between the end of the file and the last line that carries a timestamp. I read the file backwards one line at a time, which is usually very quick, with one exception. Sometimes I run into a very large block (thousands of lines) with a known repeating pattern (one entry shown below) and no timestamps:

  goal_tolerance[0]: 
    name: joint_b
    position: -1
    velocity: -1
    acceleration: -1

Since this is the only case where I have this kind of problem, I can put a piece of code into the loop that checks for it before searching the log line by line.

The number after goal_tolerance is a counter that goes up by 1 each time the pattern repeats, so I would like to use that number to calculate where the pattern begins. What I have now looks something like this:

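if '  goal_tolerance' in line:
    # The counter digits sit between '  goal_tolerance[' (17 characters) and the trailing ']: '
    gtolnum = line[17:-3]
    print(gtolnum)
    # 95 is the size of one entry when the counter has a single digit
    startFrom = currentPosition - ((int(gtolnum) + 1) * 95)
    break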

However, this does not account for the number of characters in the counter, so I end up running through the search loop several more times than necessary. Is there a fast way to include those characters in the calculation?
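To make the drift concrete, here is a small sketch that rebuilds one entry from the sample above and measures it. It assumes the layout shown in the sample (two-space and four-space indents, a trailing space after the colon on the first line, and single-byte `\n` line endings). `entry_bytes` is a made-up helper name, not part of the original script, and the code is written for Python 3.

```python
def entry_bytes(counter):
    """Return one log entry as bytes, assuming the layout of the sample entry."""
    lines = [
        "  goal_tolerance[%d]: " % counter,  # trailing space is an assumption
        "    name: joint_b",
        "    position: -1",
        "    velocity: -1",
        "    acceleration: -1",
    ]
    # Assumes Unix line endings; '\r\n' files would add one byte per line.
    return ("\n".join(lines) + "\n").encode("ascii")


# The entry grows by one byte for each extra digit in the counter.
for counter in (0, 10, 100):
    print(counter, len(entry_bytes(counter)))
```

Under these assumptions a single-digit entry is 95 bytes, which matches the constant in the snippet, and every additional counter digit adds one byte that the fixed multiplier misses.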

EDIT: I do not read the entire file to get to that point, since it is large and I have several hundred timestamps to find in several hundred log files. My search function seeks to a position in the text file, then finds the beginning of a line near that point and reads it. The calculation determines a file position I can pass to .seek(), based on the number of bytes or characters in the pattern.
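The search function itself is not shown, so the following is only a guess at its shape: a minimal Python 3 sketch that seeks to a byte offset, scans backwards for the previous newline, and reads the line that contains the offset. The name `line_at` and the `chunk` parameter are hypothetical, and the file is assumed to be opened in binary mode so that offsets are plain byte counts.

```python
import io


def line_at(f, pos, chunk=256):
    """Return (start_offset, line) for the line containing byte offset pos.

    f must be a binary file object; chunk is the backward scan step.
    """
    start = pos
    while start > 0:
        lo = max(0, start - chunk)
        f.seek(lo)
        block = f.read(start - lo)
        nl = block.rfind(b"\n")
        if nl != -1:
            # The line begins just after the newline we found.
            start = lo + nl + 1
            break
        start = lo  # no newline in this block, keep scanning backwards
    f.seek(start)
    return start, f.readline()


# Small in-memory stand-in for a log file.
log = io.BytesIO(b"first\nsecond line\nthird\n")
print(line_at(log, 8))
```

Working in binary mode keeps the arithmetic on `.seek()` positions straightforward, since text-mode offsets in Python 3 are opaque values rather than byte counts.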

Answer 1

In the meantime I did some maths and came up with a solution:

...
n = int(gtolnum)        # Counter value on the current line
q = len(gtolnum)        # I'll refer to this as the number's "level" (its digit count)
x = n + 1 - 10**(q - 1) # Number of entries in the current level
c = x * (q - 1)         # Additional digits in the current level
p = 0
for i in range(2, q):
    p += 9 * (q - i) * (10**(q - i))  # Additional digits in the lower levels
startFrom = currentPosition - ((n + 1) * 95 + p + c)
...

Seems like there should be a much simpler solution, but I'm not seeing it. Perhaps a log function could help?
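One possibly simpler route, offered as a sketch rather than as the accepted method: instead of summing level by level, count how many counters reach each power of ten. Every counter from 10 up to n contributes one extra digit, every counter from 100 up to n contributes another, and so on. The names `ENTRY_BYTES`, `extra_digit_bytes`, and `block_bytes` are made up for illustration, the 95-byte entry size is taken from the question, and the code is written for Python 3.

```python
ENTRY_BYTES = 95  # size of one entry with a single-digit counter (from the question)


def extra_digit_bytes(n):
    """Count the counter digits beyond the first across entries 0..n."""
    extra = 0
    threshold = 10
    while threshold <= n:
        # Every counter from threshold up to n has at least one more digit.
        extra += n - threshold + 1
        threshold *= 10
    return extra


def block_bytes(n):
    """Bytes occupied by entries 0..n, including the longer counters."""
    return (n + 1) * ENTRY_BYTES + extra_digit_bytes(n)


# Compare against a brute-force digit count for a few counter values.
for n in (0, 9, 10, 99, 100, 12345):
    brute = sum(len(str(k)) - 1 for k in range(n + 1))
    print(n, extra_digit_bytes(n), brute)
```

Here `extra_digit_bytes(n)` plays the role of `p + c` above, so the seek target would be `currentPosition - block_bytes(n)`. No logarithm is needed, because the loop runs once per digit of the counter.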

Solution 1 (0, accepted, 2018-01-24 14:33:34)