Using interpolation search to find beginning of list in large text file - Python
I need to find the last timestamp in a very large log file, and I don't know how many lines I will have to read before reaching a line with a timestamp. I read the file backwards one line at a time, which is usually very quick, with one exception. Sometimes I run into a very large block (thousands of lines) with a known repeating pattern (one entry shown below) and no timestamps:
goal_tolerance[0]:
name: joint_b
position: -1
velocity: -1
acceleration: -1
Since this is the only case where I have this kind of problem, I can just add a check for it inside the loop before searching the log line by line. The number after goal_tolerance is a counter that goes up by 1 each time the pattern repeats, so I would like to use that number to calculate where the pattern begins. What I have now looks something like this:
if ' goal_tolerance' in line:
    gtolnum = line[17:-3]  # digits of the counter between the brackets
    print(gtolnum)
    startFrom = currentPosition - ((int(gtolnum) + 1) * 95)
    break
However, this does not take into account the number of characters in the counter itself, so I end up running through the search loop several more times than necessary. Is there a fast way to include those characters in the calculation?
EDIT: I do not read the entire file to get to that point, since it is large and I have several hundred timestamps to search for in several hundred log files. My search function seeks to a position in the text file, then finds the beginning of a line near that point and reads it. The calculation determines a file position I can use with .seek(), based on the number of bytes or characters in the pattern.
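The seek-then-realign step described above might look roughly like the following. This is a minimal sketch and not the actual search function: `line_at` is a hypothetical name, the file is assumed to be opened in binary mode so that offsets are byte counts, and lines are assumed to end in `\n`.

```python
import io


def line_at(f, pos, chunk=4096):
    """Return (start, line) for the line containing byte offset pos.

    Hypothetical helper for illustration; f must be a binary file object.
    """
    start = pos
    # Walk backwards in chunks until a newline before pos is found.
    while start > 0:
        step = min(chunk, start)
        f.seek(start - step)
        block = f.read(step)
        nl = block.rfind(b"\n")
        if nl != -1:
            # The line begins one byte after the preceding newline.
            start = start - step + nl + 1
            break
        start -= step
    f.seek(start)
    return start, f.readline()


# Usage with an in-memory stand-in for the log file.
sample = io.BytesIO(b"alpha\nbravo\ncharlie\n")
start, line = line_at(sample, 8)  # offset 8 falls inside "bravo"
```

Reading in chunks rather than byte by byte keeps the number of read calls low when the estimated offset lands in the middle of a long line.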
In the meantime I did some maths and came up with a solution:
...
n = int(gtolnum)
q = len(gtolnum)  # I'll refer to this as the number's "level"
x = n + 1 - 10**(q - 1)  # Number of entries in the current level
c = x * (q - 1)  # Additional digits in the current level
i = 2
p = 0
while i < q:
    p += 9 * (q - i) * (10**(q - i))  # Additional digits in i levels previous
    i += 1
startFrom = currentPosition - ((n + 1) * 95 + p + c)
...
It seems like there should be a much simpler solution, but I'm not seeing it. Perhaps a log function could help?
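One possibly simpler formulation is to count the extra digits one digit position at a time, which folds the separate `p` and `c` terms into a single loop. The sketch below is not from the original post: `extra_digits` and `block_length` are hypothetical names, and it assumes, as the code above does, that an entry with a one-digit counter is exactly 95 characters long and that the counter starts at 0.

```python
def extra_digits(n):
    """Total number of digits beyond the first in the counters 0..n."""
    total = 0
    power = 10
    while power <= n:
        # Every counter from `power` up to n has one more digit than the
        # counters below `power`, so add one character for each of them.
        total += n - power + 1
        power *= 10
    return total


def block_length(n, base=95):
    """Characters occupied by entries 0..n, assuming `base` per entry
    when the counter has a single digit (an assumption taken from the post)."""
    return (n + 1) * base + extra_digits(n)


# Cross-check against a brute-force digit count for small counters.
for n in range(2000):
    assert extra_digits(n) == sum(len(str(k)) - 1 for k in range(n + 1))

# Hypothetical use in the loop from the post:
# startFrom = currentPosition - block_length(int(gtolnum))
```

The loop runs once per digit of `n`, so it does the job a logarithm would do without relying on floating-point `log10`, which can be unreliable right at powers of ten.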