Using interpolation search to find beginning of list in large text file - Python

I need to find the last timestamp in a very large log file, and I don't know how many lines I have to read before I reach a line with a timestamp. I read the file backwards one line at a time, which is usually very quick except for one case. Sometimes I run into a very large block (thousands of lines) with a known repeating pattern (one entry shown below) and no timestamps:

  goal_tolerance[0]: 
    name: joint_b
    position: -1
    velocity: -1
    acceleration: -1

Since this is the only case where I have this kind of problem, I can just put a piece of code into the loop that checks for it before searching the log line by line.

The number after goal_tolerance is a counter that goes up by 1 each time the pattern repeats, so I would like to use that number to calculate where the pattern begins. What I have now looks something like this:

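if '  goal_tolerance' in line:
    # The counter sits between '[' (index 16) and the trailing ']: '
    gtolnum = line[17:-3]
    print gtolnum
    # Jump back over (counter + 1) entries, assuming 95 characters per entry
    startFrom = currentPosition - ((long(gtolnum) + 1) * 95)
    break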

However, this does not take into account the number of characters in the counter, so I end up running through the search loop several more times than necessary. Is there a fast way to include those characters in the calculation?
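To make the size of the error concrete, here is a small sketch that builds the entries by brute force and compares their real length with the 95-per-entry estimate. It assumes every entry looks exactly like the sample above (same joint name length, a trailing space after the colon, "\n" line endings, ASCII only), and the names ENTRY_TEMPLATE, block_size, naive_size and shortfall are made up for illustration. It is written for Python 3.

```python
# Layout copied from the sample entry; assumed identical for every entry.
ENTRY_TEMPLATE = (
    "  goal_tolerance[{n}]: \n"
    "    name: joint_b\n"
    "    position: -1\n"
    "    velocity: -1\n"
    "    acceleration: -1\n"
)


def block_size(last_counter):
    """Exact length of entries 0..last_counter, counted by brute force."""
    return sum(len(ENTRY_TEMPLATE.format(n=n)) for n in range(last_counter + 1))


def naive_size(last_counter):
    """Estimate from the snippet above: a flat 95 characters per entry."""
    return (last_counter + 1) * 95


def shortfall(last_counter):
    """How many characters the flat estimate misses."""
    return block_size(last_counter) - naive_size(last_counter)


if __name__ == "__main__":
    for n in (9, 99, 100, 1000):
        print(n, shortfall(n))
```

Under these assumptions the estimate is exact up to counter 9 and falls behind by one character for every extra digit after that, which is why the search loop has to make up the difference.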

EDIT: I do not read the entire file to get to that point, since it is large and I have several hundred timestamps to search for in several hundred log files. My search function seeks to a position in the text file, then finds the beginning of a line near that point and reads it. The calculation determines a file position I can use with .seek(), based on the number of bytes or characters in the pattern.
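For reference, the seek-then-find-line-start step can be sketched roughly like this. This is not my actual search function, only a minimal illustration: read_line_at and its chunk parameter are hypothetical names, and it assumes the file is opened in binary mode so that positions are plain byte offsets.

```python
import io


def read_line_at(f, pos, chunk=4096):
    """Return (line_start, line) for the line containing byte offset pos.

    f must be a binary file object so that seek() offsets are byte counts.
    """
    start = max(pos, 0)
    # Scan backwards in chunks until a newline before `pos` turns up.
    while start > 0:
        lo = max(start - chunk, 0)
        f.seek(lo)
        data = f.read(start - lo)
        nl = data.rfind(b"\n")
        if nl != -1:
            start = lo + nl + 1  # the line begins right after that newline
            break
        start = lo  # no newline in this chunk, keep going back
    f.seek(start)
    return start, f.readline()


if __name__ == "__main__":
    demo = io.BytesIO(b"first\nsecond line\nthird\n")
    print(read_line_at(demo, 10))
```

Working in bytes keeps the arithmetic consistent with .seek(); in text mode the offsets returned by tell() are opaque and should not be computed by hand.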

In the meantime I did some maths and came up with this solution:

...
n = long(gtolnum)
q = len(gtolnum)        # I'll refer to this as the number's "level"
x = n + 1 - 10**(q - 1) # Number of entries in the current level
c = x * (q - 1)         # Additional digits in the current level
i = 2
p = 0
while i < q:
    p += 9 * (q - i) * (10**(q - i))  # Additional digits in i levels previous
    i += 1
startFrom = currentPosition - ((n + 1) * 95 + p + c)
...

It seems like there should be a much simpler solution, but I'm not seeing it. Perhaps a log function could help?
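One possible simplification, offered as a sketch rather than something tested against the real logs: if every counter from 0 to n were charged q - 1 extra digits (q being the number of digits in n), the total would be (q - 1) * (n + 1). That overcharges the shorter counters by 10 + 100 + ... + 10**(q - 1) = (10**q - 10) / 9, so the loop collapses into a single expression. The function names below are hypothetical, the 95-character entry size is carried over from the code above, and the code is Python 3 (int instead of long). The brute-force version is only there to cross-check the closed form.

```python
def extra_digits_closed(n):
    """Digits beyond the first, summed over counters 0..n (closed form)."""
    q = len(str(n))  # number of digits in n, the "level"
    return (q - 1) * (n + 1) - (10**q - 10) // 9


def extra_digits_brute(n):
    """Same quantity, counted one counter at a time."""
    return sum(len(str(k)) - 1 for k in range(n + 1))


def start_of_block(current_position, n, entry_size=95):
    """Position to seek to, following the same convention as startFrom."""
    return current_position - ((n + 1) * entry_size + extra_digits_closed(n))


if __name__ == "__main__":
    # Cross-check the closed form against the brute-force count.
    assert all(extra_digits_closed(n) == extra_digits_brute(n)
               for n in range(20000))
    print(start_of_block(1000000, 1234))
```

A logarithm would only be needed to get q, as floor(log10(n)) + 1 for n > 0, but len(str(n)) gives the same value without any floating-point rounding concerns and also handles n = 0.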
