How to search tens of thousands of items in a list of lists using hundreds of patterns
I'm looking for advice on a better (faster) way to approach this. My problem is that as the "hosts" list grows, the program takes exponentially longer to complete, and if "hosts" is long enough the program takes so long that it seems to just lock up.
My current approach is to use the regex patterns from the CSV file to search through every multi-line list item contained in the i[7] element of "hosts". There are hundreds of possible matches, and I need to identify all matches associated with each IP address and assign the unique string from the CSV file that identifies each pattern match. Finally, I need to put that information into "fullMatchList" to use later.
NOTE: Even though each list item in "searchPatterns" has up to 4 patterns, I only need it to identify the first pattern that matches; it can then move on to the next list item and continue finding matches for that IP.
```python
for i in hosts:
    if i[4] == "13579" or i[4] == "24680":
        for j in searchPatterns:
            for k in range(4):
                if j[k] == "SKIP":
                    continue
                else:
                    match = re.search(r'%s' % j[k], i[7], flags=re.DOTALL)
                    if match is not None:
                        if tempIP == "":
                            tempIP = i[0]
                            matchListPerIP.append(j[4])
                        elif tempIP == i[0]:
                            matchListPerIP.append(j[4])
                        elif tempIP != i[0]:
                            fullMatchList.append([tempIP, matchListPerIP])
                            tempIP = i[0]
                            matchListPerIP = []
                            matchListPerIP.append(j[4])
                        break
fullMatchList.append([tempIP, matchListPerIP])
```
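One immediate, language-level speedup is to compile each pattern once up front instead of handing `re.search` a fresh pattern string on every host, and to collect the per-IP results directly rather than tracking `tempIP` state. A minimal sketch, assuming the data layout implied by the question (i[0] = IP, i[4] = the filter field, i[7] = the multi-line text, j[4] = the identifier string); `find_matches` and `compile_patterns` are made-up helper names:

```python
import re

def compile_patterns(searchPatterns):
    """Compile each non-"SKIP" pattern once, paired with its identifier."""
    compiled = []
    for row in searchPatterns:
        pats = [re.compile(p, flags=re.DOTALL) for p in row[:4] if p != "SKIP"]
        compiled.append((pats, row[4]))
    return compiled

def find_matches(hosts, searchPatterns):
    compiled = compile_patterns(searchPatterns)  # done once, not per host
    fullMatchList = []
    for host in hosts:
        if host[4] not in ("13579", "24680"):
            continue
        matches = []
        for pats, label in compiled:
            # Stop at the first of the (up to 4) patterns that matches.
            if any(p.search(host[7]) for p in pats):
                matches.append(label)
        fullMatchList.append([host[0], matches])
    return fullMatchList
```

This keeps the same output shape but moves all pattern compilation out of the inner loop; `any(...)` also preserves the "first match wins, then move on" behavior of the original `break`.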
Here's an example regex search pattern from the CSV file:
```
(?!(.*?)\\br2\\b)cpe:/o:microsoft:windows_server_2008:
```
That pattern is intended to identify Windows Server 2008, and includes a negative lookahead to avoid matching the R2 edition.
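It may help to check how that pattern behaves once loaded; a quick sketch, assuming the doubled backslashes in the CSV collapse to single ones when the field is read as a plain string:

```python
import re

# Effective pattern after the CSV's doubled backslashes collapse:
pattern = r'(?!(.*?)\br2\b)cpe:/o:microsoft:windows_server_2008:'

plain = "cpe:/o:microsoft:windows_server_2008:-"
r2_ed = "cpe:/o:microsoft:windows_server_2008:r2"

# The negative lookahead rejects any start position from which a
# standalone "r2" appears later (DOTALL lets .*? cross newlines).
print(bool(re.search(pattern, plain, flags=re.DOTALL)))  # True
print(bool(re.search(pattern, r2_ed, flags=re.DOTALL)))  # False
```

One caveat: because the lookahead scans the entire remaining text, an "r2" anywhere later in a multi-line host record also suppresses the match, which may or may not be what is intended.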
I'm new to Python, so any advice is appreciated. Thank you!
The NIDS community has done a lot of work on testing the same string(s) (network packets) against a long list of regexes (firewall rules).
I haven't read the literature, but Coit et al.'s "Towards faster string matching for intrusion detection or exceeding the speed of Snort" appears to be a good starting point.
Quoting from the Introduction:
> The basic string matching task that must be performed by a NIDS is to match a number of patterns drawn from the NIDS rules to each packet or reconstructed TCP stream that the NIDS is analyzing. In Snort, the total number of rules available has become quite large, and continues to grow rapidly. As of 10/10/2000 there were 854 rules included in the "10102kany.rules" ruleset file [5]. 68 of these rules did not require content matching while 786 relied on content matching to identify harmful packets. Thus, even though not every pattern string is applied to every stream, there are a large number of patterns being applied to some streams. For example, in traffic inbound to a web server, Snort v1.6.3 with the snort.org ruleset, "10102kany.rules", checks up to 315 pattern strings against each packet. At the moment, it checks each pattern in turn using the Boyer-Moore algorithm. Since the patterns often have something in common, it seemed likely that there is considerable scope for efficiency improvements here, and so it has proved.
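In that spirit, one stdlib-only way to exploit what the patterns have in common is to join them into a single alternation with named groups, so the regex engine scans each text once instead of once per pattern. A sketch with made-up labels; it assumes labels are valid Python identifiers and that the individual patterns contain no capturing groups of their own (those would interfere with `lastgroup`):

```python
import re

def build_combined(labelled_patterns):
    """Join (label, pattern) pairs into one alternation regex."""
    parts = ['(?P<%s>%s)' % (label, pat) for label, pat in labelled_patterns]
    return re.compile('|'.join(parts), flags=re.DOTALL)

combined = build_combined([
    ('win2008', r'cpe:/o:microsoft:windows_server_2008:'),
    ('linux',   r'cpe:/o:linux:linux_kernel'),
])

text = "banner ... cpe:/o:linux:linux_kernel:2.6 ..."
# lastgroup names which alternative matched at each position.
labels = {m.lastgroup for m in combined.finditer(text)}
print(labels)  # {'linux'}
```

Python's backtracking engine still tries the alternatives in turn, but compilation and scanning overhead is paid once per text. For truly large pattern sets, DFA-based engines (the direction the Snort work points to; third-party libraries such as RE2 or Hyperscan take this approach) match all patterns in a single linear pass.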