
How to search tens of thousands of items in a list of lists using hundreds of patterns

I'm looking for advice on a better (faster) way to approach this. My problem is that as the "hosts" list gets longer, the program takes dramatically longer to complete, and if "hosts" is long enough it takes so long that it seems to just lock up.

  • "hosts" is a list of lists that contains tens of thousands of items. When iterating through "hosts", i[0] will always be an IP address, i[4] will always be a 5-digit number, and i[7] will always be a multi-line string.
  • "searchPatterns" is a list of lists read in from a CSV file where elements i[0] through i[3] are regex search patterns (or the string "SKIP") and i[6] is a unique string used to identify a pattern match.

My current approach is to use the regex patterns from the CSV file to search the multi-line string in each "hosts" item's i[7] element. There are hundreds of possible matches, and I need to identify all matches associated with each IP address and use the unique string from the CSV file to identify each pattern match. Finally, I need to put that information into "fullMatchList" to use later.

NOTE: Even though each list item in "searchPatterns" has up to 4 patterns, I only need it to identify the first pattern found and then it can move on to the next list item to continue finding matches for that IP.

import re

tempIP = ""             # IP address currently being accumulated
matchListPerIP = []     # identifier strings matched for that IP
fullMatchList = []      # final [IP, [identifiers]] results

for i in hosts:
    if i[4] == "13579" or i[4] == "24680":
        for j in searchPatterns:
            for k in range(4):
                if j[k] == "SKIP":
                    continue
                match = re.search(j[k], i[7], flags=re.DOTALL)
                if match is not None:
                    if tempIP == "":
                        tempIP = i[0]
                        matchListPerIP.append(j[4])
                    elif tempIP == i[0]:
                        matchListPerIP.append(j[4])
                    else:
                        # new IP: save the previous IP's matches and start a new list
                        fullMatchList.append([tempIP, matchListPerIP])
                        tempIP = i[0]
                        matchListPerIP = [j[4]]
                    break  # first matching pattern in this row is enough
fullMatchList.append([tempIP, matchListPerIP])

Here's an example regex search pattern from the CSV file:
(?!(.*?)\\br2\\b)cpe:/o:microsoft:windows_server_2008:

That pattern is intended to identify Windows Server 2008, and includes a negative lookahead to avoid matching the R2 edition.
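
Not part of the original question, but here is a minimal sketch of how that lookahead behaves, using made-up sample strings and assuming the pattern reaches re.search() with single backslashes, so that \b acts as a word boundary:

import re

# Pattern from the CSV example above, written as a Python raw string.
pattern = r'(?!(.*?)\br2\b)cpe:/o:microsoft:windows_server_2008:'

# Hypothetical multi-line blobs of the kind stored in hosts[i][7].
plain_2008 = "scan output...\ncpe:/o:microsoft:windows_server_2008:-:sp2"
r2_2008 = "scan output...\ncpe:/o:microsoft:windows_server_2008:r2"

print(bool(re.search(pattern, plain_2008, flags=re.DOTALL)))  # True: no "r2" after the match point
print(bool(re.search(pattern, r2_2008, flags=re.DOTALL)))     # False: the lookahead sees "r2" and rejects the match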

I'm new to Python so any advice is appreciated! Thank you!

The NIDS community has done a lot of work on testing the same string(s) (network packets) against a long list of regexes (firewall rules).

I haven't read the literature, but Coit et al.'s "Towards faster string matching for intrusion detection or exceeding the speed of Snort" appears to be a good starting point.

Quoting from the Introduction:

The basic string matching task that must be performed by a NIDS is to match a number of patterns drawn from the NIDS rules to each packet or reconstructed TCP stream that the NIDS is analyzing. In Snort, the total number of rules available has become quite large, and continues to grow rapidly. As of 10/10/2000 there were 854 rules included in the “10102kany.rules” ruleset file [5]. 68 of these rules did not require content matching while 786 relied on content matching to identify harmful packets. Thus, even though not every pattern string is applied to every stream, there are a large number of patterns being applied to some streams. For example, in traffic inbound to a web server, Snort v 1.6.3 with the snort.org ruleset, “10102kany.rules”, checks up to 315 pattern strings against each packet. At the moment, it checks each pattern in turn using the Boyer-Moore algorithm. Since the patterns often have something in common, it seemed likely that there is considerable scope for efficiency improvements here, and so it has proved.
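
The paper's improvements come from matching all the pattern strings together in a single pass over each packet rather than running one Boyer-Moore search per pattern, which doesn't translate directly to arbitrary Python regexes. A rough sketch of the same "scan each text once" idea, under assumptions that go beyond the question (that the CSV patterns combine cleanly into one alternation, that the regex engine accepts that many groups, and that matches for different patterns don't start at the same position or overlap), might look like this; the identifier column is taken to be j[4] as in the question's code:

import re
from collections import defaultdict

# Fold every non-"SKIP" pattern into a single alternation with synthetic named
# groups, so each hosts[i][7] blob is scanned once instead of once per pattern.
group_to_id = {}
branches = []
for row_idx, row in enumerate(searchPatterns):
    for col in range(4):
        if row[col] != "SKIP":
            name = "g%d_%d" % (row_idx, col)
            group_to_id[name] = row[4]          # identifier string for this row
            branches.append("(?P<%s>%s)" % (name, row[col]))
combined = re.compile("|".join(branches), flags=re.DOTALL)

matches_per_ip = defaultdict(set)
for host in hosts:
    if host[4] in ("13579", "24680"):
        for m in combined.finditer(host[7]):
            for name, text in m.groupdict().items():
                if text is not None and name in group_to_id:
                    matches_per_ip[host[0]].add(group_to_id[name])

fullMatchList = [[ip, sorted(ids)] for ip, ids in matches_per_ip.items()]

Using a set per IP also removes the need to track tempIP by hand: duplicate hits for the same row collapse automatically, and the results stay grouped by IP regardless of how the hosts list is ordered.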
