How to search tens of thousands of items in a list of lists using hundreds of patterns

Question

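I am looking for a better (faster) way to solve this problem. As the "hosts" list grows, the program takes exponentially longer to finish, and once "hosts" is long enough it takes so long that it appears to lock up.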

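"hosts" is a list of lists containing tens of thousands of items. When iterating over "hosts", i[0] is always an IP address, i[4] is always a 5-digit number, and i[7] is always a multi-line string.
"searchPatterns" is a list of lists read in from a CSV file, where elements i[0] through i[3] are regex search patterns (or the string "SKIP") and i[6] is a unique string used to identify a pattern match.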

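My current approach uses the regex patterns from the CSV file to search each multi-line string held in the i[7] element of "hosts". There are hundreds of possible matches, and I need to identify all of the matches associated with each IP address and assign the unique string from the CSV file to identify each pattern match. Finally, I need to put that information into "fullMatchList" for later use.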

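Note: even though each list item in "searchPatterns" holds up to 4 patterns, I only need the first pattern that matches. After that, the search can move on to the next list item and keep looking for matches for that IP.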

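import re

# Assumed setup: these names are used below but are not initialized in the snippet as posted.
tempIP = ""
matchListPerIP = []
fullMatchList = []

for i in hosts: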
    if i[4] == "13579" or i[4] == "24680":
        for j in searchPatterns:
            for k in range(4):
                if j[k] == "SKIP":
                    continue
                else:
                    match = re.search(r'%s' % j[k], i[7], flags=re.DOTALL)
                    if match is not None:
                        if tempIP == "":
                            tempIP = i[0]
                            matchListPerIP.append(j[4])
                        elif tempIP == i[0]:
                            matchListPerIP.append(j[4])
                        elif tempIP != i[0]:
                            fullMatchList.append([tempIP, matchListPerIP])
                            tempIP = i[0]
                            matchListPerIP = []
                            matchListPerIP.append(j[4])
                        break
fullMatchList.append([tempIP, matchListPerIP])

Here is a sample regex search pattern from the CSV file:
(?!(.*?)\\br2\\b)cpe:/o:microsoft:windows_server_2008:

This pattern is meant to identify Windows Server 2008 and includes a negative lookahead to avoid matching the R2 edition.
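
To make the behavior of that pattern concrete, here is a minimal sketch that runs it against three made-up strings. It assumes the doubled backslashes in the CSV sample stand for single regex escapes (so `\b` is a word boundary); the sample strings and the `is_match` helper are hypothetical and exist only for illustration.

```python
import re

# Assumption: "\\b" in the CSV sample is meant as the regex word boundary \b.
PATTERN = re.compile(
    r"(?!(.*?)\br2\b)cpe:/o:microsoft:windows_server_2008:",
    flags=re.DOTALL,  # same flag the question passes to re.search
)

# Hypothetical sample inputs, not taken from the original data.
plain = "OS CPE: cpe:/o:microsoft:windows_server_2008::sp2"
r2_same_line = "OS CPE: cpe:/o:microsoft:windows_server_2008:r2"
r2_later_line = "cpe:/o:microsoft:windows_server_2008::sp2\nhostname: r2-backup"


def is_match(text):
    """Return True when the pattern is found anywhere in text."""
    return PATTERN.search(text) is not None


for sample in (plain, r2_same_line, r2_later_line):
    print(repr(sample), is_match(sample))
```

Only the first string matches. The third is rejected because, with re.DOTALL, the lookahead sees past line breaks, so a standalone "r2" token anywhere later in the multi-line string suppresses the match.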

I am new to Python, so any advice is appreciated! Thanks!

Answer 1

The NIDS community has done a great deal of work on testing the same string (a network packet) against a long list of regexes (firewall rules).

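I have not read the literature, but Coit et al., "Towards Faster String Matching for Intrusion Detection or Exceeding the Speed of Snort", looks like a good place to start.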

A quote from the introduction:

The basic string matching task that must be
performed by a NIDS is to match a number of patterns drawn from the NIDS rules to 
each packet or reconstructed TCP stream that the NIDS is analyzing. In Snort, the 
total number of rules available has become quite large, and continues to grow 
rapidly. As of 10/10/2000 there were 854 rules included in the “10102kany.rules” 
ruleset file [5]. 68 of these rules did not require content matching while 786 
relied on content matching to identify harmful packets. Thus, even though not 
every pattern string is applied to every stream, there are a large number of 
patterns being applied to some streams. For example, in traffic inbound to a web 
server, Snort v 1.6.3 with the snort.org ruleset, “10102kany.rules”, checks up to 
315 pattern strings against each packet. At the moment, it checks each pattern in 
turn using the Boyer-Moore algorithm. Since the patterns often have something in 
common, it seemed likely that there is considerable scope for efficiency 
improvements here, and so it has proved.
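
As a baseline before reaching for the set-matching algorithms that literature describes, the loop from the question can be restructured so that each pattern is compiled once and results are grouped per IP in a dictionary. The following is only a sketch under assumptions, not the technique from the paper: the names `TARGET_CODES`, `compile_rows`, and `match_hosts` are hypothetical, and the label is read from column 4 because that is what the question's loop uses (the prose mentions i[6]).

```python
import re
from collections import defaultdict

# The two i[4] values tested in the question's loop.
TARGET_CODES = {"13579", "24680"}


def compile_rows(search_patterns, flags=re.DOTALL):
    """Compile every non-SKIP pattern once and keep the row's label with it."""
    rows = []
    for row in search_patterns:
        regexes = [re.compile(p, flags) for p in row[:4] if p != "SKIP"]
        if regexes:
            rows.append((regexes, row[4]))  # assumption: label lives in column 4
    return rows


def match_hosts(hosts, search_patterns):
    """Return {ip: [label, ...]} for hosts whose code is in TARGET_CODES."""
    rows = compile_rows(search_patterns)
    matches = defaultdict(list)
    for host in hosts:
        if host[4] not in TARGET_CODES:
            continue
        text = host[7]
        for regexes, label in rows:
            # any() stops at the first pattern in the row that matches,
            # mirroring the break in the original loop.
            if any(rx.search(text) for rx in regexes):
                matches[host[0]].append(label)
    return dict(matches)
```

Unlike the original loop, which groups only consecutive entries with the same IP, the dictionary collects every entry for an IP regardless of order. The number of regex searches is unchanged, so this is a cleanup rather than an algorithmic speedup.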
