Combing through several large data structures of different classes in Python; How can I combine and store data I need while reducing memory usage?

What's going on

I'm collecting data from a few thousand network devices every few minutes in Python 2.7.8 via the netsnmp package. I'm also using fastsnmpy so that I can access the (more efficient) Net-SNMP command snmpbulkwalk.

I'm trying to cut down how much memory my script uses. I'm running three instances of the same script, each of which sleeps for two minutes before re-querying all devices for the data we want. When I created the original script in bash, the instances would use less than 500MB when active simultaneously. Since converting it to Python, however, each instance hogs 4GB, which indicates (to me) that my data structures need to be managed more efficiently. Even when idle they're consuming a total of 4GB.


Code Activity

My script begins by creating a list: I open a file and append the hostname of each of our target devices as a separate value. These lists usually contain 80 to 1200 names.

expand = []
with open(self.deviceList, 'r') as f:    # read target hostnames, one per line
    for line in f:
        expand.append(line.strip())

From there I set up the SNMP sessions and execute the requests:

expandsession = SnmpSession ( timeout = 1000000 ,
    retries = 1,            # I slightly modified the original fastsnmpy
    verbose = debug,        # to reduce verbose messages and limit
    oidlist = var,          # the number of attempts to reach devices
    targets = expand,
    community = 'expand'
)
expandresults = expandsession.multiwalk(mode = 'bulkwalk')

Because of how both SNMP packages behave, the device responses are parsed into lists and stored in one giant data structure. For example,

for output in expandresults:
    print output.hostname, output.iid, output.val
#
host1 1 1
host1 2 2
host1 3 3
host2 1 4
host2 2 5
host2 3 6
# Object 'output' itself cannot be printed directly; the value returned from this is obscure
...

I'm having to iterate through each response, combine related data, then output each device's complete response. This is a bit difficult. For example,

host1,1,2,3
host2,4,5,6
host3,7,8,9,10,11,12
host4,13,14
host5,15,16,17,18
...

Each device has a varying number of responses. I can't loop through the results expecting every device to have a uniform, arbitrary number of values to combine into a string to write out to a CSV.
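
For reference, grouping the variable-length responses with itertools.groupby would look roughly like the minimal sketch below; it assumes expandresults is already ordered by hostname, which matches the example output above.

from itertools import groupby
from operator import attrgetter

# each group is one host's run of consecutive results, whatever its length
for hostname, group in groupby(expandresults, key=attrgetter('hostname')):
    vals = [o.val for o in group if o.val is not None]
    print '%s,%s' % (hostname, ','.join(vals))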


How I'm handling the data

I believe this is where I'm consuming a lot of memory, but I can't work out how to simplify the process while also discarding the data I've already visited.

expandarrays = dict()
for output in expandresults:
    if output.val is not None:
        if output.hostname in expandarrays:
            expandarrays[output.hostname] += ',' + output.val
        else:
            expandarrays[output.hostname] = ',' + output.val

for key in expandarrays:
    self.WriteOut(key,expandarrays[key])

Currently I'm creating a new dictionary, checking that the device response is not null, then appending the response value to a string that will be used to write out to the CSV file.

The problem with this is that I'm essentially cloning the existing data, meaning I'm using twice as much system memory. I'd like to remove the values I've visited in expandresults as I move them into expandarrays so that I'm not using so much RAM. Is there an efficient way of doing this? Is there also a better way of reducing the complexity of my code so that it's easier to follow?
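
For example, consuming the list destructively with pop() removes each visited item as it is processed; this is only a minimal sketch, and it assumes the per-host ordering can simply be reversed back afterwards.

expandarrays = {}
while expandresults:
    output = expandresults.pop()          # drop the visited item from the list
    if output.val is not None:
        expandarrays.setdefault(output.hostname, []).append(output.val)

for host, vals in expandarrays.items():
    vals.reverse()                        # pop() walked the list back to front
    self.WriteOut(host, ',' + ','.join(vals))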


The Culprit

Thanks to those who answered. For anyone in the future who stumbles across this thread with similar issues: the fastsnmpy package is the culprit behind the large use of system memory. The multiwalk() function creates a thread for each host, but does so all at once rather than imposing some kind of upper limit. Since each instance of my script could handle up to 1200 devices, that meant 1200 threads were instantiated and queued within just a few seconds. Using the bulkwalk() function was slower but still fast enough to suit my needs. The difference between the two was 4GB vs 250MB of system memory use.
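
For anyone who wants to keep walking hosts concurrently but cap the thread count themselves, the general pattern looks like the sketch below. It is not fastsnmpy's API: walk_one_host() is a hypothetical helper built from the session setup in the question, and multiprocessing.dummy.Pool (a thread pool) provides the upper limit.

from multiprocessing.dummy import Pool    # thread-backed Pool with the same API

def walk_one_host(hostname):
    # hypothetical helper: bulkwalk a single device and return its parsed rows
    session = SnmpSession(timeout=1000000, retries=1, verbose=debug,
                          oidlist=var, targets=[hostname], community='expand')
    return session.multiwalk(mode='bulkwalk')

pool = Pool(25)                           # never more than 25 walks in flight
per_host_results = pool.map(walk_one_host, expand)
pool.close()
pool.join()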

If the device responses are in order and are grouped together by host, then you don't need a dictionary, just three lists:

last_host = None
hosts = []                # the list of hosts
host_responses = []       # the list of responses for each host
responses = []
for output in expandresults:
    if output.val is not None:
        if output.hostname != last_host:    # new host
            if last_host:    # only append host_responses after a new host
                host_responses.append(responses)
            hosts.append(output.hostname)
            responses = [output.val]        # start the new list of responses
            last_host = output.hostname
        else:                               # same host, append the response
            responses.append(output.val)
host_responses.append(responses)

for host, responses in zip(hosts, host_responses):
    self.WriteOut(host, ','.join(responses))

The memory consumption was due to the instantiation of several workers in an unbounded manner.

I've updated fastsnmpy (latest is version 1.2.1) and uploaded it to PyPi. You can search PyPi for 'fastsnmpy', or grab it directly from my PyPi page at FastSNMPy.

Just finished updating the docs, and posted them to the project page at fastSNMPy DOCS

What I basically did here is replace the earlier model of unbounded workers with a process pool from multiprocessing. The pool size can be passed in as an argument, or defaults to 1.

For simplicity, you now have just two methods: snmpwalk(processes=n) and snmpbulkwalk(processes=n).
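
Usage looks roughly like the sketch below; the session setup is copied from the question, and only the method name and the processes argument come from this update, so treat the rest as an assumption.

expandsession = SnmpSession(timeout=1000000, retries=1, verbose=debug,
                            oidlist=var, targets=expand, community='expand')
expandresults = expandsession.snmpbulkwalk(processes=15)   # bounded pool of 15 workers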

You shouldn't see the memory issue anymore. If you do, please ping me on github.

You might have an easier time figuring out where the memory is going by using a profiler:

https://pypi.python.org/pypi/memory_profiler
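
As a minimal sketch of how that might be applied here, decorate the suspect function with memory_profiler's profile and run the script as usual to get a line-by-line report; the function body below just stands in for the merging step from the question.

from memory_profiler import profile

@profile
def merge_results(expandresults):
    # the printed report shows memory growth per line, so the costly step stands out
    merged = {}
    for output in expandresults:
        if output.val is not None:
            merged.setdefault(output.hostname, []).append(output.val)
    return merged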

Additionally, if you're already tweaking the fastsnmpy classes, you can just change the implementation to do the dictionary-based results merging for you instead of letting it construct a gigantic list first.

How long are you hanging on to the session? The result list will grow indefinitely if you reuse it.
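
For example, here is a minimal sketch of one polling cycle that drops the results before sleeping; process() is a hypothetical stand-in for the merge-and-write step.

import time

while True:
    expandsession = SnmpSession(timeout=1000000, retries=1, verbose=debug,
                                oidlist=var, targets=expand, community='expand')
    expandresults = expandsession.multiwalk(mode='bulkwalk')
    process(expandresults)     # hypothetical: merge per-host values and write the CSV
    del expandresults          # drop the big list before sleeping
    time.sleep(120)            # the two-minute pause from the question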
