
Python dictionary with multiple keys pointing to same list in memory efficient way

I have this unique requirement, which can be explained by this code. It is working code, but it is not memory efficient.

data = [[
        "A 5408599",
        "B 8126880",
        "A 2003529",
    ],
    [
        "C 9925336",
        "C 3705674",
        "A 823678571",
        "C 3205170186",
    ],
    [
        "C 9772980",
        "B 8960327",
        "C 4185139021",
        "D 1226285245",
        "C 2523866271",
        "D 2940954504",
        "D 5083193",
    ]]

temp_dict = {
    item: index
    for index, sublist in enumerate(data)
    for item in sublist
}

print(data[temp_dict["A 2003529"]])

out: ['A 5408599', 'B 8126880', 'A 2003529']

In short, I want each item of the sub-lists to be usable as a lookup key that returns the sublist it belongs to.

The above method works, but it takes a lot of memory when the data is large. Is there a better, more memory- and CPU-friendly way? The data is stored as a JSON file.
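For context, a minimal sketch of how the same dictionary could be built straight from the JSON file; the file name data.json is hypothetical, and the file is assumed to hold the list-of-lists structure shown above:

import json

# Hypothetical file; assumed to contain the list of lists shown above.
with open("data.json") as f:
    data = json.load(f)

temp_dict = {
    item: index
    for index, sublist in enumerate(data)
    for item in sublist
}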

Edit: I tried the answers for the largest possible use-case scenario (1000 sublists, 100 items in each sublist, 1 million queries), and here are the results (mean of 10 runs):

Method     Time (seconds)    Extra memory used
my         0.637             40 MB
deceze     0.63              40 MB
James      0.78              200 KB
Pant       > 300             0 KB
mcsoini    forever           0 KB
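A sketch of how such a benchmark could be set up; only the shape (1000 sublists of 100 items, 1 million queries) follows the description above, while the exact item-generation scheme is an assumption:

import random
import time

# Synthetic data shaped like the sample: "<letter> <number>".
data = [["{} {}".format(random.choice("ABCD"), random.randint(0, 10**10))
         for _ in range(100)]
        for _ in range(1000)]

temp_dict = {item: index
             for index, sublist in enumerate(data)
             for item in sublist}

# 1 million lookups against keys known to exist.
queries = random.choices(list(temp_dict), k=10**6)

start = time.perf_counter()
for q in queries:
    _ = data[temp_dict[q]]
print("lookup time:", time.perf_counter() - start, "seconds")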

You can try something like this:

list(filter(lambda x: "C 9772980" in x, data))

No need to make a mapping structure.
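For example, against the sample data above:

matches = list(filter(lambda x: "C 9772980" in x, data))
print(matches)
# [['C 9772980', 'B 8960327', 'C 4185139021', 'D 1226285245',
#   'C 2523866271', 'D 2940954504', 'D 5083193']]

Note that every query scans all sublists, which is why this approach slows down when there are many queries.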

You are really in a trade-off space between the time/memory it takes to generate the dictionary and the time it takes to scan the entire data for an on-the-fly method.

If you want a low-memory method, you can use a function that searches each sublist for the value. Using a generator will get initial results to the user faster, but for large data sets it will be slow between returns.

data = [[
        "A 5408599",
        "B 8126880",
        "A 2003529",
    ],
    [
        "C 9925336",
        "C 3705674",
        "A 823678571",
        "C 3205170186",
    ],
    [
        "C 9772980",
        "B 8960327",
        "C 4185139021",
        "D 1226285245",
        "C 2523866271",
        "D 2940954504",
        "D 5083193",
    ]]


def find_list_by_value(v, data):
    # Lazily yield every sublist that contains the value v.
    for sublist in data:
        if v in sublist:
            yield sublist

for s in find_list_by_value("C 9772980", data):
    print(s)
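Because find_list_by_value is a generator, you can also stop at the first match instead of scanning everything:

# Take only the first matching sublist (None if there is no match).
first_match = next(find_list_by_value("C 9772980", data), None)
print(first_match)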

As mentioned in the comments, building a hash table based on just the first letter or the first 2 or 3 characters might be a good place to start. This will allow you to build a candidate list of sublists, then scan only those to see if the value is in the sublist.

from collections import defaultdict

def get_key(v, size=3):
    # Prefix of the value, used as the hash-table key.
    return v[:size]

def get_keys(sublist, size=3):
    # Set of distinct prefix keys appearing in a sublist.
    return set(get_key(v, size) for v in sublist)

def find_list_by_hash(v, data, hash_table, size=3):
    # Look up candidate sublists via the value's prefix key,
    # then confirm membership with an exact scan.
    key = get_key(v, size)
    candidate_indices = hash_table.get(key, set())
    for ix in candidate_indices:
        if v in data[ix]:
            yield data[ix]

# generate the small hash table
quick_hash = defaultdict(set)
for i, sublist in enumerate(data):
    for k in get_keys(sublist, 3):
        quick_hash[k].add(i)

# lookup a value by the small hash
for s in find_list_by_hash("C 9772980", data, quick_hash, 3):
    print(s)

In this code, quick_hash will take some time to build, because you are scanning your entire data structure. However, the memory footprint will be much smaller. Your main parameter for tuning performance is size. A smaller size will have a smaller memory footprint, but will take longer when running find_list_by_hash because your candidate pool will be larger. You can do some testing to see what the right size is for your data. Just be mindful that all of your values are at least as long as size.
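A rough way to compare candidate settings of size, treating sys.getsizeof as a shallow, relative measure of the table's footprint:

import sys

# Build the table for several key sizes and report the number of keys,
# the shallow table size, and the average candidate-pool size per key.
for size in (1, 2, 3):
    table = defaultdict(set)
    for i, sublist in enumerate(data):
        for k in get_keys(sublist, size):
            table[k].add(i)
    avg_pool = sum(len(v) for v in table.values()) / len(table)
    print(size, len(table), "keys,", sys.getsizeof(table),
          "bytes (shallow), avg pool:", avg_pool)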

Try this, using pandas:

import pandas as pd

df = pd.DataFrame(data)
rows = df.shape[0]
for row in range(rows):
    print(df.iloc[row])    # Do something with your data

This looks like a simple solution; even if your data grows big, it will handle it efficiently.
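If you stay with pandas, a sketch of one way the actual lookup could be done is a boolean mask over the frame (isin and any are standard pandas operations; the queried value is just the example from the question):

# Rows whose cells contain the queried value.
mask = df.isin(["A 2003529"]).any(axis=1)
print(df[mask])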

I'm not entirely sure how this would behave for larger amounts of data, but you could try something along the lines of:

import pandas as pd
df = pd.DataFrame(data).T
df.loc[:, (df == 'A 2003529').any(axis=0)]
Out[39]: 
           0
0  A 5408599
1  B 8126880
2  A 2003529
3       None
4       None
5       None
6       None

Edit: Based on a quick test with some fake larger-scale data, this does not seem to be beneficial in terms of time.
