
Python dictionary with multiple keys pointing to same list in memory efficient way

I have this unique requirement, which can be explained by this code. It is working code, but it is not memory efficient.

data = [[
        "A 5408599",
        "B 8126880",
        "A 2003529",
    ],
    [
        "C 9925336",
        "C 3705674",
        "A 823678571",
        "C 3205170186",
    ],
    [
        "C 9772980",
        "B 8960327",
        "C 4185139021",
        "D 1226285245",
        "C 2523866271",
        "D 2940954504",
        "D 5083193",
    ]]

temp_dict = {
    item: index
    for index, sublist in enumerate(data)
    for item in sublist
}

print(data[temp_dict["A 2003529"]])

out: ['A 5408599', 'B 8126880', 'A 2003529']

In short, I want each item of the sub-lists to be usable as a lookup key that returns the sublist it belongs to.

The above method works, but it takes a lot of memory when the data is large. Is there a better, more memory- and CPU-friendly way? The data is stored as a JSON file.
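For context, a minimal sketch of how the same dictionary could be built straight from the JSON file; the file name data.json is hypothetical, and the file is assumed to hold the list-of-lists structure shown above:

import json

# Hypothetical file; assumed to contain the list of lists shown above.
with open("data.json") as f:
    data = json.load(f)

temp_dict = {
    item: index
    for index, sublist in enumerate(data)
    for item in sublist
}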

Edit: I tried the answers for the largest possible use-case scenario (1000 sublists, 100 items in each sublist, 1 million queries), and here are the results (mean of 10 runs):

Method     Time (seconds)    Extra memory used
my         0.637             40 MB
deceze     0.63              40 MB
James      0.78              200 KB
Pant       > 300             0 KB
mcsoini    forever           0 KB
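A sketch of how such a benchmark could be set up; only the shape (1000 sublists of 100 items, 1 million queries) follows the description above, while the exact item-generation scheme is an assumption:

import random
import time

# Synthetic data shaped like the sample: "<letter> <number>".
data = [["{} {}".format(random.choice("ABCD"), random.randint(0, 10**10))
         for _ in range(100)]
        for _ in range(1000)]

temp_dict = {item: index
             for index, sublist in enumerate(data)
             for item in sublist}

# 1 million lookups against keys known to exist.
queries = random.choices(list(temp_dict), k=10**6)

start = time.perf_counter()
for q in queries:
    _ = data[temp_dict[q]]
print("lookup time:", time.perf_counter() - start, "seconds")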

You can try something like this:

list(filter(lambda x: "C 9772980" in x, data))

No need to make a mapping structure.
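For example, against the sample data above:

matches = list(filter(lambda x: "C 9772980" in x, data))
print(matches)
# [['C 9772980', 'B 8960327', 'C 4185139021', 'D 1226285245',
#   'C 2523866271', 'D 2940954504', 'D 5083193']]

Note that every query scans all sublists, which is why this approach slows down when there are many queries.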

You are really in a trade-off space between the time/memory it takes to generate the dictionary and the time it takes to scan the entire data for an on-the-fly method.

If you want a low-memory method, you can use a function that searches each sublist for the value. Using a generator will get initial results to the user faster, but for large data sets it will be slow between returns.

data = [[
        "A 5408599",
        "B 8126880",
        "A 2003529",
    ],
    [
        "C 9925336",
        "C 3705674",
        "A 823678571",
        "C 3205170186",
    ],
    [
        "C 9772980",
        "B 8960327",
        "C 4185139021",
        "D 1226285245",
        "C 2523866271",
        "D 2940954504",
        "D 5083193",
    ]]


def find_list_by_value(v, data):
    # Lazily yield every sublist that contains the value v.
    for sublist in data:
        if v in sublist:
            yield sublist

for s in find_list_by_value("C 9772980", data):
    print(s)
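Because find_list_by_value is a generator, you can also stop at the first match instead of scanning everything:

# Take only the first matching sublist (None if there is no match).
first_match = next(find_list_by_value("C 9772980", data), None)
print(first_match)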

As mentioned in the comments, building a hash table based on just the first letter or the first 2 or 3 characters might be a good place to start. This will allow you to build a candidate list of sublists, then scan only those to see if the value is in the sublist.

from collections import defaultdict

def get_key(v, size=3):
    # Prefix of the value, used as the hash-table key.
    return v[:size]

def get_keys(sublist, size=3):
    # Set of distinct prefix keys appearing in a sublist.
    return set(get_key(v, size) for v in sublist)

def find_list_by_hash(v, data, hash_table, size=3):
    # Look up candidate sublists via the value's prefix key,
    # then confirm membership with an exact scan.
    key = get_key(v, size)
    candidate_indices = hash_table.get(key, set())
    for ix in candidate_indices:
        if v in data[ix]:
            yield data[ix]

# generate the small hash table
quick_hash = defaultdict(set)
for i, sublist in enumerate(data):
    for k in get_keys(sublist, 3):
        quick_hash[k].add(i)

# lookup a value by the small hash
for s in find_list_by_hash("C 9772980", data, quick_hash, 3):
    print(s)

In this code, quick_hash will take some time to build, because you are scanning your entire data structure. However, the memory footprint will be much smaller. Your main parameter for tuning performance is size. A smaller size will have a smaller memory footprint, but will take longer when running find_list_by_hash because your candidate pool will be larger. You can do some testing to see what the right size is for your data. Just be mindful that all of your values are at least as long as size.
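A rough way to compare candidate settings of size, treating sys.getsizeof as a shallow, relative measure of the table's footprint:

import sys

# Build the table for several key sizes and report the number of keys,
# the shallow table size, and the average candidate-pool size per key.
for size in (1, 2, 3):
    table = defaultdict(set)
    for i, sublist in enumerate(data):
        for k in get_keys(sublist, size):
            table[k].add(i)
    avg_pool = sum(len(v) for v in table.values()) / len(table)
    print(size, len(table), "keys,", sys.getsizeof(table),
          "bytes (shallow), avg pool:", avg_pool)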

Try this, using pandas:

import pandas as pd

df = pd.DataFrame(data)
rows = df.shape[0]
for row in range(rows):
    print(df.iloc[row])    # Do something with your data

This looks like a simple solution; even if your data grows big, it will handle it efficiently.
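If you stay with pandas, a sketch of one way the actual lookup could be done is a boolean mask over the frame (isin and any are standard pandas operations; the queried value is just the example from the question):

# Rows whose cells contain the queried value.
mask = df.isin(["A 2003529"]).any(axis=1)
print(df[mask])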

I'm not entirely sure how this would behave for larger amounts of data, but you could try something along the lines of:

import pandas as pd
df = pd.DataFrame(data).T
df.loc[:, (df == 'A 2003529').any(axis=0)]
Out[39]: 
           0
0  A 5408599
1  B 8126880
2  A 2003529
3       None
4       None
5       None
6       None

Edit: Based on a quick test with some fake larger-scale data, this does not seem to be beneficial in terms of time.
