
Python: find index of item containing X in list

I have a huge list of data, more than 1M records, in a form similar to this (though this is a much simpler form):

[
  {'name': 'Colby Karnopp', 'ids': [441, 231, 822]}, 
  {'name': 'Wilmer Lummus', 'ids': [438, 548, 469]},
  {'name': 'Hope Teschner', 'ids': [735, 747, 488]}, 
  {'name': 'Adolfo Fenrich', 'ids': [515, 213, 120]} 
  ... 
]

Given an id of 735, I want to find the index 2 for Hope Teschner, since the given id falls within the list of ids for Hope. What is the best (performance-wise) way to do this?

Thanks for any tips.

EDIT

I probably should have mentioned this, but an id could show up more than once. If a particular id does show up more than once, I want the lowest index for the given id.

The data in the list will be changing frequently, so I am hesitant to build a dictionary: because the indexes are the values in the dict, the dictionary would need to be modified or rebuilt with each update to the list. That is, changing the position of one item in the list would require updating every value in the dictionary whose index is greater than the newly changed index.

EDIT EDIT

I just did some benchmarking and it seems that rebuilding the dictionary is quite fast even for 1M+ records. I think I will pursue this solution for now.

The simplest way to get the first index satisfying the condition (in Python 2.6 or better):

next((i for i, d in enumerate(hugelist) if 735 in d['ids']), None)

This gives None if no item satisfies the condition. More generally, you can pass whatever you require in that case as the second argument to the next built-in, or omit the second argument (in which case you can remove one set of parentheses) if you're OK with getting a StopIteration exception when no item satisfies the condition (e.g., you know that situation is impossible).
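As a minimal sketch of both variants, using the four sample records from the question (first_index is a hypothetical helper name introduced here for illustration, and 999 is simply an id assumed not to be present):

```python
hugelist = [
    {'name': 'Colby Karnopp', 'ids': [441, 231, 822]},
    {'name': 'Wilmer Lummus', 'ids': [438, 548, 469]},
    {'name': 'Hope Teschner', 'ids': [735, 747, 488]},
    {'name': 'Adolfo Fenrich', 'ids': [515, 213, 120]},
]

def first_index(records, wanted, default=None):
    """Return the lowest index whose 'ids' list contains wanted, else default."""
    return next((i for i, d in enumerate(records) if wanted in d['ids']), default)

found = first_index(hugelist, 735)         # index of Hope Teschner's record
missing = first_index(hugelist, 999)       # no match: falls back to None
sentinel = first_index(hugelist, 999, -1)  # no match: falls back to -1

# Without a second argument, next() raises StopIteration when nothing matches.
try:
    next(i for i, d in enumerate(hugelist) if 999 in d['ids'])
    raised = False
except StopIteration:
    raised = True
```

The generator stops at the first hit, so a match near the start of the list is cheap; a miss still scans every record.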

If you need to do this kind of operation more than a very few times between changes to hugelist or its contents, then, as you indicate in the second edit to your question, building an auxiliary dict (from each integer to the index of the first dict containing it) is preferable. Since you want the first applicable index, you want to iterate backwards (so hits that are closer to the start of hugelist override ones that are further on), for example:

auxdict = {}
L = len(hugelist) - 1
for i, d in enumerate(reversed(hugelist)):
  auxdict.update(dict.fromkeys(d['ids'], L-i))

[[You cannot use reversed(enumerate(...)) because enumerate returns an iterator, not a list, and reversed is optimized to only work on a sequence argument, hence the need for L-i.]]
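A small sketch of that restriction, plus one possible alternative that walks the indexes backwards with range instead of computing L-i. The two-record list is made-up sample data with a deliberately shared id; it is not from the question.

```python
# Hypothetical sample data: id 2 appears in both records.
hugelist = [
    {'name': 'A', 'ids': [1, 2]},
    {'name': 'B', 'ids': [2, 3]},
]

# reversed() rejects an enumerate object because it is an iterator, not a sequence.
try:
    reversed(enumerate(hugelist))
    error = None
except TypeError as exc:
    error = exc

# Iterating over the indexes in descending order gives the same effect as
# enumerate(reversed(...)) with L-i: earlier records overwrite later ones.
auxdict = {}
for i in range(len(hugelist) - 1, -1, -1):
    auxdict.update(dict.fromkeys(hugelist[i]['ids'], i))
```

Here auxdict maps the shared id 2 to index 0, the lowest index at which it occurs.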

You can build auxdict in other ways, including without the reversal, for example:

auxdict = {}
for i, d in enumerate(hugelist):
  for item in d['ids']:
    if item not in auxdict:
      auxdict[item] = i

but this is likely to be substantially slower due to the huge number of if statements that execute in the inner loop. The direct dict constructor (taking a sequence of key/value pairs) is also likely to be slower, due to the need for inner loops:

L = len(hugelist) - 1
auxdict = dict((item, L-i) for i, d in enumerate(reversed(hugelist)) for item in d['ids'])

However, these are just qualitative considerations. Consider running benchmarks over a few "typical / representative" examples of the values you could have in hugelist (using timeit at the command-line prompt, as I've often recommended) to measure the relative speeds of these approaches, as well as how their runtimes compare to that of an unaided lookup as I showed at the start of this answer. That ratio, plus the average number of lookups you expect to perform between successive hugelist changes, will help you select the overall strategy.
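One way such a benchmark might look, using the timeit module from a script rather than the command line. The data is synthetic (10,000 randomly generated records, an assumption made here only to keep the run short), and the function names are hypothetical; real timings depend on your actual data.

```python
import random
import timeit

random.seed(0)
# Synthetic stand-in for the real data: 10,000 records with three ids each.
hugelist = [{'name': 'n%d' % i,
             'ids': [random.randrange(30000) for _ in range(3)]}
            for i in range(10000)]

def build_reversed():
    """Backwards pass with dict.update, as in the first auxdict example."""
    aux = {}
    last = len(hugelist) - 1
    for i, d in enumerate(reversed(hugelist)):
        aux.update(dict.fromkeys(d['ids'], last - i))
    return aux

def build_forward():
    """Forwards pass with an explicit membership test per id."""
    aux = {}
    for i, d in enumerate(hugelist):
        for item in d['ids']:
            if item not in aux:
                aux[item] = i
    return aux

def scan(wanted):
    """Unaided linear lookup; -1 never matches, so this times the worst case."""
    return next((i for i, d in enumerate(hugelist) if wanted in d['ids']), None)

for label, func in [('reversed + update', build_reversed),
                    ('forward + if', build_forward),
                    ('worst-case scan', lambda: scan(-1))]:
    # Take the best of three runs of five calls each, reported per call.
    seconds = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print('%-20s %.6f s per call' % (label, seconds))
```

Dividing a build time by the scan time gives a rough break-even number of lookups per list change.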

Performance-wise, if you have 1M records you might want to switch to a database or a different data structure. With the given data structure this will be a linear-time operation. If you plan to run this query often, though, you could build a dict from ID to record once.
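If the database route is taken, one possible shape is sketched below with the standard-library sqlite3 module and an in-memory database. The table name record_ids, its columns, and the index name are hypothetical choices made for this illustration, and stored positions go stale when the list is reordered, just as dict values would.

```python
import sqlite3

hugelist = [
    {'name': 'Colby Karnopp', 'ids': [441, 231, 822]},
    {'name': 'Wilmer Lummus', 'ids': [438, 548, 469]},
    {'name': 'Hope Teschner', 'ids': [735, 747, 488]},
    {'name': 'Adolfo Fenrich', 'ids': [515, 213, 120]},
]

conn = sqlite3.connect(':memory:')
# One row per (id, list position) pair; the index makes lookups by id fast.
conn.execute('CREATE TABLE record_ids (id INTEGER, idx INTEGER)')
conn.execute('CREATE INDEX record_ids_by_id ON record_ids (id, idx)')
conn.executemany(
    'INSERT INTO record_ids (id, idx) VALUES (?, ?)',
    ((id_, idx) for idx, d in enumerate(hugelist) for id_ in d['ids']),
)

def lowest_index(wanted):
    """Return the lowest list position holding wanted, or None if absent."""
    row = conn.execute(
        'SELECT MIN(idx) FROM record_ids WHERE id = ?', (wanted,)
    ).fetchone()
    return row[0]

hope_index = lowest_index(735)
absent = lowest_index(999)
```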

The best way would probably be to set up a reverse dict() from the IDs to the names.

Can two or more dicts share the same ID? If so, I presume you will need to return a list of indexes.

If you want to do a one-off search, you can do it with a list comprehension:

>>> x = [
...   {'name': 'Colby Karnopp', 'ids': [441, 231, 822]}, 
...   {'name': 'Wilmer Lummus', 'ids': [438, 548, 469]},
...   {'name': 'Hope Teschner', 'ids': [735, 747, 488]}, 
...   {'name': 'Adolfo Fenrich', 'ids': [515, 213, 120]},
      ...
...  ]

>>> print([idx for (idx, d) in enumerate(x) if 735 in d['ids']])
[2]

However, if you want to do this a lot and the list does not change much, it is much better to create an inverse index:

>>> indexes = dict((id, idx) for (idx,d) in enumerate(x) for id in d['ids'])
>>> indexes
{213: 3, 515: 3, 548: 1, 822: 0, 231: 0, 488: 2, 747: 2, 469: 1, 438: 1, 120: 3, 441: 0, 735: 2}
>>> indexes[735]
2

NB: the above code assumes that each ID is unique. If there are duplicates, replace the dict with a collections.defaultdict(list).
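A minimal sketch of the defaultdict(list) variant. The sample data is modified for illustration so that 441 appears in two records; that duplicate is an assumption made here, not part of the question's data.

```python
from collections import defaultdict

x = [
    {'name': 'Colby Karnopp', 'ids': [441, 231, 822]},
    {'name': 'Wilmer Lummus', 'ids': [438, 548, 441]},  # 441 repeated on purpose
    {'name': 'Hope Teschner', 'ids': [735, 747, 488]},
]

# Map each id to every list position it occurs at, in ascending order.
indexes = defaultdict(list)
for idx, d in enumerate(x):
    for id_ in d['ids']:
        indexes[id_].append(idx)

all_441 = indexes[441]        # every position containing 441
lowest_441 = indexes[441][0]  # enumerate runs forwards, so the first entry is the lowest
```

Use indexes.get(some_id) rather than indexes[some_id] when the id may be absent, since indexing a defaultdict inserts an empty list for a missing key.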

NNB: the above code returns the index into the original list, since that is what you asked for. However, it is probably better to return the actual dict instead of the index, unless you want to use the index to delete the item from the list.

If the frequency of building the index is low:

Create a lookup array of index values into your main list, e.g.

lookup = [-1,-1,-1...]

...
def addtolookup
...

mainlistindex =lookup[myvalue]
if mainlistindex!=-1:
 name=mainlist[mainlistindex].name
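A runnable sketch of the lookup-array idea above. It assumes ids are small non-negative integers bounded by MAX_ID (a hypothetical constant chosen here), and add_to_lookup is one possible way to fill in the elided helper, not the author's actual implementation. Records are dicts as in the question, so the name is read with ['name'].

```python
mainlist = [
    {'name': 'Colby Karnopp', 'ids': [441, 231, 822]},
    {'name': 'Wilmer Lummus', 'ids': [438, 548, 469]},
    {'name': 'Hope Teschner', 'ids': [735, 747, 488]},
    {'name': 'Adolfo Fenrich', 'ids': [515, 213, 120]},
]

MAX_ID = 1000  # assumed upper bound on id values
lookup = [-1] * (MAX_ID + 1)  # -1 marks "id not present"

def add_to_lookup(index, ids):
    """Record index for each id, keeping the lowest index seen so far."""
    for value in ids:
        if lookup[value] == -1 or index < lookup[value]:
            lookup[value] = index

for index, record in enumerate(mainlist):
    add_to_lookup(index, record['ids'])

def name_for(value):
    """Return the name of the first record containing value, or None."""
    mainlistindex = lookup[value]
    if mainlistindex != -1:
        return mainlist[mainlistindex]['name']
    return None
```

The array costs one slot per possible id, which is only reasonable when the ids are dense.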

If the frequency is high, consider the sorting approach (I think this is what is meant by the Schwartzian Transform answer). This might be good if rebuilding your tree whenever the source list changes is costing you more than getting data out of the built index. Slotting data into an existing list (one that, crucially, knows the other possible matches for an id, for when the previous best match stops being associated with that id) will be faster than building a list from scratch on every delta.

EDIT

This assumes that your IDs are densely populated integers.

To improve performance when accessing the sorted list, it can be partitioned into blocks of, say, 400-600 entries, which avoids repeatedly shifting the entire list forwards or backwards by one or a few positions, and searched with a binary search.
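One reading of the sorted-list idea, sketched with the standard bisect module: keep (id, index) pairs sorted and binary-search them. This is an interpretation rather than the answer's exact design, it leaves out the partitioning into blocks, and the sample data (with a repeated id 2) and function names are hypothetical.

```python
import bisect

# Hypothetical sample data: id 2 appears in the first two records.
records = [
    {'name': 'A', 'ids': [5, 2]},
    {'name': 'B', 'ids': [2, 9]},
    {'name': 'C', 'ids': [7]},
]

# Sorted (id, index) pairs: for a repeated id, the lowest index sorts first.
pairs = sorted((id_, idx) for idx, d in enumerate(records) for id_ in d['ids'])

def lowest_index(wanted):
    """Binary-search for the first pair with this id; None if absent."""
    # The 1-tuple (wanted,) sorts before every (wanted, idx) pair.
    pos = bisect.bisect_left(pairs, (wanted,))
    if pos < len(pairs) and pairs[pos][0] == wanted:
        return pairs[pos][1]
    return None

def add_pair(id_, idx):
    """Slot one new pair into place without rebuilding the whole list."""
    bisect.insort(pairs, (id_, idx))
```

Because every match for an id stays in the list, removing the current best pair leaves the next-lowest index in place as the new answer.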

It seems that the data structure is ill-suited to its use. Changing the list is costly: both the change itself (if you do any insertions/deletions) and the resulting need to rebuild a dict or do a linear scan every time.

The question is: how is your list changing?

Perhaps instead of using indexes (which change frequently), you could use objects, and hold references to the objects themselves instead of worrying about indexes?
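A sketch of the reference-based idea: map each id to the record object itself, so that reordering the list does not invalidate the mapping. The name by_id is hypothetical, and "first record" here means first at build time; the mapping does not track which record is earliest after later reordering.

```python
records = [
    {'name': 'Colby Karnopp', 'ids': [441, 231, 822]},
    {'name': 'Wilmer Lummus', 'ids': [438, 548, 469]},
    {'name': 'Hope Teschner', 'ids': [735, 747, 488]},
    {'name': 'Adolfo Fenrich', 'ids': [515, 213, 120]},
]

# Map each id to the first record (in current list order) that contains it.
by_id = {}
for record in records:
    for id_ in record['ids']:
        by_id.setdefault(id_, record)

hope = by_id[735]

# Move the last record to the front: every list index shifts,
# but the stored references still point at the same objects.
records.insert(0, records.pop())
still_hope = by_id[735]
```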
