Fast ordered sort (and extract from) tuple based on list

I am a Python newbie and I have been trying to sort (and extract) values from a tuple based on values in a list, but so far my code seems really slow.

So, I have a list like so:

x = ["d5b44796d43c4bf5a0f252aeb49738f5", "04d0e11f8ceb4b128fa723181369ba1a", "6244dd8bfee44a61800a25d9f2e6f743", "662ae26640a44a37816daa6e85ef4972", "7d5e1f59f7984495877a059bea643954"]

Then I have a list of tuples like so:

y = [(31, u'dir/04d0e11f8ceb4b128fa723181369ba1a.mov'), (32, u'dir/d5b44796d43c4bf5a0f252aeb49738f5.pdf'), (66, u'dir/6244dd8bfee44a61800a25d9f2e6f743.jpg'), (34, u'dir/662ae26640a44a37816daa6e85ef4972.doc'), (33, u'dir/7d5e1f59f7984495877a059bea643954.ppt')]

I would like to get the id from y if an element of x is present in y[i][1]. So, something like this:

id_list=[]
for i in x:
    for j in y:
        if i in j[1]:
            try:
                id_list.append(j[0])
            except:
                pass
            break
        else:
            pass

I get:

id_list = [32, 31, 66, 34, 33]

Also, the result set has to maintain the order in x. The above loop does this.

The problem is that the above code is very slow (ashamed of it!): my x has thousands of elements and so does y.

So I guess my question is whether there is a better way to write the above code. I was thinking of iterators here but was not entirely sure how to write one in this case.

id_list = [j[0] for j in sorted(y, key=lambda e: x.index(e[1].split('/')[-1].split('.')[0]))]    

This can be improved if x were a dict, since lookup will be faster, so we'll use an OrderedDict to maintain the order:

import collections
from os.path import basename, splitext

x = collections.OrderedDict((e, i) for i, e in enumerate(x))

id_list = [j[0] for j in sorted(y, key=lambda e: x[splitext(basename(e[1]))[0]])]
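Only the mapping from name to position is used in the sort key, so a plain dict should work as well as an OrderedDict here. The following is a minimal sketch under that assumption; the helper name `ids_in_order` is hypothetical, and it skips entries of y whose name is not in x rather than raising KeyError.

```python
from os.path import basename, splitext

def ids_in_order(x, y):
    """Return the ids from y, sorted by where their file name appears in x."""
    # Map each name to its position in x (one pass over x).
    position = {name: i for i, name in enumerate(x)}
    keyed = []
    for id_, filepath in y:
        name = splitext(basename(filepath))[0]
        # Ignore files whose bare name is not listed in x.
        if name in position:
            keyed.append((position[name], id_))
    # Sorting by position restores the order of x.
    return [id_ for _, id_ in sorted(keyed)]

x = ["d5b44796d43c4bf5a0f252aeb49738f5",
     "04d0e11f8ceb4b128fa723181369ba1a",
     "6244dd8bfee44a61800a25d9f2e6f743"]
y = [(31, u'dir/04d0e11f8ceb4b128fa723181369ba1a.mov'),
     (32, u'dir/d5b44796d43c4bf5a0f252aeb49738f5.pdf'),
     (66, u'dir/6244dd8bfee44a61800a25d9f2e6f743.jpg')]

print(ids_in_order(x, y))  # [32, 31, 66]
```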

In [3]: # str.strip('dir') strips characters, not a prefix, so split the path instead
   ...: y1 = [(elem[1].split('/')[-1].split('.')[0], elem[0]) for elem in y]

In [4]: res = [(i, j[1]) for i in x for j in y1 if i == j[0]]

In [5]: res
Out[5]: 
[('d5b44796d43c4bf5a0f252aeb49738f5', 32),
 ('04d0e11f8ceb4b128fa723181369ba1a', 31),
 ('6244dd8bfee44a61800a25d9f2e6f743', 66),
 ('662ae26640a44a37816daa6e85ef4972', 34),
 ('7d5e1f59f7984495877a059bea643954', 33)]

In [6]: [elem[1] for elem in res]
Out[6]: [32, 31, 66, 34, 33]

If you want to maintain the order in x, you need to extract all ids in y and put them in a set, then iterate over x to check whether an item is in the set:

>>> x = ["d5b44796d43c4bf5a0f252aeb49738f5", "04d0e11f8ceb4b128fa723181369ba1a", "6244dd8bfee44a61800a25d9f2e6f743", "662ae26640a44a37816daa6e85ef4972", "7d5e1f59f7984495877a059bea643954"]
>>> y = [(31, u'dir/04d0e11f8ceb4b128fa723181369ba1a.mov'), (32, u'dir/d5b44796d43c4bf5a0f252aeb49738f5.pdf'), (66, u'dir/6244dd8bfee44a61800a25d9f2e6f743.jpg'), (34, u'dir/662ae26640a44a37816daa6e85ef4972.doc'), (33, u'dir/7d5e1f59f7984495877a059bea643954.ppt')]
>>> import re
>>> s = set()
>>> for e in y:
...     r = re.match(r'^dir/(.*)\.', e[1])
...     if r:
...         s.add(r.group(1))
...
>>> [e for e in x if e in s]

x = ["d5b44796d43c4bf5a0f252aeb49738f5", "04d0e11f8ceb4b128fa723181369ba1a", "6244dd8bfee44a61800a25d9f2e6f743", "662ae26640a44a37816daa6e85ef4972", "7d5e1f59f7984495877a059bea643954"]

xset = set(x)

y = [(31, u'dir/04d0e11f8ceb4b128fa723181369ba1a.mov'), (32, u'dir/d5b44796d43c4bf5a0f252aeb49738f5.pdf'), (66, u'dir/6244dd8bfee44a61800a25d9f2e6f743.jpg'), (34, u'dir/662ae26640a44a37816daa6e85ef4972.doc'), (33, u'dir/7d5e1f59f7984495877a059bea643954.ppt')]

# Note: the ids come out in the order of y, not the order of x.
print([num for num, path in y if path.split('/')[1].split('.')[0] in xset])

In this answer: using [:-4] may not be a good idea. What if we have dir/04d0e11f8ceb4b128fa723181369ba1a.rmvb? I'd suggest using os.path.splitext(os.path.basename(thefilepath))[0] to get the file name.
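To illustrate the point about slicing, here is a small sketch that compares [:-4] with splitext on a three-letter and a four-letter extension. The helper name `stem` is hypothetical and only wraps the expression suggested above.

```python
from os.path import basename, splitext

def stem(filepath):
    """Return the file name without its directory or extension."""
    return splitext(basename(filepath))[0]

paths = ["dir/04d0e11f8ceb4b128fa723181369ba1a.mov",
         "dir/04d0e11f8ceb4b128fa723181369ba1a.rmvb"]

for p in paths:
    sliced = basename(p)[:-4]  # only correct for a dot plus three letters
    print(p, sliced == stem(p))
```

For the .mov path the two results agree; for the .rmvb path the slice leaves a trailing dot, so the comparison is False.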

So my idea is: we map each element to its id first. yy should be:

{u'7d5e1f59f7984495877a059bea643954': 33,u'6244dd8bfee44a61800a25d9f2e6f743': 66, u'662ae26640a44a37816daa6e85ef4972': 34, u'04d0e11f8ceb4b128fa723181369ba1a': 31, u'd5b44796d43c4bf5a0f252aeb49738f5': 32}

and then we get the id using yy[element], and the order should be as before.


The solution:

from os import path

yy = {path.splitext(path.basename(j))[0]:i for (i, j) in y}
xx = [yy[i] for i in x]
print(xx)

# output
[32, 31, 66, 34, 33]
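Note that yy[i] raises KeyError if some element of x has no matching file in y. The question does not say what should happen in that case; the sketch below assumes such elements should simply be skipped. The function name `lookup_ids` and the placeholder entry "not-in-y" are hypothetical.

```python
from os import path

def lookup_ids(x, y):
    """Return ids from y in the order of x, skipping names absent from y."""
    # One pass over y builds the name -> id mapping.
    yy = {path.splitext(path.basename(p))[0]: id_ for id_, p in y}
    # One pass over x keeps its order; each dict lookup is O(1) on average.
    return [yy[name] for name in x if name in yy]

x = ["d5b44796d43c4bf5a0f252aeb49738f5",
     "not-in-y",  # hypothetical entry with no matching file
     "04d0e11f8ceb4b128fa723181369ba1a"]
y = [(31, u'dir/04d0e11f8ceb4b128fa723181369ba1a.mov'),
     (32, u'dir/d5b44796d43c4bf5a0f252aeb49738f5.pdf')]

print(lookup_ids(x, y))  # [32, 31]
```

Compared with the nested loop in the question, this does one pass over each list instead of up to len(x) * len(y) substring checks.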
