Remove certain consecutive duplicates in list
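I have a list of strings like this:
['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
I want to replace the '**', '**' with a single '**', but leave 'bar', 'bar' intact. In other words, replace any run of consecutive '**' with a single one. My current code looks like this: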
p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
np = [p[0]]
for pi in range(1,len(p)):
    if p[pi] == '**' and np[-1] == '**':
        continue
    np.append(p[pi])
Is there a more pythonic way to do this?
Not sure about pythonic, but this should work and is more terse:
star_list = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
star_list = [i for i, next_i in zip(star_list, star_list[1:] + [None])
             if (i, next_i) != ('**', '**')]
The above copies the list twice; if you want to avoid that, consider Tom Zych's method. Or, you could do the following:
from itertools import islice, izip, chain
star_list = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
sl_shift = chain(islice(star_list, 1, None), [None])
star_list = [i for i, next_i in izip(star_list, sl_shift)
             if (i, next_i) != ('**', '**')]
This can be generalized and made iterator-friendly (not to mention more readable) using a variation on the pairwise recipe from the itertools docs:
from itertools import islice, izip, chain, tee
def compress(seq, x):
    seq, shift = tee(seq)
    shift = chain(islice(shift, 1, None), (object(),))
    return (i for i, j in izip(seq, shift) if (i, j) != (x, x))
Test:
>>> list(compress(star_list, '**'))
['**', 'foo', '*', 'bar', 'bar', '**', 'baz']
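The answer above targets Python 2, where izip is the lazy version of zip. As a minimal sketch of the same idea on Python 3 (where the built-in zip is already lazy), the function could look like the following. The name compress_runs is hypothetical, chosen here only to avoid confusion with itertools.compress.

```python
from itertools import chain, islice, tee

def compress_runs(seq, x):
    """Yield the items of seq, collapsing each run of consecutive x into one x."""
    seq, shifted = tee(seq)
    # shifted is seq advanced by one item, padded at the end with a unique sentinel
    shifted = chain(islice(shifted, 1, None), (object(),))
    # an x that is followed by another x is dropped, so the last x of a run survives
    return (i for i, j in zip(seq, shifted) if (i, j) != (x, x))

star_list = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
result = list(compress_runs(iter(star_list), '**'))  # works on a one-shot iterator too
print(result)
```

Because tee buffers only the items that one of its two branches has not consumed yet, the input is never copied as a whole.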
This is pythonic in my opinion:
result = [v for i, v in enumerate(L) if L[i:i+2] != ["**", "**"]]
The only "trick" used is that L[i:i+2] is a one-element list when i == len(L)-1.
Note that the same expression can of course also be used as a generator.
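As a small illustration of the generator form mentioned above, the same condition can be wrapped in parentheses instead of brackets. This is only a sketch: the variable names lazy, first_three and rest are made up for the example, and the input still has to be a list (or another sliceable sequence), because the condition relies on slicing.

```python
L = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']

# Same filter as the list comprehension, but evaluated lazily
lazy = (v for i, v in enumerate(L) if L[i:i+2] != ['**', '**'])

first_three = [next(lazy) for _ in range(3)]  # only three items are computed here
rest = list(lazy)                             # the remainder is computed on demand

print(first_three)
print(rest)
```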
This works. Not sure how Pythonic it is.
import itertools
p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
q = []
for key, group in itertools.groupby(p):
    # a run of '**' contributes one item; any other run is kept at full length
    q.extend([key] * (1 if key == '**' else len(list(group))))
print(q)
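For readers who have not met itertools.groupby, the following sketch shows what it produces for the sample list: one (key, group) pair per run of equal consecutive items. The intermediate list runs is introduced here purely for illustration and is not part of the answer above.

```python
from itertools import groupby

p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']

# Each run of equal neighbours becomes one (key, run length) entry
runs = [(key, len(list(group))) for key, group in groupby(p)]
print(runs)

# Rebuild the list, shrinking only the runs of '**' to length 1
q = []
for key, length in runs:
    q.extend([key] * (1 if key == '**' else length))
print(q)
```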
from itertools import groupby
p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
keep = set(['foo', 'bar', 'baz'])
result = []
for k, g in groupby(p):
    if k in keep:
        result.extend(list(g))
    else:
        result.append(k)
A solution without itertools.groupby():
p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar',]
def treat(A):
    prec = A[0]; yield prec
    for x in A[1:]:
        if (prec,x)!=('**','**'): yield x
        prec = x
print p
print
print list(treat(p))
Result
['**', 'foo', '*', 'bar', 'bar', '**', '**', '**',
'baz', '**', '**',
'foo', '*', '*', 'bar', 'bar','bar', '**', '**',
'foo', 'bar']
['**', 'foo', '*', 'bar', 'bar', '**',
'baz', '**',
'foo', '*', '*', 'bar', 'bar', 'bar', '**',
'foo', 'bar']
Another solution, inspired by dugres:
from itertools import groupby
p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
     'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar',]
res = []
for k, g in groupby(p):
    res.extend( ['**'] if k=='**' else list(g) )
print res
This is like Tom Zych's solution, but simpler.
p = ['**','**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar', '**', '**', '**']
q= ['**',12,'**',45, 'foo',78, '*',751, 'bar',4789, 'bar',3, '**', 5,'**',7, '**',
73,'baz',4, '**',8, '**',20,'foo', 8,'*',36,'*', 36,'bar', 11,'bar',0,'bar',9,
'**', 78,'**',21,'foo',27,'bar',355, '**',33, '**',37, '**','end']
def treat(B,dedupl):
    B = iter(B)
    prec = B.next(); yield prec
    for x in B:
        if not(prec==x==dedupl): yield x
        prec = x
print 'gen = ( x for x in q[::2])'
gen = ( x for x in q[::2])
print 'list(gen)==p is ',list(gen)==p
gen = ( x for x in q[::2])
print 'list(treat(gen)==',list(treat(gen,'**'))
ch = '??h4i4???4t4y?45l????hmo4j5???'
print '\nch==',ch
print "''.join(treat(ch,'?'))==",''.join(treat(ch,'?'))
print "\nlist(treat([],'%%'))==",list(treat([],'%%'))
Result
gen = ( x for x in q[::2])
list(gen)==p is True
list(treat(gen)== ['**', 'foo', '*', 'bar', 'bar', '**', 'baz', '**', 'foo', '*', '*', 'bar', 'bar', 'bar', '**', 'foo', 'bar', '**']
ch== ??h4i4???4t4y?45l????hmo4j5???
''.join(treat(ch,'?'))== ?h4i4?4t4y?45l?hmo4j5?
list(treat([],'%%'))== []
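Remark: a generator function lets you adapt the output to the type of the input by writing code around the call to the generator, without changing the internal code of the generator function; this is not the case with Tom Zych's solution, which cannot be adapted to the input type so easily.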
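To make the remark concrete, here is a hedged Python 3 sketch of the same treat() generator, wrapped in list(), ''.join() and tuple() depending on the input type. The try/except around next() is an adaptation assumed for Python 3: under PEP 479 (the default behaviour since Python 3.7) a StopIteration escaping from inside a generator becomes a RuntimeError, so the empty-input case has to be handled explicitly.

```python
def treat(items, dedupl):
    """Yield items, keeping only the first of each run of consecutive dedupl."""
    iterator = iter(items)
    try:
        prec = next(iterator)
    except StopIteration:
        return  # empty input: nothing to yield
    yield prec
    for x in iterator:
        if not (prec == x == dedupl):
            yield x
        prec = x

# The generator stays the same; only the code around the call changes
as_list = list(treat(['**', '**', 'foo', 'bar', 'bar', '**'], '**'))
as_str = ''.join(treat('??h4i4???4t4y?', '?'))
as_tuple = tuple(treat(('a', 'a', '-', '-', 'b'), '-'))

print(as_list)
print(as_str)
print(as_tuple)
```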
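I searched for a one-line method using a list comprehension or a generator expression.
I found ways to do it; I think it is not possible without groupby():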
from itertools import groupby
from operator import concat
p = ['**', '**','foo', '*', 'bar', 'bar', '**', '**', '**',
'bar','**','foo','sun','sun','sun']
print 'p==',p,'\n'
dedupl = ("**",'sun')
print 'dedupl==',repr(dedupl)
print [ x for k, g in groupby(p) for x in ((k,) if k in dedupl else g) ]
# or
print reduce(concat,( [k] if k in dedupl else list(g) for k, g in groupby(p)),[])
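On Python 3, reduce is no longer a built-in and print is a function, so the two one-liners above need small changes. The following is a sketch of one possible port; replacing reduce(concat, ...) with itertools.chain.from_iterable is my substitution, not something the original answer used. The names nested and flat are illustrative.

```python
from itertools import chain, groupby

p = ['**', '**', 'foo', '*', 'bar', 'bar', '**', '**', '**',
     'bar', '**', 'foo', 'sun', 'sun', 'sun']
dedupl = ('**', 'sun')

# Nested comprehension: a run of a de-duplicated key contributes the key once
nested = [x for k, g in groupby(p) for x in ((k,) if k in dedupl else g)]

# Flattening variant: each group is fully consumed before groupby advances
flat = list(chain.from_iterable((k,) if k in dedupl else g
                                for k, g in groupby(p)))

print(nested)
print(flat)
```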
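Based on the same principle, it is easy to convert dugres's function into a generator function:
from itertools import groupby
def compress(iterable, to_compress):
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x
However, this generator function has two disadvantages:
- it resorts to the function groupby(), which is not easy to understand for someone not used to Python
- its execution time is longer than that of my generator function treat() and of John Machin's generator function, neither of which uses groupby().
I slightly modified them so that they accept a sequence of items to be de-duplicated, and measured the execution times: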
from time import clock
from itertools import groupby
def squeeze(iterable, victims, _dummy=object()):
    if hasattr(iterable, '__iter__') and not hasattr(victims, '__iter__'):
        victims = (victims,)
    previous = _dummy
    for item in iterable:
        if item in victims and item==previous:
            continue
        previous = item
        yield item
def treat(B,victims):
    if hasattr(B, '__iter__') and not hasattr(victims, '__iter__'):
        victims = (victims,)
    B = iter(B)
    prec = B.next(); yield prec
    for x in B:
        if x not in victims or x!=prec: yield x
        prec = x
def compress(iterable, to_compress):
    if hasattr(iterable, '__iter__') and not hasattr(to_compress, '__iter__'):
        to_compress = (to_compress,)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x
p = ['**', '**','su','foo', '*', 'bar', 'bar', '**', '**', '**',
     'su','su','**','bin', '*','*','bar','bar','su','su','su']
n = 10000
te = clock()
for i in xrange(n):
    a = list(compress(p,('**','sun')))
print clock()-te,' generator function with groupby()'
te = clock()
for i in xrange(n):
    b = list(treat(p,('**','sun')))
print clock()-te,' generator function eyquem'
te = clock()
for i in xrange(n):
    c = list(squeeze(p,('**','sun')))
print clock()-te,' generator function John Machin'
print p
print 'a==b==c is ',a==b==c
print a
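The benchmark above is Python 2 code: xrange does not exist on Python 3, and time.clock was removed in Python 3.8. As a rough sketch of how a similar comparison might be run today, the standard timeit module can be used. Only two of the three functions are shown to keep the example short, and the numbers it prints will of course differ from the ones reported in this thread.

```python
from itertools import groupby
from timeit import timeit

def compress(iterable, to_compress):
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k          # one item per run of a key to compress
        else:
            yield from g     # other runs are passed through unchanged

def squeeze(iterable, victims, _dummy=object()):
    previous = _dummy
    for item in iterable:
        if item in victims and item == previous:
            continue
        previous = item
        yield item

p = ['**', '**', 'su', 'foo', '*', 'bar', 'bar', '**', '**', '**',
     'su', 'su', '**', 'bin', '*', '*', 'bar', 'bar', 'su', 'su', 'su']
victims = ('**', 'sun')

for func in (compress, squeeze):
    # timeit calls the lambda `number` times and returns the total in seconds
    seconds = timeit(lambda: list(func(p, victims)), number=10000)
    print(f'{seconds:.3f}s  {func.__name__}')
```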
The instruction
if hasattr(iterable, '__iter__') and not hasattr(to_compress, '__iter__'):
    to_compress = (to_compress,)
is necessary to avoid errors when the iterable argument is a sequence and the other argument is a single string: the latter then needs to be wrapped in a container, provided that the iterable argument is not itself a string.
It relies on the fact that, in Python 2, containers such as tuples, lists, and sets have an __iter__ method, but strings do not. The following code shows the problem:
def compress(iterable, to_compress):
    if hasattr(iterable, '__iter__') and not hasattr( to_compress, '__iter__'):
        to_compress = (to_compress,)
    print 't_compress==',repr(to_compress)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x
def compress_bof(iterable, to_compress):
    if not hasattr(to_compress, '__iter__'): # to_compress is a string
        to_compress = (to_compress,)
    print 't_compress==',repr(to_compress)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x
def compress_bug(iterable, to_compress_bug):
    print 't_compress==',repr(to_compress_bug)
    for k, g in groupby(iterable):
        #print 'k==',k,k in to_compress_bug
        if k in to_compress_bug:
            yield k
        else:
            for x in g: yield x
q = ';;;htr56;but78;;;;$$$$;ios4!'
print 'q==',q
dedupl = ";$"
print 'dedupl==',repr(dedupl)
print
print "''.join(compress (q,"+repr(dedupl)+")) :\n",''.join(compress (q,dedupl))+\
' <-CORRECT ONE'
print
print "''.join(compress_bof(q,"+repr(dedupl)+")) :\n",''.join(compress_bof(q,dedupl))+\
' <====== error ===='
print
print "''.join(compress_bug(q,"+repr(dedupl)+")) :\n",''.join(compress_bug(q,dedupl))
print '\n\n\n'
q = [';$', ';$',';$','foo', ';', 'bar','bar',';',';',';','$','$','foo',';$12',';$12']
print 'q==',q
dedupl = ";$12"
print 'dedupl==',repr(dedupl)
print
print 'list(compress (q,'+repr(dedupl)+')) :\n',list(compress (q,dedupl)),\
' <-CORRECT ONE'
print
print 'list(compress_bof(q,'+repr(dedupl)+')) :\n',list(compress_bof(q,dedupl))
print
print 'list(compress_bug(q,'+repr(dedupl)+')) :\n',list(compress_bug(q,dedupl)),\
' <====== error ===='
print
Result
q== ;;;htr56;but78;;;;$$$$;ios4!
dedupl== ';$'
''.join(compress (q,';$')) :
t_compress== ';$'
;htr56;but78;$;ios4! <-CORRECT ONE
''.join(compress_bof(q,';$')) :
t_compress== (';$',)
;;;htr56;but78;;;;$$$$;ios4! <====== error ====
''.join(compress_bug(q,';$')) :
t_compress== ';$'
;htr56;but78;$;ios4!
q== [';$', ';$', ';$', 'foo', ';', 'bar', 'bar', ';', ';', ';', '$', '$', 'foo', ';$12', ';$12']
dedupl== ';$12'
list(compress (q,';$12')) :
t_compress== (';$12',)
[';$', ';$', ';$', 'foo', ';', 'bar', 'bar', ';', ';', ';', '$', '$', 'foo', ';$12'] <-CORRECT ONE
list(compress_bof(q,';$12')) :
t_compress== (';$12',)
[';$', ';$', ';$', 'foo', ';', 'bar', 'bar', ';', ';', ';', '$', '$', 'foo', ';$12']
list(compress_bug(q,';$12')) :
t_compress== ';$12'
[';$', 'foo', ';', 'bar', 'bar', ';', '$', 'foo', ';$12'] <====== error ====
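The hasattr(..., '__iter__') test only distinguishes strings from other containers on Python 2; on Python 3, str objects do have an __iter__ method, so the check would no longer wrap a lone string. A possible adaptation, offered here as an assumption rather than as part of the original answer, is to test for str explicitly with isinstance:

```python
from itertools import groupby

def compress(iterable, to_compress):
    # Hypothetical Python 3 adaptation: wrap a lone string in a tuple,
    # unless the input being de-duplicated is itself a string
    if not isinstance(iterable, str) and isinstance(to_compress, str):
        to_compress = (to_compress,)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            yield from g

# String input: each character of ';$' is a key to compress
text_result = ''.join(compress(';;;htr56;but78;;;;$$$$;ios4!', ';$'))

# List input: ';$12' is treated as one whole item, not as four characters
q = [';$', ';$', ';$', 'foo', ';', 'bar', 'bar', ';', ';', ';',
     '$', '$', 'foo', ';$12', ';$12']
list_result = list(compress(q, ';$12'))

print(text_result)
print(list_result)
```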
I obtained the following execution times:
0.390163274941 generator function with groupby()
0.324547114228 generator function eyquem
0.310176572721 generator function John Machin
['**', '**', 'su', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'su', 'su', '**', 'bin', '*', '*', 'bar', 'bar', 'su', 'su', 'su']
a==b==c is True
['**', 'su', 'foo', '*', 'bar', 'bar', '**', 'su', 'su', '**', 'bin', '*', '*', 'bar', 'bar', 'su', 'su', 'su']
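I prefer John Machin's solution because it does not need the instruction B = iter(B).
But the instruction previous = _dummy, with _dummy = object(), looks strange to me. So I finally think the better solution is the following code, which works even with a string as the iterable argument, and in which the first object assigned to previous is not a fake one:
def squeeze(iterable, victims):
    if hasattr(iterable, '__iter__') and not hasattr(victims, '__iter__'):
        victims = (victims,)
    for item in iterable:
        previous = item
        break
    # Caveat: for a sequence, the loop below starts again at the first item,
    # so a leading victim is skipped; for a one-shot iterator, the first item
    # has already been consumed above and is never yielded.
    for item in iterable:
        if item in victims and item==previous:
            continue
        previous = item
        yield item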
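The two-loop version above re-reads a sequence from its start and loses the first item of a one-shot iterator. One way to keep the "no fake first object" idea while avoiding both effects is sketched below. It does call iter() explicitly so that both loops share one iterator, which is my assumption about the smallest fix and not the author's code; the name squeeze_no_sentinel is hypothetical.

```python
def squeeze_no_sentinel(iterable, victims):
    """Collapse runs of identical victims without using a sentinel object."""
    iterator = iter(iterable)
    for previous in iterator:
        yield previous      # the first item is always kept
        break
    for item in iterator:   # resumes right after the first item
        if item in victims and item == previous:
            continue
        previous = item
        yield item

from_list = list(squeeze_no_sentinel(['**', '**', 'foo', '**'], ('**',)))
from_iter = list(squeeze_no_sentinel(iter(['a', 'b']), ('**',)))
from_str = ''.join(squeeze_no_sentinel('??h??', '?'))

print(from_list)
print(from_iter)
print(from_str)
```

If the input is empty, neither loop body runs, so previous is never read and nothing is yielded.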
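I didn't like object() being used as a sentinel.
But I was puzzled by the fact that object is called. Yesterday, I thought object was so peculiar that it could never appear in any iterable passed as an argument to squeeze(). So I wondered why you called it, John Machin, and that sowed doubt in my mind about its nature; that is why I asked you to confirm that object is the ultimate base class.
But today, I think I understand why object is called in your code.
In fact, object may well be in an iterable, why not? The class object is itself an object, so nothing prevents it from having been put in an iterable before the iterable is de-duplicated, who knows? Using object itself as a sentinel would then be incorrect practice.
So you did not use object but an instance, object(), as the sentinel.
But I wondered why you chose this mysterious thing, the return value of a call to object.
My reflections on this went on, and I noticed something that must be the reason for the call:
calling object creates an instance, since object is the most basic class in Python, and each time an instance is created it is an object distinct from every previously created instance, whose value always differs from that of any previous instance: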
a = object()
b = object()
c = object()
d = object()
print id(a),'\n',id(b),'\n',id(c),'\n',id(d)
print a==b,a==c,a==d
print b==c,b==d,c==d
Result
10818752
10818760
10818768
10818776
False False False
False False False
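The output above follows from the fact that a plain object() instance uses the default equality, which falls back to identity: it compares equal to itself and to nothing else. The sketch below checks a fresh sentinel against a few values that might plausibly appear in user data. The list candidates is made up for the illustration.

```python
sentinel = object()
other = object()

# Values that might appear in an iterable, including the class object itself
candidates = [None, '**', 0, '', [], object, other]
matches = [value for value in candidates if value == sentinel]

print(matches)               # no candidate compares equal to the sentinel
print(sentinel == sentinel)  # it is equal only to itself
```

This is why a private object() instance is a common choice for a "no previous item yet" marker: no item supplied by the caller can be mistaken for it, unless the caller deliberately passes that very instance.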
So it is certain that _dummy=object() is a unique object, with a unique id and a unique value. By the way, I wonder what the value of an object instance is. Anyway, the following code shows the problem with _dummy=object and the absence of a problem with _dummy=object():
def imperfect_squeeze(iterable, victim, _dummy=object):
    previous = _dummy
    print 'id(previous) ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous: continue
        previous = item; yield item
def squeeze(iterable, victim, _dummy=object()):
    previous = _dummy
    print 'id(previous) ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous: continue
        previous = item; yield item
wat = object
li = [wat,'**','**','foo',wat,wat]
print 'imperfect_squeeze\n''li before ==',li
print map(id,li)
li = list(imperfect_squeeze(li,[wat,'**']))
print 'li after ==',li
print
wat = object()
li = [wat,'**','**','foo',wat,wat]
print 'squeeze\n''li before ==',li
print map(id,li)
li = list(squeeze(li,[wat,'**']))
print 'li after ==',li
print
li = [object(),'**','**','foo',object(),object()]
print 'squeeze\n''li before ==',li
print map(id,li)
li = list(squeeze(li,[li[0],'**']))
print 'li after ==',li
Result
imperfect_squeeze
li before == [<type 'object'>, '**', '**', 'foo', <type 'object'>, <type 'object'>]
[505317320, 18578968, 18578968, 13208848, 505317320, 505317320]
id(previous) == 505317320
id(iterable[0])== 505317320
li after == ['**', 'foo', <type 'object'>]
squeeze
li before == [<object object at 0x00A514C8>, '**', '**', 'foo', <object object at 0x00A514C8>, <object object at 0x00A514C8>]
[10818760, 18578968, 18578968, 13208848, 10818760, 10818760]
id(previous) == 10818752
id(iterable[0])== 10818760
li after == [<object object at 0x00A514C8>, '**', 'foo', <object object at 0x00A514C8>]
squeeze
li before == [<object object at 0x00A514D0>, '**', '**', 'foo', <object object at 0x00A514D8>, <object object at 0x00A514E0>]
[10818768, 18578968, 18578968, 13208848, 10818776, 10818784]
id(previous) == 10818752
id(iterable[0])== 10818768
li after == [<object object at 0x00A514D0>, '**', 'foo', <object object at 0x00A514D8>, <object object at 0x00A514E0>]
The problem is the absence of <type 'object'> as the first element of the list after treatment by imperfect_squeeze().
However, we must note that the "problem" is possible only with a list whose first element is object: that is a lot of reflection about such a tiny probability... but a rigorous coder takes everything into account.
If we use list instead of object, the results are different:
def imperfect_sqlize(iterable, victim, _dummy=list):
    previous = _dummy
    print 'id(previous) ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous: continue
        previous = item; yield item
def sqlize(iterable, victim, _dummy=list()):
    previous = _dummy
    print 'id(previous) ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous: continue
        previous = item; yield item
wat = list
li = [wat,'**','**','foo',wat,wat]
print 'imperfect_sqlize\n''li before ==',li
print map(id,li)
li = list(imperfect_sqlize(li,[wat,'**']))
print 'li after ==',li
print
wat = list()
li = [wat,'**','**','foo',wat,wat]
print 'sqlize\n''li before ==',li
print map(id,li)
li = list(sqlize(li,[wat,'**']))
print 'li after ==',li
print
li = [list(),'**','**','foo',list(),list()]
print 'sqlize\n''li before ==',li
print map(id,li)
li = list(sqlize(li,[li[0],'**']))
print 'li after ==',li
Result
imperfect_sqlize
li before == [<type 'list'>, '**', '**', 'foo', <type 'list'>, <type 'list'>]
[505343304, 18578968, 18578968, 13208848, 505343304, 505343304]
id(previous) == 505343304
id(iterable[0])== 505343304
li after == ['**', 'foo', <type 'list'>]
sqlize
li before == [[], '**', '**', 'foo', [], []]
[18734936, 18578968, 18578968, 13208848, 18734936, 18734936]
id(previous) == 18734656
id(iterable[0])== 18734936
li after == ['**', 'foo', []]
sqlize
li before == [[], '**', '**', 'foo', [], []]
[18734696, 18578968, 18578968, 13208848, 18735016, 18734816]
id(previous) == 18734656
id(iterable[0])== 18734696
li after == ['**', 'foo', []]
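The difference between the object() runs and the list() runs comes down to equality: two distinct empty lists compare equal, whereas two distinct object() instances do not. The following sketch isolates that effect by passing the sentinel in explicitly. The helper squeeze_with and its parameter dummy are hypothetical names used only for this comparison.

```python
def squeeze_with(dummy, items, victims):
    """Same loop as squeeze(), but with the sentinel supplied by the caller."""
    previous = dummy
    for item in items:
        if item in victims and item == previous:
            continue
        previous = item
        yield item

items = [[], '**', '**', 'foo', [], []]
victims = [[], '**']

# list() sentinel: the leading [] equals the sentinel and is wrongly dropped
with_list = list(squeeze_with(list(), items, victims))

# object() sentinel: nothing in the data equals it, so the leading [] is kept
with_object = list(squeeze_with(object(), items, victims))

print(with_list)
print(with_object)
```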
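Is there any object other than object in Python that has this peculiarity?
John Machin, why did you choose an instance of object as the sentinel in your generator function? Did you already know about the peculiarity above?
A generalised "pythonic" solution that works with any iterable (no backing up, no copying, no indexing, no slicing, no failure if the iterable is empty) and any thing-to-squeeze (including None):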
>>> test = ['**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
... 'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar',]
>>>
>>> def squeeze(iterable, victim, _dummy=object()):
...     previous = _dummy
...     for item in iterable:
...         if item == victim == previous: continue
...         previous = item
...         yield item
...
>>> print test
['**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**', 'foo', '*'
, '*', 'bar', 'bar', 'bar', '**', '**', 'foo', 'bar']
>>> print list(squeeze(test, "**"))
['**', 'foo', '*', 'bar', 'bar', '**', 'baz', '**', 'foo', '*', '*', 'bar', 'bar
', 'bar', '**', 'foo', 'bar']
>>> print list(squeeze(["**"], "**"))
['**']
>>> print list(squeeze(["**", "**"], "**"))
['**']
>>> print list(squeeze([], "**"))
[]
>>>
Update after being inspired by @eyquem, who said that victim could not be a sequence (or, presumably, a set).
Having a container of victims means that there are two possible semantics:
>>> def squeeze2(iterable, victims, _dummy=object()):
...     previous = _dummy
...     for item in iterable:
...         if item == previous in victims: continue
...         previous = item
...         yield item
...
>>> def squeeze3(iterable, victims, _dummy=object()):
...     previous = _dummy
...     for item in iterable:
...         if item in victims and previous in victims: continue
...         previous = item
...         yield item
...
>>> guff = "c...d..e.f,,,g,,h,i.,.,.,.j"
>>> print "".join(squeeze2(guff, ".,"))
c.d.e.f,g,h,i.,.,.,.j
>>> print "".join(squeeze3(guff, ".,"))
c.d.e.f,g,h,i.j
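The two semantics can be restated as a Python 3 sketch. In squeeze2 the chained comparison item == previous in victims means item == previous and previous in victims, so only runs of the same victim collapse. In squeeze3 any run made of victims, identical or not, collapses to its first member. The victims are passed as a set here, which is my choice for the example: testing an arbitrary object such as the sentinel for membership in a str raises TypeError, which squeeze3 would hit if the input started with a victim.

```python
def squeeze2(iterable, victims, _dummy=object()):
    previous = _dummy
    for item in iterable:
        # chained comparison: item == previous and previous in victims
        if item == previous in victims:
            continue
        previous = item
        yield item

def squeeze3(iterable, victims, _dummy=object()):
    previous = _dummy
    for item in iterable:
        # skip a victim that directly follows any kept victim
        if item in victims and previous in victims:
            continue
        previous = item
        yield item

guff = "c...d..e.f,,,g,,h,i.,.,.,.j"
victims = set(".,")

same_only = "".join(squeeze2(guff, victims))
any_victim = "".join(squeeze3(guff, victims))

print(same_only)
print(any_victim)
```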
>>>