How to get unique values from a string without removing the delimiter

I have to remove duplicate values from a string in which the child values are separated by a delimiter. My sample string looks like "aa~*yt~*cc~*aa", where ~* is the delimiter, and I need to remove the duplicate occurrence of aa.

I tried using the set command and the code below, but both give this output:

"a~*ytc"

However, I need this output:

"aa~*yt~*cc"

d = {}
s = "aa~*yt~*cc~*aa"
res = []
for c in s:
    if c not in d:
        res.append(c)
        d[c] = 1
print("".join(res))

I have gone through many of the answers provided but was not able to solve this. Please let me know if there is a solution. Thanks, I really appreciate your time :)

You could split the string by the separator, take the set of the resulting list (to remove duplicates), sort the elements by their order of first appearance in the original string, and join them again with ~ as the delimiter:

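s = "aa~*yt~*cc~aa"
parts = s.split('~')

# Sort by first position in the split list; parts.index avoids the
# substring false matches that s.index could produce
'~'.join(sorted(set(parts), key=parts.index))
# 'aa~*yt~*cc'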

If performance matters, build the dictionary used to sort the resulting set beforehand:

l = s.split('~')
length = len(l)
d = {j:length-i for i,j in enumerate(l[::-1])}
# {'aa': 1, '*cc': 3, '*yt': 2}
'~'.join(sorted(set(l), key=lambda x: d[x]))
# 'aa~*yt~*cc'
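
The lookup table can also be built in a single forward pass. A minimal sketch, assuming the same sample string and separator as above; the names parts and first_pos are illustrative and not part of the original answer:

```python
s = "aa~*yt~*cc~aa"
parts = s.split("~")

# Map each piece to the index of its first occurrence;
# setdefault leaves the entry alone if the piece was seen before
first_pos = {}
for i, piece in enumerate(parts):
    first_pos.setdefault(piece, i)

# Order the unique pieces by where they first appeared
result = "~".join(sorted(first_pos, key=first_pos.get))
print(result)  # aa~*yt~*cc
```

Each sort key is then a constant-time dictionary lookup instead of a scan of the string or list.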

Is the order of the substrings relevant?

If the order is not important:

print("~".join(set("aa~*yt~*cc~aa".split("~"))))

If the order is important:

#f7 function source: https://stackoverflow.com/a/480227/11971785
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

print("~".join(f7("aa~*yt~*cc~aa".split("~"))))
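
The same order-preserving idea can be written as an explicit loop and applied to the two-character delimiter ~* from the question (the snippets above split on ~ alone). This is only a sketch; the function name dedupe_delimited and its parameters are made up for illustration:

```python
def dedupe_delimited(text, sep):
    """Drop repeated pieces of `text`, keeping first occurrences in order."""
    seen = set()
    unique = []
    for piece in text.split(sep):
        if piece not in seen:
            seen.add(piece)
            unique.append(piece)
    return sep.join(unique)

# The question's original sample string and delimiter
print(dedupe_delimited("aa~*yt~*cc~*aa", "~*"))  # aa~*yt~*cc
```

Passing the separator as an argument keeps the split and the join in agreement, so the delimiter survives unchanged in the output.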

You can use enumerate with re.findall:

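import re
d = "aa~*yt~*cc~aa"
# Split into runs of word characters and single non-word characters
new_d = re.findall(r'\w+|\W', d)
# r: first occurrence of each alphabetic token; c: iterator over the delimiter characters
r = [a for i, a in enumerate(new_d) if a.isalpha() and a not in new_d[:i]]
c = iter([i for i in new_d if not i.isalpha()])
# Re-insert two delimiter characters after every token except the last
result = ''.join(f'{a}{next(c)}{next(c)}' if i < len(r) - 1 else a for i, a in enumerate(r))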

Output:

'aa~*yt~*cc'

With re.findall, the delimiter characters do not need to be known in advance, although the code above assumes that each delimiter is two characters long.
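
A variant that also copes with delimiters of varying length is to let re.split capture them. This is a sketch rather than the answerer's method; the function name dedupe_keep_delims is invented for illustration, and it assumes that values consist of word characters and that anything else counts as a delimiter:

```python
import re

def dedupe_keep_delims(s):
    # With a capturing group, re.split returns values and delimiters alternately
    pieces = re.split(r'(\W+)', s)
    tokens, delims = pieces[0::2], pieces[1::2]
    seen = set()
    out = []
    for i, tok in enumerate(tokens):
        if tok in seen:
            continue
        if out:
            # Reuse the delimiter that preceded this value in the input
            out.append(delims[i - 1])
        seen.add(tok)
        out.append(tok)
    return ''.join(out)

print(dedupe_keep_delims("aa~*yt~*cc~*aa"))  # aa~*yt~*cc
```

Each kept value is joined with the delimiter that stood in front of it in the input, so mixed delimiters are carried over as written.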

One common way to ensure uniqueness while maintaining order (in all Python variants) is to use a collections.OrderedDict:

from collections import OrderedDict as OD

s = "aa~*yt~*cc~aa"
sep = "~"

uniq = sep.join(OD.fromkeys(s.split(sep)))
# 'aa~*yt~*cc'

Try this one:

>>> s="aa~*yt~*cc~aa"
>>> s_list=s.split("~")
>>> s_final = "~".join([s_list[i] for i in range(len(s_list)) if s_list[0:i].count(s_list[i])==0])
>>> s_final
'aa~*yt~*cc'

Since Python 3.7, dicts preserve insertion order, so you can use them:

>>> '~'.join(dict.fromkeys("aa~yt~cc~aa".split('~')).keys())
'aa~yt~cc'

For other Python versions you can use this solution: https://stackoverflow.com/a/57758708/7851254

However, I wouldn't recommend relying on such a non-obvious feature. You can stick with one of the other answers; just choose one that is understandable at first glance.
