How to remove duplicates from list of strings based on timestamp
I have the following list:
ls = ["2022-07-17 16:00:02 txt xyz", "2022-07-17 15:00:02 txt xyz", "2022-07-17 16:00:02 txt abc"]
I only want to keep one entry per unique text (xyz and abc), and for each text it should be the entry with the newest timestamp. This is my expected outcome:
ls = ["2022-07-17 16:00:02 txt xyz", "2022-07-17 16:00:02 txt abc"]
My approach was to use a dictionary sorted by value, but then I still don't know how to remove the older timestamp.
import datetime
import re

keep_message = {}
for i in range(len(ls)):
    # everything before " txt" is the timestamp
    timestamp_str = re.search(r"^(.*?) txt", ls[i]).group(1)
    timestamp = datetime.datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S")
    # everything after "txt " is the text
    text = re.search(r"txt (.*?)$", ls[i]).group(1)
    keep_message[text + "_" + timestamp_str] = timestamp
# sort the entries by timestamp, oldest first
keep_message_sorted = dict(sorted(keep_message.items(), key=lambda item: item[1]))
Is there a better solution?
Use a dictionary to keep track of the most recent entry per text:
d = {}
for x in ls:
    # get txt (NB. you can also use a regex)
    ts, txt = x.split(' txt ', 1)
    if txt not in d or x > d[txt]:
        d[txt] = x
out = list(d.values())
NB. I used a simple split to get the txt, and performed the comparison on the full string, since the date comes first and is in a format that sorts correctly as a string. However, you can use another extraction method (regex), and perform the comparison only on the datetime part.
Output:
['2022-07-17 16:00:02 txt xyz', '2022-07-17 16:00:02 txt abc']
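As a sketch of the alternative mentioned in the NB above (regex extraction, comparison on the datetime part only), something like the following should work. The function name keep_newest and the pattern LINE_RE are made up for illustration, and the code assumes every line looks like "<YYYY-MM-DD HH:MM:SS> txt <text>"; lines that do not match are skipped.

```python
import re
from datetime import datetime

ls = [
    "2022-07-17 16:00:02 txt xyz",
    "2022-07-17 15:00:02 txt xyz",
    "2022-07-17 16:00:02 txt abc",
]

# Assumed line layout: "<YYYY-MM-DD HH:MM:SS> txt <text>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) txt (.*)$")


def keep_newest(lines):
    """Return one line per text, keeping the one with the newest timestamp."""
    newest = {}  # text -> (timestamp, original line)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        text = m.group(2)
        # compare parsed datetimes only, not the full string
        if text not in newest or ts > newest[text][0]:
            newest[text] = (ts, line)
    return [line for _, line in newest.values()]


out = keep_newest(ls)
print(out)
```

Comparing parsed datetime objects is slower than the plain string comparison, but it keeps working if the timestamp format is later changed to one that does not sort lexicographically (only the pattern and the strptime format would need updating).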