How to remove duplicates from list of strings based on timestamp

I have the following list:

ls = ["2022-07-17 16:00:02 txt xyz", "2022-07-17 15:00:02 txt xyz", "2022-07-17 16:00:02 txt abc"]

I want to keep one entry per unique text (xyz and abc), namely the one with the newest timestamp. This is my expected outcome:

ls = ["2022-07-17 16:00:02 txt xyz", "2022-07-17 16:00:02 txt abc"]

My approach was to use a dictionary sorted by value, but then I still don't know how to remove the older timestamp.

import datetime
import re

keep_message = {}
for i in range(len(ls)):
    timestamp_str = re.search(r"^(.*?) txt", ls[i]).group(1)
    timestamp = datetime.datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S")
    text = re.search(r"txt (.*?)$", ls[i]).group(1)
    keep_message[text + "_" + timestamp_str] = timestamp

keep_message_sorted = dict(sorted(keep_message.items(), key=lambda item: item[1]))

Is there a better solution?

Use a dictionary to keep track of the most recent entry per text:

d = {}
for x in ls:
    # get txt (NB. you can also use a regex)
    ts, txt = x.split(' txt ', 1)
    if txt not in d or x > d[txt]:
        d[txt] = x

out = list(d.values())

NB. I used a simple split to get the text, and performed the comparison on the full string, because the date comes first and its format (YYYY-MM-DD HH:MM:SS) sorts chronologically when compared as a string. However, you can use another extraction method (regex) and perform the comparison only on the datetime part.

Output:

['2022-07-17 16:00:02 txt xyz', '2022-07-17 16:00:02 txt abc']
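
The note above mentions that the text can also be extracted with a regex and that the comparison can be limited to the datetime part. A minimal sketch of that variant follows. The function name keep_newest and the pattern LINE_RE are hypothetical names chosen for illustration, and the sketch assumes every entry has the form "YYYY-MM-DD HH:MM:SS txt <text>"; entries that do not match are skipped, which is an assumption rather than something the question specifies.

```python
import re
from datetime import datetime

# Hypothetical pattern: a timestamp, the literal " txt ", then the message text.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) txt (.*)$")


def keep_newest(entries):
    """Return one entry per text, keeping the one with the latest timestamp."""
    newest = {}  # text -> (parsed timestamp, original entry)
    for entry in entries:
        match = LINE_RE.match(entry)
        if match is None:
            continue  # assumption: skip entries that do not fit the pattern
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        text = match.group(2)
        # Compare parsed datetimes only, not the full strings.
        if text not in newest or ts > newest[text][0]:
            newest[text] = (ts, entry)
    return [entry for _, entry in newest.values()]


ls = [
    "2022-07-17 16:00:02 txt xyz",
    "2022-07-17 15:00:02 txt xyz",
    "2022-07-17 16:00:02 txt abc",
]
out = keep_newest(ls)
print(out)
```

Parsing the timestamp costs a little more than comparing strings, but it would still work if the date format were changed to one that does not sort correctly as text, provided the pattern and format string were updated to match.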
