Python - Sort Lines in File by Regex Date Match

I have a file (based on a class project) of scraped Tweets. At this point lines in the file look like:

@soandso something something <a href="http://pic.twitter.com/aphoto</a><a href="a link" target="_blank">Permalink</a> 1:40 PM - 17 Feb 2016<br><br>
@soandso something something <a href="http://pic.twitter.com/aphoto</a><a href="a link" target="_blank">Permalink</a> 1:32 PM - 16 Feb 2016<br><br>

I'm trying to sort the lines in the file by date. This is what I've cobbled together so far:

import re
from datetime import datetime

when = re.compile(r".+</a>(.+)<br><br>")

with open('tweets.txt','r+') as outfile:
    sortme = outfile.read()

    for match in when.finditer(sortme):
        tweet = match.group(0)
        # Use a separate name so the compiled pattern is not overwritten.
        stamp = datetime.strptime(match.group(1), " %I:%M %p - %d %b %Y")
        print(stamp)

This prints every date found in the lines, converted from the format 1:40 PM - 17 Feb 2016 to 2016-02-17 13:40:00, which I believe is a datetime. I have searched high and low over the last few days for clues about how I'd then sort all the lines in the file by datetime. Thanks for your help!

"I have searched high and low over the last few days for clues about how I'd then sort all the lines in the file by datetime."

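import re
from datetime import datetime

def get_time(line):
    # [^<]+? stops the capture from starting at an earlier </a> on the same line.
    match = re.search(r"</a>\s*([^<]+?)\s*<br><br>", line)
    if match:
        return datetime.strptime(match.group(1), "%I:%M %p - %d %b %Y")
    return datetime.min  # lines without a date sort first

# lines is the list of lines read from the file, e.g. via readlines()
lines.sort(key=get_time)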
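
To get from a file to a sorted file, something like the following sketch may work. The helper name `sort_file_by_date` and its `src`/`dst` parameters are assumptions for illustration, not part of the original answer. The pattern uses `[^<]+?` so that the match begins at the `</a>` directly before the date rather than at an earlier one on the same line.

```python
import re
from datetime import datetime

# The date text sits between the final </a> and the trailing <br><br>.
DATE_RE = re.compile(r"</a>\s*([^<]+?)\s*<br><br>")

def get_time(line):
    """Return the datetime embedded in a tweet line, or datetime.min if none."""
    match = DATE_RE.search(line)
    if match:
        return datetime.strptime(match.group(1), "%I:%M %p - %d %b %Y")
    return datetime.min

def sort_file_by_date(src, dst):
    """Hypothetical helper: read src, sort its lines by date, write them to dst."""
    with open(src) as infile:
        lines = infile.read().splitlines()
    lines.sort(key=get_time)  # oldest first; pass reverse=True for newest first
    with open(dst, "w") as outfile:
        outfile.write("\n".join(lines) + "\n")
```

Calling `sort_file_by_date("tweets.txt", "tweets_sorted.txt")` would leave the input untouched and write the sorted lines to a second file. Lines with no recognizable date, including blank ones, end up at the top.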

This assumes that local time is monotonic over the period covered (e.g., no DST transitions); otherwise you should convert the input times to UTC (or to POSIX timestamps) first.
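
If the timestamps do span a DST transition, one way to do that conversion is with the standard-library `zoneinfo` module (Python 3.9+). This is only a sketch: the zone name `America/New_York` is an assumption, since the scraped lines do not say which time zone they were rendered in, and `to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: the scraped timestamps are in this zone; substitute the real one.
LOCAL_ZONE = ZoneInfo("America/New_York")

def to_utc(text, zone=LOCAL_ZONE):
    """Parse a '1:40 PM - 17 Feb 2016' style string into an aware UTC datetime."""
    naive = datetime.strptime(text, "%I:%M %p - %d %b %Y")
    # Attach the local zone, then convert the resulting instant to UTC.
    return naive.replace(tzinfo=zone).astimezone(timezone.utc)
```

Used inside the sort key, these values order correctly across DST changes. Any fallback for unmatched lines would need to be timezone-aware as well, for example `datetime.min.replace(tzinfo=timezone.utc)`, because naive and aware datetimes cannot be compared. A wall-clock time that occurs twice when clocks go back stays ambiguous unless the `fold` attribute is set.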

It seems you have already solved the regex problem, so to convert your datetime into a measurable quantity, convert it to seconds like so:

import time
time.mktime(when.timetuple())

Then for sorting you can take a lot of different routes. The simplest example is:

import operator
s = [("ab",50),("cd",100),("ef",15)]
print(sorted(s, key=operator.itemgetter(1)))
## [('ef', 15), ('ab', 50), ('cd', 100)]
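
Applied to the tweet lines, the same idea could look like the sketch below, which pairs each line with its timestamp and sorts on the second element. The sample lines are shortened stand-ins for the real data, and `line_timestamp` is a hypothetical helper. Note that `time.mktime` interprets the datetime in the machine's local time zone.

```python
import operator
import re
import time
from datetime import datetime

DATE_RE = re.compile(r"</a>\s*([^<]+?)\s*<br><br>")

def line_timestamp(line):
    """Return the line's date as seconds since the epoch, or 0.0 if absent."""
    match = DATE_RE.search(line)
    if not match:
        return 0.0
    when = datetime.strptime(match.group(1), "%I:%M %p - %d %b %Y")
    return time.mktime(when.timetuple())

# Shortened sample lines (assumed shape, based on the question's examples).
lines = [
    '@soandso something <a href="a link">Permalink</a> 1:40 PM - 17 Feb 2016<br><br>',
    '@soandso something <a href="a link">Permalink</a> 1:32 PM - 16 Feb 2016<br><br>',
]

# Build (line, seconds) tuples and sort on the seconds, as in the example above.
pairs = [(line, line_timestamp(line)) for line in lines]
sorted_lines = [line for line, _ in sorted(pairs, key=operator.itemgetter(1))]
print(sorted_lines[0])  # the 16 Feb line comes first
```

Since `datetime` objects compare directly, sorting on the datetime itself works as well; the conversion to seconds is optional.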
