
Extract part of string according to pattern using regular expression Python

I have files that follow a specific naming format, which looks something like this:

test_0800_20180102_filepath.csv
anotherone_0800_20180101_hello.csv

The numbers in the middle represent timestamps, so I would like to extract that information. I know that there is a specific pattern which will always be _time_date_ , so essentially I want the part of the string that lies between the first and third underscores. I found some examples of somewhat similar problems, but I am new to Python and I am having trouble adapting them.

This is what I have implemented thus far:

datetime = re.search(r"\d+_(\d+)_", "test_0800_20180102_filepath.csv")

But the result I get is only the date part:

20180102

But what I actually need is:

0800_20180102

That's quite simple:

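import re

your_string = "anotherone_0800_20180101_hello.csv"  # second sample name from the question
match = re.search(r"_((\d+)_(\d+))_", your_string)

print(match.group(1))  # print time_date >> 0800_20180101
print(match.group(2))  # print time >> 0800
print(match.group(3))  # print date >> 20180101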

Note that for such tasks the grouping operator () inside the regex is really helpful: it lets you access individual substrings of a larger pattern without having to match each one separately, which is often more ambiguous than matching the larger pattern once.

Groups are numbered from 1 up to the number of groups in the pattern, and group 0 is the whole match. The numbers are assigned from left to right, in the order in which the opening parentheses appear in your pattern.
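To make the numbering concrete, here is a small sketch that prints every group for one of the sample names from the question. The variable names are only illustrative.

```python
import re

# Sample name taken from the question.
filename = "anotherone_0800_20180101_hello.csv"

match = re.search(r"_((\d+)_(\d+))_", filename)
if match is None:
    # re.search returns None when the pattern is not found.
    raise ValueError(f"no time_date part in {filename!r}")

whole = match.group(0)    # entire match, including the surrounding underscores
groups = match.groups()   # tuple holding groups 1..n in left-to-right order

print(whole)   # _0800_20180101_
print(groups)  # ('0800_20180101', '0800', '20180101')
```

Group 1 is the outer pair of parentheses because its opening parenthesis comes first. Groups 2 and 3 are the two inner pairs.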

On a side note, if you have control over the file naming, consider using Unix timestamps, so that a single number encodes both date and time unambiguously.
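If you want to see what that would look like, the following sketch converts an extracted time_date string into a Unix timestamp. It assumes the time is HHMM, the date is YYYYMMDD, and both are in UTC. None of that is stated in the question, so adjust the format string and time zone to your data.

```python
from datetime import datetime, timezone

# Assumed layout: HHMM_YYYYMMDD, interpreted as UTC (an assumption, not from the question).
time_date = "0800_20180101"

parsed = datetime.strptime(time_date, "%H%M_%Y%m%d")
unix_ts = int(parsed.replace(tzinfo=timezone.utc).timestamp())

print(parsed.isoformat())  # 2018-01-01T08:00:00
print(unix_ts)             # 1514793600
```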

The key here is that you want everything between the first and the third underscores on each line, so there is no need to worry about designing a regex to match your time and date pattern.

with open('myfile.txt', 'r') as f:
    for line in f:
        x = '_'.join(line.split('_')[1:3])
        print(x)

The problem with your implementation is that you are only capturing the date part of your pattern. If you want to stick with a regex solution then simply move your parentheses to capture the entire pattern you want:

re.search(r"(\d+_\d+)_", "test_0800_20180102_filepath.csv").group(1)

gives:

'0800_20180102'
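One caveat: the pattern above matches the first run of digits it finds, so a prefix that ends in a digit can throw it off. A stricter variant is sketched below, assuming the time is always four digits and the date always eight. The name "run2_0800_20180102_x.csv" is a made-up example, not one from the question.

```python
import re

# Hypothetical file name whose prefix ends in a digit.
filename = "run2_0800_20180102_x.csv"

# The loose pattern picks up the trailing "2" of the prefix.
loose = re.search(r"(\d+_\d+)_", filename).group(1)

# Requiring a leading underscore and fixed widths (assumed: HHMM and YYYYMMDD)
# keeps the match on the intended fields.
strict = re.search(r"_(\d{4}_\d{8})_", filename).group(1)

print(loose)   # 2_0800
print(strict)  # 0800_20180102
```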

This is very easy to do with .split() :

time = filename.split("_")[1]
date = filename.split("_")[2]
