
How to extract the starting time from a string in Python?

I have a CSV with a column that holds the working hours of an institute. The entries are not uniformly formatted, so the column looks like this:

8:30 AM-3:30 PM
9:00  AM - 4:15  PM
08:00 AM-03:00 PM
M, T, W, Th: 7:45 AM-3:05 PM F:  7:45 AM-2:07 PM
8:15/8:45 AM-3:15/3:45 PM

My goal is to find the starting hour for each row, so my expected output would be something like:

Output:
8:30 AM
9:00  AM
08:00 AM
M, T, W, Th: 7:45 AM F:  7:45 AM
8:15/8:45 AM

I have tried using

str.split("AM")

but since the strings are not uniformly formatted, it does not work well in cases like

M, T, W, Th: 7:45 AM-3:05 PM F:  7:45 AM-2:07 PM

As an extension to this, what is a neat way to plot the histogram/distribution of the starting hour? How can I convert the string data in this column to time data and plot it?

I'd use spaCy NER for this task, instead of relying on your own regular expressions.

import spacy


nlp = spacy.load("en_core_web_sm")  # nlp model installed separately

lines = [
    "8:30 AM-3:30 PM",
    "9:00  AM - 4:15  PM",
    "M, T, W, Th: 7:45 AM-3:05 PM F:  7:45 AM-2:07 PM",
]

delimiter = "-"

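for line in lines:
    # Keep only the text before the first "-", i.e. the first opening time.
    # On lines that list several ranges, later start times are dropped.
    start_chunk = line.split(delimiter)[0]
    doc = nlp(start_chunk)
    ents = [(e.text, e.label_) for e in doc.ents]
    print("Extracted ents: {}".format(ents))
    # The model may find no entity at all, so check before indexing.
    if ents:
        print("Start time: {}".format(ents[0][0]))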

The output is as follows:

Extracted ents: [('8:30 AM', 'TIME')]
Start time: 8:30 AM
Extracted ents: [('9:00  AM', 'TIME')]
Start time: 9:00  AM
Extracted ents: [('7:45 AM', 'TIME')]
Start time: 7:45 AM

Hope this helps.

Regex to the rescue. Based on your question, what you want is to remove the closing times. You can do that with the following regex substitution, which replaces every match with an empty string:

import re

re.sub(r'-\s*[0-9]{1,2}:[0-9]{2}(/[0-9]{1,2}:[0-9]{2})?\s*[AP]M', "", data)

It works like this:

- Matches the dash that starts the second half of a #:## AM - #:## PM range

\s* Matches any amount of whitespace (if there is any)

[0-9]{1,2}:[0-9]{2} Matches a one- or two-digit hour followed by two-digit minutes, separated by :

(/[0-9]{1,2}:[0-9]{2})? Matches a second time separated by / if present (the ? makes it optional)

\s* Matches whitespace again

[AP]M Matches AM or PM

For the following input:

data = "8:30 AM-3:30 PM\n9:00  AM - 4:15  PM\n08:00 AM-03:00 PM\nM, T, W, Th: 7:45 AM-3:05 PM F:  7:45 AM-2:07 PM\n8:15/8:45 AM-3:15/3:45 PM"

It outputs:

>>> re.sub(r'-\s*[0-9]{1,2}:[0-9]{2}(/[0-9]{1,2}:[0-9]{2})?\s*[AP]M', "", data)
'8:30 AM\n9:00  AM \n08:00 AM\nM, T, W, Th: 7:45 AM F:  7:45 AM\n8:15/8:45 AM'
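If the pattern is hard to read on one line, it can also be spelled out piece by piece with re.VERBOSE and applied row by row, which is closer to how a CSV column would be processed. This is only a sketch: the names CLOSING_TIME and opening_hours_only are made up for illustration, the hour is written as one or two digits so that entries like 08:00 AM-03:00 PM are covered, and the result is stripped of leftover whitespace.

```python
import re

# Closing-time pattern, written with re.VERBOSE so each piece can be commented.
CLOSING_TIME = re.compile(
    r"""
    -                           # the dash between opening and closing time
    \s*                         # optional whitespace after the dash
    [0-9]{1,2}:[0-9]{2}         # hour (one or two digits), colon, minutes
    (?:/[0-9]{1,2}:[0-9]{2})?   # optional second time after a slash
    \s*                         # optional whitespace before AM/PM
    [AP]M                       # AM or PM
    """,
    re.VERBOSE,
)


def opening_hours_only(entry):
    """Return the entry with every closing time removed (hypothetical helper)."""
    return CLOSING_TIME.sub("", entry).strip()


rows = [
    "8:30 AM-3:30 PM",
    "9:00  AM - 4:15  PM",
    "08:00 AM-03:00 PM",
    "M, T, W, Th: 7:45 AM-3:05 PM F:  7:45 AM-2:07 PM",
    "8:15/8:45 AM-3:15/3:45 PM",
]

for row in rows:
    print(opening_hours_only(row))
```

With a pandas column, the same function could presumably be passed to Series.apply, but that part is not shown here.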

Further reading: re.sub, re docs
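As for the follow-up about turning the strings into time data and looking at their distribution, one possible approach is to parse the first time in each row with datetime.strptime and count rows per hour. The sketch below rests on assumptions that are not in the question: only the first start time of a row is used (so the Friday time and the second half of 8:15/8:45 are ignored), and FIRST_TIME and first_start_time are illustrative names. It prints a text histogram so that it runs with the standard library alone.

```python
import re
from collections import Counter
from datetime import datetime

# First clock time in the row, plus the AM/PM marker that follows it.
# An optional "/H:MM" alternative (as in "8:15/8:45 AM") is skipped over.
FIRST_TIME = re.compile(r"(\d{1,2}:\d{2})(?:/\d{1,2}:\d{2})?\s*([AP]M)")


def first_start_time(entry):
    """Return the first start time of a row as datetime.time, or None."""
    match = FIRST_TIME.search(entry)
    if match is None:
        return None
    clock, meridiem = match.groups()
    return datetime.strptime(f"{clock} {meridiem}", "%I:%M %p").time()


rows = [
    "8:30 AM-3:30 PM",
    "9:00  AM - 4:15  PM",
    "08:00 AM-03:00 PM",
    "M, T, W, Th: 7:45 AM-3:05 PM F:  7:45 AM-2:07 PM",
    "8:15/8:45 AM-3:15/3:45 PM",
]

# Drop rows where no time could be parsed.
start_times = [t for t in map(first_start_time, rows) if t is not None]

# Count how many rows start within each hour of the day.
per_hour = Counter(t.hour for t in start_times)

# Text histogram: one "#" per row that starts in that hour.
for hour in sorted(per_hour):
    print(f"{hour:02d}:00  {'#' * per_hour[hour]}")
```

For a real plot, the per_hour counts could be handed to a plotting library, for example a bar chart in matplotlib; that step is left out here to keep the example dependency-free.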
