I have a dataframe with holiday names. The problem is that some holidays are observed on a different day, sometimes on the day of another holiday. Here are some example values:
1 "Independence Day (Observed)"
2 "Christmas Eve, Christmas Day (Observed)"
3 "New Year's Eve, New Year's Day (Observed)"
4 "Martin Luther King, Jr. Day"
I want to remove every ' (Observed)' and, only when ' (Observed)' is matched, also remove everything before the comma. The output should be:
1 "Independence Day"
2 "Christmas Day"
3 "New Year's Day"
4 "Martin Luther King, Jr. Day"
I was able to do both independently:
(foo['holiday']
    .replace(to_replace=r' \(Observed\)', value='', regex=True)
    .replace(to_replace=r'.+, ', value='', regex=True))
but that caused a problem with 'Martin Luther King, Jr. Day'.
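The failure can be reproduced without pandas. A minimal sketch, assuming plain `re.sub` calls stand in for the two chained `.replace` steps (the helper name `chained_replace` is made up for illustration):

```python
import re

def chained_replace(name):
    """Apply both substitutions unconditionally, like the chained .replace calls."""
    name = re.sub(r' \(Observed\)', '', name)  # drop the " (Observed)" suffix
    return re.sub(r'.+, ', '', name)           # drop everything up to the last ", "

for holiday in ["Christmas Eve, Christmas Day (Observed)",
                "Martin Luther King, Jr. Day"]:
    print(chained_replace(holiday))
```

The second substitution runs whether or not ' (Observed)' was present, so 'Martin Luther King, Jr. Day' is cut down to 'Jr. Day'.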
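import re
# Sample values from the question.
holidays = [
    "Independence Day (Observed)",
    "Christmas Eve, Christmas Day (Observed)",
    "New Year's Eve, New Year's Day (Observed)",
    "Martin Luther King, Jr. Day",
]
for holiday in holidays:
    # Keep only group 2; strings without " (Observed)" do not match and are printed unchanged.
    print(re.sub(r'^(.*?, )?(.*?)( \(Observed\))$', r'\2', holiday))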
> python replace.py
Independence Day
Christmas Day
New Year's Day
Martin Luther King, Jr. Day
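To check what each group in that pattern captures, here is a small sketch (Python 3; the helper name `groups_for` is made up for illustration):

```python
import re

PATTERN = re.compile(r'^(.*?, )?(.*?)( \(Observed\))$')

def groups_for(name):
    """Return the three captured groups, or None when the pattern does not match."""
    match = PATTERN.match(name)
    return match.groups() if match else None

print(groups_for("Christmas Eve, Christmas Day (Observed)"))
print(groups_for("Independence Day (Observed)"))
print(groups_for("Martin Luther King, Jr. Day"))
```

The first group is `None` when there is no comma, and a string without " (Observed)" does not match at all, which is why `re.sub` leaves it alone.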
^ : Match at start of string.
(.*?, )? : Match anything followed by a comma and a space. Make it a lazy match, so it doesn't consume the portion of the string we want to keep. The last ? makes the whole thing optional, because some of the sample input doesn't have a comma at all.
(.*?) : Grab the part we want for later use in a capturing group. This part is also a lazy match because...
( \(Observed\)) : Some strings might have " (Observed)" on the end, so we declare that in a separate group here. The lazy match in the prior piece won't consume this.
$ : Match at end of string.
I suggest
r'^(?:.*,\s*)?\b([^,]+)\s+\(Observed\).*'
Replace with the r'\1' backreference.
See the regex demo.
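As a rough sketch of how this suggestion could be applied in Python 3 (the helper name `strip_observed` and the list `holidays` are made up for illustration; only the pattern and the `\1` replacement come from the answer):

```python
import re

# Pattern and replacement as suggested above.
PATTERN = re.compile(r'^(?:.*,\s*)?\b([^,]+)\s+\(Observed\).*')

def strip_observed(name):
    """Keep only the observed holiday's name; other strings pass through unchanged."""
    return PATTERN.sub(r'\1', name)

holidays = [
    "Independence Day (Observed)",
    "Christmas Eve, Christmas Day (Observed)",
    "New Year's Eve, New Year's Day (Observed)",
    "Martin Luther King, Jr. Day",
]
for holiday in holidays:
    print(strip_observed(holiday))
```

On the dataframe from the question, the same pattern and replacement can be passed to `foo['holiday'].str.replace(pattern, r'\1', regex=True)`.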
Pattern details:
^ - start of string
(?:.*,\s*)? - an optional sequence of:
    .*, - any 0+ chars other than line break chars, as many as possible, up to and including the last , on the line
    \s* - 0 or more whitespaces
\b - a word boundary
([^,]+) - 1 or more chars other than ,
\s+ - 1 or more whitespaces
\(Observed\) - a literal substring (Observed)
.* - any 0+ chars other than line break chars, as many as possible, up to the line end.
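The breakdown above can also be kept next to the pattern itself. A minimal sketch, assuming Python's `re.VERBOSE` flag is acceptable in your code (the name `VERBOSE_PATTERN` is made up for illustration):

```python
import re

VERBOSE_PATTERN = re.compile(r'''
    ^                 # start of string
    (?: .* , \s* )?   # optional: everything up to the last comma, then whitespace
    \b                # word boundary
    ( [^,]+ )         # group 1: one or more non-comma characters (the name to keep)
    \s+               # one or more whitespace characters
    \(Observed\)      # the literal text (Observed)
    .*                # anything else up to the end of the line
''', re.VERBOSE)

print(VERBOSE_PATTERN.sub(r'\1', "Christmas Eve, Christmas Day (Observed)"))
print(VERBOSE_PATTERN.sub(r'\1', "Martin Luther King, Jr. Day"))
```

In verbose mode, whitespace outside character classes is ignored, so the spacing here is for readability only and the pattern should behave like the one-line version.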