how to use positive and negative look ahead for multiple terms in Python?

Question

I have a data frame like as shown below

df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,11,11],
                   'text':['inJECTable 1234 Eprex DOSE 4000 units on NONd',
                           'department 6789 DOSE 8000 units on DIALYSIS days  -  IV Interm',
                           'inJECTable 4321 Eprex DOSE - 3 times/wk on NONdialysis day',
                           'insulin MixTARD  30/70 - inJECTable 46 units',
                           'insulin ISOPHANE -- InsulaTARD  Vial -  inJECTable 56 units  SC SubCutaneous',
                           '1-alfacalcidol DOSE 1 mcg  - 3 times a week  -  IV Intermittent',
                           'jevity liquid - FEEDS PO  Jevity  -  237 mL  -  1 times per day',
                           '1-alfacalcidol DOSE 1 mcg  - 3 times per week  -  IV Intermittent',
                           '1-supported DOSE 1 mcg  - 1 time/day  -  IV Intermittent',
                           '1-testpackage DOSE 1 mcg  - 1 time a day  -  IV Intermittent']})

I would like to remove the words/strings which follow patterns such as 46 units , 3 times a week , 3 times per week , 1 time/day etc.

I was reading about positive and negative look ahead and behind.

So, was trying something like below

[^([0-9\s]*(?=units))]  #to remove terms like `46 units` from the string
[^[0-9\s]*(?=times)(times a day)] # don't know how to make this work for all time variants

time variants ex: 3 times a day , 3 time/wk , 3 times per day , 3 times a month , 3 times/month etc.

Basically, I expect my output to be something like below (remove terms like xx units, xx time a day, xx times per week, xx time/day, xx time/wk, xx time/week, xx times per week, etc)

Answer 1

You can consider a pattern like

\s*\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))

See the regex demo

NOTE : the \\d+ matches one or more digits. If you need to match any number, please consider using other patterns for a number in the format you expect, see regular expression for finding decimal/float numbers? , for example.

Pattern details

\\s* - zero or more whitespace chars
\\d+ - one or more digits
\\s* - zero or more whitespaces
(?:units?|times?(?:\\s+(?:a|per)\\s+|\\s*/\\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?)) - a non-capturing group matching:
- units? - unit or units
- | - or
- times? - time or times
- (?:\\s+(?:a|per)\\s+|\\s*/\\s*) - a or per enclosed with 1+ whitespaces, or / enclosed with 0+ whitespaces
- (?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?) - d or day , or wk or week , or month , or y / yea / yr

If you need to match whole words only, use word boundaries, \\b :

\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b

In Pandas, use

df['text'] = df['text'].str.replace(r'\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b', '')

how to use positive and negative look ahead for multiple terms in Python?

Question

1 answers

solution1
3 ACCPTED 2020-10-01 14:14:23

how to use positive and negative look ahead for multiple terms in Python?

Question

1 answers

solution1 3 ACCPTED 2020-10-01 14:14:23

solution1
3 ACCPTED 2020-10-01 14:14:23