简体   繁体   中英

how to use positive and negative look ahead for multiple terms in Python?

I have a data frame like as shown below

df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,11,11],
                   'text':['inJECTable 1234 Eprex DOSE 4000 units on NONd',
                           'department 6789 DOSE 8000 units on DIALYSIS days  -  IV Interm',
                           'inJECTable 4321 Eprex DOSE - 3 times/wk on NONdialysis day',
                           'insulin MixTARD  30/70 - inJECTable 46 units',
                           'insulin ISOPHANE -- InsulaTARD  Vial -  inJECTable 56 units  SC SubCutaneous',
                           '1-alfacalcidol DOSE 1 mcg  - 3 times a week  -  IV Intermittent',
                           'jevity liquid - FEEDS PO  Jevity  -  237 mL  -  1 times per day',
                           '1-alfacalcidol DOSE 1 mcg  - 3 times per week  -  IV Intermittent',
                           '1-supported DOSE 1 mcg  - 1 time/day  -  IV Intermittent',
                           '1-testpackage DOSE 1 mcg  - 1 time a day  -  IV Intermittent']})

I would like to remove the words/strings which follow patterns such as 46 units , 3 times a week , 3 times per week , 1 time/day etc.

I was reading about positive and negative look ahead and behind.

So, was trying something like below

[^([0-9\s]*(?=units))]  #to remove terms like `46 units` from the string
[^[0-9\s]*(?=times)(times a day)] # don't know how to make this work for all time variants

time variants ex: 3 times a day , 3 time/wk , 3 times per day , 3 times a month , 3 times/month etc.

Basically, I expect my output to be something like below (remove terms like xx units, xx time a day, xx times per week, xx time/day, xx time/wk, xx time/week, xx times per week, etc)

在此处输入图片说明

You can consider a pattern like

\s*\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))

See the regex demo

NOTE : the \\d+ matches one or more digits. If you need to match any number, please consider using other patterns for a number in the format you expect, see regular expression for finding decimal/float numbers? , for example.

Pattern details

  • \\s* - zero or more whitespace chars
  • \\d+ - one or more digits
  • \\s* - zero or more whitespaces
  • (?:units?|times?(?:\\s+(?:a|per)\\s+|\\s*/\\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?)) - a non-capturing group matching:
    • units? - unit or units
    • | - or
    • times? - time or times
    • (?:\\s+(?:a|per)\\s+|\\s*/\\s*) - a or per enclosed with 1+ whitespaces, or / enclosed with 0+ whitespaces
    • (?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?) - d or day , or wk or week , or month , or y / yea / yr

If you need to match whole words only, use word boundaries, \\b :

\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b

In Pandas, use

df['text'] = df['text'].str.replace(r'\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b', '')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM