
Regular expression to extract the first part of a string

I have the following list of phrases:

[
  'This is erleada comp. recub. con película 60 mg.',
  'This is auxina e-200 uicaps. blanda 200 mg.',
  'This is ephynalsol. iny. 100 mg.',
  'This is paracethamol 100 mg.'
]

I need to get the following result:

[
  'This is erleada.',
  'This is auxina.',
  'This is ephynalsol.',
  'This is paracethamol.'
]

I wrote the following function to clean phrases:

def clean(string):
    sub_strings = [".", "iny", "comp", "uicaps"]
    try:
        # Cut the phrase at each marker; str.index raises ValueError if a marker is missing
        string = [string[:string.index(sub_str)].rstrip() for sub_str in sub_strings]
        return string
    except ValueError:
        return string

and use it as follows:

for phrase in phrases:
    drug = clean(phrase)

This should do it:

import re

phrases = [
  'This is erleada comp. recub. con película 60 mg.',
  'This is auxina e-200 uicaps. blanda 200 mg.',
  'This is ephynalsol. iny. 100 mg.',
  'This is paracethamol 100 mg.'
]

pattern = re.compile(r"^This is \w*")  # raw string so \w is not treated as a string escape

for phrase in phrases:
    match = pattern.search(phrase)
    print(match.group(0) + ".")

Outputs:

This is erleada.
This is auxina.
This is ephynalsol.
This is paracethamol.

Explanation: the regex pattern used here is ^This is \w* . Here is how it works.

  • ^ means the start of the line, so ^This is means the line must start with This is .
  • \w matches a single character in the ranges a-z , A-Z , 0-9 , or _ (in Python 3, with str patterns, it also matches other Unicode letters and digits, such as í).
  • * stands for zero or more of the preceding item, so \w* matches the whole run of consecutive characters that each satisfy \w .
  • In a nutshell: ^This is means the line starts with This is , and \w* matches the word that follows. Since spaces, commas, hyphens, and full stops do not match \w , matching stops at the first such character and returns something like This is something .
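To see where \w* stops in practice, here is a minimal sketch that wraps the same pattern in a helper. The function name first_part is made up for illustration, and returning None for phrases that do not start with This is is an assumption about what you would want in that case:

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = re.compile(r"^This is \w*")

def first_part(phrase):
    """Return the matched prefix plus a period, or None if there is no match.

    first_part is a hypothetical helper name, not part of the original answer.
    """
    match = PATTERN.search(phrase)
    if match is None:
        # The phrase does not start with "This is ".
        return None
    return match.group(0) + "."

print(first_part("This is auxina e-200 uicaps. blanda 200 mg."))  # This is auxina.
print(first_part("This is ephynalsol. iny. 100 mg."))             # This is ephynalsol.
print(first_part("This is película 60 mg."))                      # This is película.
print(first_part("Nothing to see here"))                          # None
```

The check for None matters because pattern.search returns None when the line does not start with This is , and calling .group(0) on None would raise an AttributeError.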

You could obtain the same results with slicing:

phrases=[
  'This is erleada comp. recub. con película 60 mg.',
  'This is auxina e-200 uicaps. blanda 200 mg.',
  'This is ephynalsol. iny. 100 mg.',
  'This is paracethamol 100 mg.'
]

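# Keep the first three whitespace-separated words of each phrase
drug = [" ".join(phrase.split()[:3]) for phrase in phrases]
# Append a period unless the result already ends with one
drug = [sentence if sentence.endswith(".") else sentence + "." for sentence in drug]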

The code takes the first three words of each sentence, joins them back into a string, and appends a period if the result does not already end with one. That said, the regex solution given earlier is cleaner.
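One way to see the difference between the two approaches is to run both on a phrase where punctuation is attached to the third word. The sketch below uses a made-up sample phrase and two hypothetical helper names, by_slicing and by_regex, neither of which comes from the original answers:

```python
import re

def by_slicing(phrase):
    # Keep the first three whitespace-separated words, then ensure a trailing period.
    head = " ".join(phrase.split()[:3])
    return head if head.endswith(".") else head + "."

def by_regex(phrase):
    # Keep "This is " plus the run of word characters that follows it.
    match = re.search(r"^This is \w*", phrase)
    return match.group(0) + "." if match else None

sample = "This is auxina, 200 mg."  # hypothetical phrase, not from the original list

print(by_slicing(sample))  # This is auxina,.
print(by_regex(sample))    # This is auxina.
```

Splitting on whitespace keeps the comma glued to the third word, while \w* stops before it, which is why the regex copes better with input that is not as tidy as the four sample phrases.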


 