
Regular expression <.*?>

I'm new to Python and have been looking at text cleaning examples such as this one on Kaggle, and I have a few questions about the stemming and stopwords part.

for sentence in final_X:
  sentence = sentence.lower()                 # Converting to lowercase
  cleanr = re.compile('<.*?>')
  sentence = re.sub(cleanr, ' ', sentence)        #Removing HTML tags
  sentence = re.sub(r'[?|!|\'|"|#]',r'',sentence)
  sentence = re.sub(r'[.|,|)|(|\|/]',r' ',sentence)        #Removing Punctuations

What does cleanr = re.compile('<.*?>') mean? I understand that it is used to remove HTML tags, but I'm confused as to what <.*?> means.

<.*?> is called a regular expression pattern. This pattern in particular looks for a <, then any number of characters (.*), followed by a >. The period . denotes "any character" (by default, any character except a newline) and the asterisk * denotes "0 or more times". The question mark ? after the * makes the quantifier lazy, which means it consumes as few characters as possible. This is important for a case like this:

<p> 2 > 1 </p>

Without the question mark, the pattern is greedy and matches the entire string, from the first < to the last >: <p> 2 > 1 </p>

With the question mark, the match stops at the FIRST > that follows each <, so the pattern matches only <p> and </p> and leaves 2 > 1 untouched, which is the desired result.
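
The difference can be checked directly in the interpreter. This is a minimal sketch using only the standard re module and the example string from above; the variable names are chosen for illustration.

```python
import re

text = "<p> 2 > 1 </p>"

# Greedy: .* grabs as much as possible, so the match runs to the LAST '>'.
greedy = re.findall(r"<.*>", text)

# Lazy: .*? grabs as little as possible, so each match stops at the FIRST '>'.
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<p> 2 > 1 </p>']
print(lazy)    # ['<p>', '</p>']

# Substituting with the lazy pattern keeps the text between the tags.
kept = re.sub(r"<.*?>", " ", text)
print(repr(kept))  # '  2 > 1  '
```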

This is a regex pattern. The pattern <.*?> means a string that starts with <, followed by any character (coded by the dot) repeated any number of times (coded by the star), and finally >. The question mark after the star makes it lazy, that is, the match stops at the first > encountered after <.
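
As a rough illustration of what this does in practice, the sketch below applies the pattern to a made-up sample string (not from the Kaggle example). It also shows one caveat worth knowing: by default the dot does not match a newline, so a tag that spans two lines is left alone unless the re.DOTALL flag is passed.

```python
import re

# Hypothetical sample input, for illustration only.
html = "<b>Great</b> product,<br/>would buy again"
stripped = re.sub(r"<.*?>", " ", html)
print(repr(stripped))  # ' Great  product, would buy again'

# A tag broken across two lines: '.' stops at the newline by default,
# so only the closing </a> is matched.
multi = "<a\nhref='x'>link</a>"
partial = re.sub(r"<.*?>", " ", multi)
print(repr(partial))  # "<a\nhref='x'>link "

# With re.DOTALL the dot also matches newlines, so both tags are removed.
full = re.sub(r"<.*?>", " ", multi, flags=re.DOTALL)
print(repr(full))  # ' link '
```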

See https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285 to learn more about regular expressions.

Actually, this can be written as

sentence = re.sub('<.*?>', ' ', sentence)

instead of

cleanr = re.compile('<.*?>')
sentence = re.sub(cleanr, ' ', sentence)

Otherwise, cleanr = re.compile('<.*?>') should be moved before the for loop, so that the pattern is compiled once instead of on every iteration.
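
A minimal sketch of the second option is shown below. The contents of final_X are hypothetical stand-ins for the data in the question, and the cleaned list is an assumption added so that the results are kept somewhere. Python's re module also caches recently compiled patterns, so the speed gain is likely small; hoisting the compile mostly makes the intent clearer.

```python
import re

# Hypothetical sample data standing in for final_X from the question.
final_X = ["<p>Loved it!</p>", "Not <b>great</b>, not terrible."]

# Compile the pattern once, before the loop.
cleanr = re.compile(r"<.*?>")

cleaned = []
for sentence in final_X:
    sentence = sentence.lower()           # convert to lowercase
    sentence = cleanr.sub(" ", sentence)  # same effect as re.sub(cleanr, " ", sentence)
    cleaned.append(sentence)

print(cleaned)  # [' loved it! ', 'not  great , not terrible.']
```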

