I'm new to Python and have been looking at text cleaning examples such as this one on Kaggle, and I have a few questions about the stemming and stopwords part.
import re

for sentence in final_X:
    sentence = sentence.lower()  # Converting to lowercase
    cleanr = re.compile('<.*?>')
    sentence = re.sub(cleanr, ' ', sentence)  # Removing HTML tags
    sentence = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    sentence = re.sub(r'[.|,|)|(|\|/]', r' ', sentence)  # Removing punctuation
What does cleanr = re.compile('<.*?>') mean? I understand that it is used to remove HTML tags, but I'm confused as to what <.*?> means.
<.*?> is called a regular expression pattern. This pattern in particular looks for a <, then any number of characters (.*), followed by a >. The period (.) denotes "any character" (except a newline, by default) and the asterisk (*) denotes "0 or more times". The question mark (?) after the * is the lazy quantifier, which indicates that it should consume as few characters as possible. This is important for a case like this:
<p> 2 > 1 </p>
If the question mark were not there, the pattern would be greedy and match the whole string: <p> 2 > 1 </p>
Whereas if the question mark is there, the match stops at the FIRST > that follows a <, so it matches <p> and </p> separately and leaves 2 > 1 untouched, which is the desired result.
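To see the difference described above, here is a small sketch that runs both the greedy and the lazy pattern on the same example string. The variable names (text, greedy, lazy, cleaned) are only illustrative and are not part of the original Kaggle code.

```python
import re

text = "<p> 2 > 1 </p>"

# Greedy: .* grabs as much as possible, so the match runs to the LAST '>'
greedy = re.findall(r"<.*>", text)

# Lazy: .*? grabs as little as possible, so each match ends at the FIRST '>'
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<p> 2 > 1 </p>']
print(lazy)    # ['<p>', '</p>']

# Replacing each lazy match with a space keeps the "2 > 1" content
cleaned = re.sub(r"<.*?>", " ", text)
print(repr(cleaned))  # '  2 > 1  '
```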
This is a regex pattern. The pattern <.*?> means a string with <, followed by any character (denoted by the dot) repeated any number of times (denoted by the star), and finally >. The question mark after the star makes it lazy; that is, the match stops at the first > encountered after <.
See https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285 to learn more about regular expressions.
Actually, it can be
sentence = re.sub('<.*?>', ' ', sentence)
instead of
cleanr = re.compile('<.*?>')
sentence = re.sub(cleanr, ' ', sentence)
Otherwise, cleanr = re.compile('<.*?>') should be moved before the for loop, so that the pattern is compiled once rather than on every iteration.
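As a rough sketch of that suggestion, the pattern can be compiled once outside the loop and reused. The names TAG_RE and strip_tags, and the sample contents of final_X, are hypothetical and chosen only for illustration.

```python
import re

# Compile once, outside the loop, and reuse the pattern object
TAG_RE = re.compile(r"<.*?>")


def strip_tags(sentences):
    """Lowercase each sentence and replace HTML tags with a space."""
    return [TAG_RE.sub(" ", s.lower()) for s in sentences]


# Hypothetical sample data standing in for final_X
final_X = ["<b>Hello</b> World", "No tags here"]
result = strip_tags(final_X)
print(result)  # [' hello  world', 'no tags here']
```

In practice the speed difference is usually small, because the re module keeps an internal cache of recently compiled patterns, but compiling up front makes the intent explicit.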