Regular expression <.*?>
I'm new to Python and have been looking at text-cleaning examples such as this one on Kaggle. I have a few questions about the stemming and stopwords part.
import re

for sentence in final_X:
    sentence = sentence.lower()  # Converting to lowercase
    cleanr = re.compile('<.*?>')
    sentence = re.sub(cleanr, ' ', sentence)  # Removing HTML tags
    sentence = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    sentence = re.sub(r'[.|,|)|(|\|/]', r' ', sentence)  # Removing punctuation
What does cleanr = re.compile('<.*?>') mean? I understand that it is used to remove HTML tags, but I'm confused as to what <.*?> means.

<.*?> is called a regular expression pattern. This pattern in particular looks for a < and then any number of characters .* followed by a >. The period . denotes "any character" (by default, any character except a newline) and the asterisk * denotes "0 or more times". The question mark ? after the * makes the quantifier lazy, which means it consumes as few characters as possible. This is important for a case like this:
<p> 2 > 1 </p>
If the question mark were not there, the pattern would be greedy and match the whole string, from the first < to the last >, so 2 > 1 would be removed along with the tags.
Whereas if the question mark is there, each match stops at the FIRST > that follows a <, so the pattern matches <p> and </p> separately and leaves 2 > 1 untouched, which is the desired result.
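To make the difference concrete, here is a small sketch that runs both versions of the pattern on the example string. The variable names (html, greedy, lazy) are only for illustration and are not from the Kaggle code.

```python
import re

html = "<p> 2 > 1 </p>"

# Greedy: .* grabs as much as possible, so the match runs to the LAST '>'
greedy = re.findall(r"<.*>", html)

# Lazy: .*? grabs as little as possible, so each match ends at the FIRST '>'
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<p> 2 > 1 </p>']
print(lazy)    # ['<p>', '</p>']
```

With the lazy pattern, re.sub replaces only the two tags and keeps the text between them.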
This is a regex pattern. The pattern <.*?> means a string with < followed by any character (coded by the dot) repeated any number of times (coded by the star) and finally >. The question mark after the star makes it lazy, that is, the match stops at the first > encountered after <.
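One way to read the pattern piece by piece is to write it in verbose mode, where whitespace inside the pattern is ignored and comments are allowed. This is just an illustrative sketch; the name tag and the sample string are made up for the example.

```python
import re

# Same pattern as '<.*?>', spread out with one comment per piece.
tag = re.compile(r"""
    <      # a literal opening angle bracket
    .*?    # any characters (except newline), as few as possible
    >      # a literal closing angle bracket
""", re.VERBOSE)

result = tag.sub(" ", "<b>Hi</b> there")
print(repr(result))  # ' Hi  there'
```

Each tag is replaced by a single space, which is why two spaces end up between the words.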
See https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285 to learn more about regular expressions.
Actually, it can be
sentence = re.sub('<.*?>', ' ', sentence)
instead of
cleanr = re.compile('<.*?>')
sentence = re.sub(cleanr, ' ', sentence)
Otherwise, cleanr = re.compile('<.*?>') should be placed before the for loop, so that the pattern is compiled once rather than on every iteration.
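A minimal sketch of the second option, with the pattern compiled once before the loop. The contents of final_X and the cleaned list are hypothetical stand-ins, since the question does not show the real data. Note that the re module also caches recently used compiled patterns, so the speed-up is usually small; hoisting the compile mostly makes the intent clearer.

```python
import re

# Hypothetical sample data standing in for final_X from the question.
final_X = ["<p>Hello, World!</p>", "<b>Regex</b> is fun"]

cleanr = re.compile(r"<.*?>")  # compiled once, before the loop

cleaned = []
for sentence in final_X:
    sentence = sentence.lower()           # converting to lowercase
    sentence = cleanr.sub(" ", sentence)  # removing HTML tags
    cleaned.append(sentence)

print(cleaned)
```

Calling cleanr.sub(...) on the compiled pattern is equivalent to re.sub(cleanr, ...) in the original code.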