Regular expression <.*?>

Question

I'm new to Python and have been looking at text cleaning examples such as this one on Kaggle. I have a few questions on the stemming and stopwords part.

import re

for sentence in final_X:
  sentence = sentence.lower()                          # Convert to lowercase
  cleanr = re.compile('<.*?>')
  sentence = re.sub(cleanr, ' ', sentence)             # Remove HTML tags
  sentence = re.sub(r'[?|!|\'|"|#]', r'', sentence)    # Delete these punctuation marks
  sentence = re.sub(r'[.|,|)|(|\|/]', r' ', sentence)  # Replace these punctuation marks with a space

What does cleanr = re.compile('<.*?>') mean? I understand that it is used to remove HTML tags, but I'm confused as to what <.*?> means.

Answer 1

<.*?> is called a regular expression pattern. This pattern in particular looks for a < and then any number of characters .* followed by a >. The period . denotes "any character" (except a newline, by default) and the asterisk * denotes "0 or more times". The question mark ? after the * is the lazy quantifier, which indicates that it should consume as few characters as possible. This is important for a case like this:

 2 > 1 

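If the question mark was not there, the pattern would match everything from the first < to the last > on the line, so text such as  2 > 1  sitting between an opening and a closing tag would be removed along with the tags.

Whereas if the question mark is there, it looks for the FIRST > that follows a <, so each tag is matched separately and the text between them, such as  2 > 1 , is kept, which is the desired result.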
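
To make the greedy versus lazy difference concrete, here is a minimal sketch. The sample string and its <b> tags are made up for illustration (they are an assumption, not taken from the answer above); only re.findall and re.sub from the standard library are used.

```python
import re

# Hypothetical sample: text containing a ">" wrapped in a pair of tags.
sample = "<b> 2 > 1 </b>"

# Greedy: .* grabs as much as possible, so the match runs to the LAST ">".
greedy_matches = re.findall(r"<.*>", sample)

# Lazy: .*? grabs as little as possible, so each match ends at the FIRST ">".
lazy_matches = re.findall(r"<.*?>", sample)

# Replacing matches with a space, as the question's code does.
greedy_cleaned = re.sub(r"<.*>", " ", sample)
lazy_cleaned = re.sub(r"<.*?>", " ", sample)

print(greedy_matches)        # the whole string is one match
print(lazy_matches)          # the two tags are matched separately
print(repr(greedy_cleaned))  # everything is gone
print(repr(lazy_cleaned))    # the text between the tags survives
```

With the greedy pattern the whole sample is treated as one "tag" and replaced, while the lazy pattern removes only the two tags and leaves 2 > 1 in place.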

Answer 2

This is a regex pattern. The pattern <.*?> means a string with < followed by any character (denoted by the dot) repeated any number of times (denoted by the star) and finally >. The question mark after the star makes it lazy, that is, the match stops at the first > encountered after <.

See https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285 to learn more about regular expressions.

Actually, it can be

sentence = re.sub('<.*?>', ' ', sentence)

instead of

cleanr = re.compile('<.*?>')
sentence = re.sub(cleanr, ' ', sentence)

or else cleanr = re.compile('<.*?>') should be placed before the for loop, so that the pattern is compiled only once instead of on every iteration.
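
A minimal sketch of the second suggestion, with the pattern compiled once before the loop. The function name strip_tags, the constant TAG_RE, and the sample contents of final_X are hypothetical and only there to make the snippet runnable.

```python
import re

# Compiled once, outside the loop, and reused for every sentence.
TAG_RE = re.compile(r"<.*?>")


def strip_tags(sentences):
    """Lowercase each sentence and replace every tag with a space."""
    cleaned = []
    for sentence in sentences:
        sentence = sentence.lower()
        sentence = TAG_RE.sub(" ", sentence)
        cleaned.append(sentence)
    return cleaned


# Hypothetical input standing in for final_X from the question.
final_X = ["<p>Hello World</p>", "No tags here"]
result = strip_tags(final_X)
print(result)
```

In practice the re module keeps a cache of recently compiled patterns, so the speed gain is usually small. Hoisting the compile call mainly makes it clear that the pattern does not change between iterations.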
