
Regular expression <.*?>

I'm new to Python and have been looking at text cleaning examples such as this one on Kaggle. I have a few questions about the stemming and stopwords part.

import re

for sentence in final_X:
  sentence = sentence.lower()                          # Convert to lowercase
  cleanr = re.compile('<.*?>')
  sentence = re.sub(cleanr, ' ', sentence)             # Remove HTML tags
  sentence = re.sub(r'[?|!|\'|"|#]', r'', sentence)    # Delete these punctuation characters
  sentence = re.sub(r'[.|,|)|(|\|/]', r' ', sentence)  # Replace these punctuation characters with a space

What does cleanr = re.compile('<.*?>') mean? I understand that it is used to remove HTML tags, but I'm confused as to what <.*?> means.

<.*?> is called a regular expression pattern. This pattern in particular looks for a < and then any number of characters .* followed by a >. The period . denotes "any character" (except a newline, by default) and the asterisk * denotes "0 or more times". The question mark ? after the * is the lazy quantifier, which indicates that it should consume as few characters as possible. This is important for a case like this:

<p> 2 > 1 </p>

If the question mark were not there, the pattern would be greedy and match the whole string, from the first < to the last >: <p> 2 > 1 </p>

Whereas if the question mark is there, the match stops at the FIRST > that follows a <, so the pattern matches only <p> and </p> and leaves the text 2 > 1 between them untouched, which is the desired result.
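
The difference between the two forms can be checked directly. This is a minimal sketch using re.findall on the example string above; the variable names are only illustrative.

```python
import re

text = "<p> 2 > 1 </p>"

# Greedy: .* consumes as much as possible, so the match runs to the LAST ">"
greedy = re.findall(r"<.*>", text)

# Lazy: .*? consumes as little as possible, so each match ends at the FIRST ">"
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<p> 2 > 1 </p>']
print(lazy)    # ['<p>', '</p>']
```

Note that the > in 2 > 1 is never matched by the lazy pattern, because a match can only start at a <.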

This is a regex pattern. The pattern <.*?> means a string that starts with <, followed by any character (denoted by the dot) repeated any number of times (denoted by the star), and finally >. The question mark after the star makes it lazy, that is, the match stops at the first > encountered after the <.
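
As a rough illustration of what this does when used with re.sub, as in the question, here is a small sketch. The sample string is made up for the example and does not come from the Kaggle notebook.

```python
import re

# Hypothetical input; any short HTML-like string would do
sample = "<b>Hello</b> world<br/>"

# Each tag-like span (from "<" to the nearest ">") is replaced by one space
cleaned = re.sub(r"<.*?>", " ", sample)

print(repr(cleaned))  # ' Hello  world '
```

Because every tag becomes a space, the result can contain leading, trailing, or doubled spaces, which may need a later whitespace cleanup step.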

See https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285 to learn more about regular expressions.

Actually, it can simply be

sentence = re.sub('<.*?>', ' ', sentence)

instead of

cleanr = re.compile('<.*?>')
sentence = re.sub(cleanr, ' ', sentence)

Otherwise, cleanr = re.compile('<.*?>') should be placed before the for loop, so that the pattern is compiled once rather than on every iteration.
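
A minimal sketch of the second option, with the pattern compiled once outside the loop. The contents of final_X here are invented sample data, and the cleaned list is an assumption added so the results are kept somewhere.

```python
import re

# Hypothetical sample data standing in for final_X from the question
final_X = ["<p>Hello World</p>", "No tags here", "<br/>Line"]

# Compile the pattern once, before the loop
cleanr = re.compile(r"<.*?>")

cleaned = []
for sentence in final_X:
    sentence = sentence.lower()           # convert to lowercase
    sentence = cleanr.sub(" ", sentence)  # replace each tag with a space
    cleaned.append(sentence)

print(cleaned)  # [' hello world ', 'no tags here', ' line']
```

In practice the speed difference is usually small, since the re module keeps a cache of recently compiled patterns, but compiling once keeps the loop body tidier.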
