
Regex to match attributes in HTML?

I have a txt file that is actually the HTML source of a web page. Inside that file there are various strings preceded by a "title=" attribute, e.g.

<div id='UWTDivDomains_5_6_2_2'  title='Connectivity Framework'> 

I want to extract the text Connectivity Framework and write it to a separate file.

There are many such tags, each with different text after title='some text here which I need to extract'. I want to extract every such instance from the HTML source/txt file and write it to a separate txt file. The text can contain only lowercase letters, uppercase letters and numbers. The length of each text string (in characters) will vary.

I am using PowerGrep for Windows. PowerGrep lets me search a text file with a regular expression as input. I tried searching with title='[a-zA-Z0-9]

It finds the right places, but it matches only the first character of each string and writes only that first character to the second txt file, not the whole string.

I want the whole string to be matched and written to the second file.

What is the correct regular expression, or the correct way, to do this using PowerGrep?

-AD.

I'm just not sure how many times the question of parsing HTML files with regular expressions has to be asked (and answered with the correct solution of "use a DOM parser"). It comes up every day.

The difficulties are:

  • In HTML, attributes can have single quotes, double quotes or even no quotes;
  • Similar strings can appear in the text of the HTML document itself;
  • You have to handle escaping correctly; and
  • Malformed HTML (decent parsers are extremely robust to common errors).

So even if you cater for all this (and it gets to be a pretty complicated yet still imperfect regex), it's still not 100% reliable.

HTML parsers exist for a reason. Use them.
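As a rough illustration of the parser route, here is a minimal sketch using Python's standard-library html.parser module. Python is a choice made for this example, not something the question uses, and the names TitleCollector and extract_titles are made up for it. The parser hands over attribute values with the quotes already stripped and entities already decoded, whichever quoting style the page used.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the value of every title attribute seen in a start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quotes are already removed
        # and character references such as &amp; are already decoded.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


def extract_titles(html_text):
    """Return all title attribute values found in html_text, in order."""
    parser = TitleCollector()
    parser.feed(html_text)
    parser.close()
    return parser.titles


# Hypothetical sample covering single-quoted, double-quoted and unquoted values.
sample = (
    "<div id='UWTDivDomains_5_6_2_2'  title='Connectivity Framework'>"
    '<span title="Second &amp; third">x</span>'
    "<p title=Unquoted>y</p>"
)
print(extract_titles(sample))
```

Writing the result to a second file is then just a matter of joining the list with newlines and saving it.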

The other answers all give correct changes to the regex, so I'll explain what the issue was with your original.

The square brackets indicate a character class, meaning that the regex will match any one character listed within those brackets. However, like everything else, it will only match once by default. Just as the regex "s" would match only the first character in "ssss", the regex "[a-zA-Z0-9]" will match only the first character in "Connectivity Framework".

By adding repetition, you can get that character class to match repeatedly. The easiest way to do this is to add an asterisk after it (which matches 0 or more occurrences). Thus the regex "[a-zA-Z0-9]*" will match as many characters in a row as it can until it hits a character that is not in the character class (in your case the space character, since you didn't include it in your brackets).
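The effect of the asterisk can be checked quickly outside PowerGrep. The sketch below uses Python's re module purely for illustration (PowerGrep's own regex flavor may differ in other details) and runs three variants of the pattern against the example attribute.

```python
import re

text = "title='Connectivity Framework'"

# No repetition: the character class matches exactly one character.
one_char = re.search(r"title='[a-zA-Z0-9]", text).group(0)

# With *: matching continues until the first character outside the class,
# which here is the space after "Connectivity".
with_star = re.search(r"title='[a-zA-Z0-9]*", text).group(0)

# With a space added to the class, the whole value is matched.
with_space = re.search(r"title='[a-zA-Z0-9 ]*", text).group(0)

print(one_char)    # title='C
print(with_star)   # title='Connectivity
print(with_space)  # title='Connectivity Framework
```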

Regexes can get pretty complex if they are to describe the syntax accurately, though: what if someone put a non-alphanumeric character such as an ampersand within the attribute? You could try to capture all input between the quotes by making the character class "anything except a quote character", so "'[^']*'" would usually do the right thing. Often you need to bear escaping in mind as well (e.g. with a string 'Mary\'s lamb' you do actually want to capture the apostrophe in the middle, so a simple "everything but apostrophes" character class won't cut it), though thankfully this is not an issue with XML/HTML according to the specs.

Still, if there is an existing library available that will do the extraction for you, it is likely to be faster and more correct than rolling your own, so I would lean towards that if possible.

I'm not familiar with PowerGrep; however, your regex is incomplete. Try this:

title='[a-zA-Z0-9 ]*'

or better yet:

title='([^']*)'
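The parentheses in the second pattern form a capture group, so a tool can return just the text between the quotes instead of the whole match. Here is a minimal sketch of that idea in Python (not PowerGrep). The function names and the file paths passed to write_titles are hypothetical, and the sketch assumes every title value is single-quoted, as in the question.

```python
import re

# Group 1 captures everything between the single quotes.
TITLE_RE = re.compile(r"title='([^']*)'")


def titles_from_text(html_text):
    """Return the captured text of every title='...' occurrence."""
    # With exactly one capture group, findall returns the group contents.
    return TITLE_RE.findall(html_text)


def write_titles(src_path, dst_path):
    """Read the HTML source from src_path and write one title per line."""
    with open(src_path, encoding="utf-8") as src:
        titles = titles_from_text(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(titles))


sample = "<div id='UWTDivDomains_5_6_2_2'  title='Connectivity Framework'><b title='X1'>"
print(titles_from_text(sample))
```

A call such as write_titles("source.txt", "titles.txt") would then produce the separate file the question asks for.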

I would use this regular expression to get the title attribute values:

<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)

Note that this regex matches the attribute value together with its quotes, so you have to remove them if needed.
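The quote removal can be done in a small post-processing step. The following Python sketch applies the regex above unchanged and strips a matching pair of surrounding quotes from each captured value. The helper name title_values is made up for this example. As written, the pattern only recognizes lowercase tag names and finds at most one title per tag.

```python
import re

# The pattern from the answer; group 1 is the raw value, quotes included.
ATTR_RE = re.compile(r"""<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)""")


def title_values(html_text):
    """Return title attribute values with any surrounding quotes removed."""
    values = []
    for raw in ATTR_RE.findall(html_text):
        # Strip the quotes only when the value starts and ends with the same one.
        if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
            raw = raw[1:-1]
        values.append(raw)
    return values


sample = "<div id='UWTDivDomains_5_6_2_2'  title='Connectivity Framework'><span title=\"B c\"><p title=D>"
print(title_values(sample))
```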

Here's the regex you need:

title='([a-zA-Z0-9]+)'

But if you're going to be doing a lot more stuff like this, using a parser might make it much more robust and useful.

Try this:

title=\'[a-zA-Z0-9]*\'
