Remove unwanted patterns from a Python regex match on an HTML string

Question

I have to parse a custom VIN number from HTML.

I also get a few wrongly matched numbers.

 .....
 <meta name="google-site-verification" content="l4du7Ao9MH6TM2nJ5L54qqWoXWcuOkdeqhXCADgKErc" />
 <meta name="msvalidate.01" content="FAD32C2469C51767894EB50068D37244" />
     .....
  <div class="hproduct auto chevrolet" data-classification="primary" data- vin="3GNDA23D18S647673" data-make="Chevrolet" >

 .....
 </dd></dl><dl class='vin'><dt>VIN:</dt><dd>3GNDA23D18S647673</dd></dl> <span 
 ....... etc....

This is the piece of HTML that contains the required portion.

When I apply my regex in Python:

import re
re.findall("([0-9A-Z]{8}[0-9xX]{1}[1-9A-Y^U]{1}[0-9A-Z]{2}[0-9]{5})",html)

I get the required results along with unwanted data such as:

['FAD32C2469C517678',
 '3GNDA23D18S647673',
 '3GNDA23D18S647673']

FAD32C2469C517678 is the unwanted one.

How can I get rid of this unwanted pattern in my Python regex?

Answer 1

Please use a parser:

import lxml.html as lh
doc=lh.fromstring(html)
doc.xpath('.//@vin')

Output:

["3GNDA23D18S647673"]
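If lxml is not installed, roughly the same idea can be sketched with the standard library's html.parser. This is only a minimal sketch, not the method from the answer above: the names VinAttrCollector and extract_vins are made up for illustration, and it assumes the VIN sits in an attribute named vin or data-vin.

```python
from html.parser import HTMLParser


class VinAttrCollector(HTMLParser):
    """Collects the values of attributes named 'vin' or 'data-vin' (assumed names)."""

    def __init__(self):
        super().__init__()
        self.vins = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name in ("vin", "data-vin") and value:
                self.vins.append(value)


def extract_vins(html):
    parser = VinAttrCollector()
    parser.feed(html)
    parser.close()
    return parser.vins


sample = (
    '<meta name="msvalidate.01" content="FAD32C2469C51767894EB50068D37244" />'
    '<div class="hproduct" data-vin="3GNDA23D18S647673" data-make="Chevrolet"></div>'
)
print(extract_vins(sample))  # ['3GNDA23D18S647673']
```

Because only attribute names are inspected, the long msvalidate.01 token is never considered at all.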

EDIT: if the VIN is always an attribute but you don't know its name, you can try:

doc.xpath('.//@*[string-length() = "17"]') # gets attributes with length 17

Or with a regex, if you really, really have to:

import re
re.findall('"([A-Z0-9]{17})"',html)
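As a side note, the unwanted FAD32C2469C517678 is just the first 17 characters of the longer msvalidate.01 token. If the VIN can also appear outside quotes (as in the dd element), one possible approach is to keep the pattern from the question and wrap it in negative lookarounds, so a match cannot start or end in the middle of a longer alphanumeric run. This is a sketch under that assumption; VIN_RE and find_vins are illustrative names.

```python
import re

# The pattern from the question, unchanged, guarded on both sides so that it
# cannot match a slice of a longer run of letters and digits.
VIN_RE = re.compile(
    r"(?<![0-9A-Za-z])"
    r"[0-9A-Z]{8}[0-9xX]{1}[1-9A-Y^U]{1}[0-9A-Z]{2}[0-9]{5}"
    r"(?![0-9A-Za-z])"
)


def find_vins(html):
    # De-duplicate, since the same VIN may appear in several places.
    return sorted(set(VIN_RE.findall(html)))


sample = (
    '<meta name="msvalidate.01" content="FAD32C2469C51767894EB50068D37244" />'
    '<div vin="3GNDA23D18S647673"></div>'
    "<dl class='vin'><dt>VIN:</dt><dd>3GNDA23D18S647673</dd></dl>"
)
print(find_vins(sample))  # ['3GNDA23D18S647673']
```

The 32-character token is rejected because the character right after its first 17 characters is another digit, which the trailing lookahead forbids.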

Answer 2

You should really use an HTML parser, but for a quick fix use the regexp (?<=vin=")[^"]+ :

>>> import re
>>> html = """.....
...  <meta name="google-site-verification" content="l4du7Ao9MH6TM2nJ5L54qqWoXWcuOkdeqhXCADgKErc" />
...  <meta name="msvalidate.01" content="FAD32C2469C51767894EB50068D37244" />
...      .....
...   <div class="hproduct auto chevrolet" data-classification="primary" data- vin="3GNDA23D18S647673" data-make="Chevrolet" >
... 
...  .....
...  </dd></dl><dl class='vin'><dt>VIN:</dt><dd>3GNDA23D18S647673</dd></dl> <span 
...  ....... etc...."""

>>> re.findall('(?<=vin=")[^"]+',html)
['3GNDA23D18S647673']

This uses a positive lookbehind to match [^"]+ (one or more characters that are not a double quote) immediately after the string vin=" .

If you want to be more strict in your match, you could use your regexp in combination with the positive lookbehind:
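To make the lookbehind behaviour a bit more concrete, here is a small comparison with an ordinary capture group. It is only an illustration on a made-up one-line snippet: both forms return the same list from findall, but the lookbehind is zero-width, so vin=" is not part of the match itself.

```python
import re

snippet = '<div data-make="Chevrolet" vin="3GNDA23D18S647673" class="x">'

# Lookbehind: vin=" must precede the match but is not consumed.
with_lookbehind = re.findall(r'(?<=vin=")[^"]+', snippet)

# Capture group: vin=" is consumed, and findall returns only the group.
with_group = re.findall(r'vin="([^"]+)"', snippet)

print(with_lookbehind)  # ['3GNDA23D18S647673']
print(with_group)       # ['3GNDA23D18S647673']

# The difference shows up in the full match (group 0).
print(re.search(r'(?<=vin=")[^"]+', snippet).group(0))  # 3GNDA23D18S647673
print(re.search(r'vin="([^"]+)"', snippet).group(0))    # vin="3GNDA23D18S647673"
```

Note that Python's re module requires a lookbehind to have a fixed width, which a literal such as vin=" satisfies.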

>>> re.findall('(?<=vin=")[0-9A-Z]{8}[0-9xX]{1}[1-9A-Y^U]{1}[0-9A-Z]{2}[0-9]{5}',html)
['3GNDA23D18S647673']

Answer 1
3 2013-01-07 11:50:56

Answer 2
1 2013-01-07 11:49:53