
How to extract the URL from this HTML tag?

I'm trying to get all URLs with id='revSAR' from the HTML tag below, using a Python regex:

<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>

I tried the code below, but it's not working (it prints nothing):

regex = b'<a id="revSAR" href="(.+?)" class="txtsmall noTextDecoration">(.+?)</a>'
pattern=re.compile(regex)
rev_url=re.findall(pattern,txt)
print ('reviews url: ' + str(rev_url))

You could try something like

import re

(_, url), = re.findall(r'href=([\'"]*)(\S+)\1', txt)  # txt holds the HTML; expects exactly one match
print(url)

However, personally I'd rather use an HTML parsing library like BeautifulSoup for a task like this.
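As a rough illustration of the parser-based route, here is a minimal sketch that uses html.parser from Python's standard library instead of BeautifulSoup, so that it runs without extra installs. The class name RevSarLinkParser and the shortened sample markup are made up for this example and are not part of the question.

```python
from html.parser import HTMLParser


class RevSarLinkParser(HTMLParser):
    """Collects href values of <a> tags whose id is 'revSAR' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("id") == "revSAR" and "href" in attr_map:
            self.urls.append(attr_map["href"])


# Shortened stand-in for the markup in the question, plus one unrelated link
html = """<a href='http://example.com/other'>Other link</a>
<a id='revSAR' href='http://www.amazon.com/product-reviews/B000EDKP8U?ie=UTF8' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>"""

parser = RevSarLinkParser()
parser.feed(html)
print(parser.urls)
```

Because the parser hands over attribute values already unquoted, the quote style and attribute order in the markup stop mattering.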

You don't need to match the unnecessary parts such as id=...href=...; try the following:

regex = r"http://.*'\s+"

First, why didn't your regex work? In your HTML the attributes are quoted with single quotes, whereas your regex uses double quotes. And you only need to care about the href attribute. Try something like href=['"](.+?)['"] as the regex; it would also be better to use the ignore-case flag.
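The difference can be sketched as follows. This assumes txt is a str (the question used a bytes pattern, which would need a bytes input), and the sample markup is a shortened stand-in for the tag in the question.

```python
import re

# Shortened stand-in for the markup in the question (note the single quotes)
txt = ("<a id='revSAR' href='http://www.amazon.com/product-reviews/B000EDKP8U?ie=UTF8' "
       "class='txtsmall noTextDecoration'>See all 136 customer reviews</a>")

# The pattern from the question expects double quotes, so nothing matches
original = re.compile(
    r'<a id="revSAR" href="(.+?)" class="txtsmall noTextDecoration">(.+?)</a>'
)
print(original.findall(txt))

# Accept either quote style and ignore case, as suggested above
relaxed = re.compile(r"""href=['"](.+?)['"]""", re.IGNORECASE)
rev_urls = relaxed.findall(txt)
print('reviews url: ' + str(rev_urls))
```

Note that the relaxed pattern matches the href of every tag in the input, not only the one with id='revSAR'.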

But again, it's a very bad decision to parse HTML using regex. Please go through this

Description

This expression will:

  • find anchor tags
  • require the anchor tag to have an id attribute with the value revSAR
  • capture the href attribute value, not including any surrounding quotes if they exist
  • capture the inner text, trimming the surrounding whitespace
  • allow the attributes to appear in any order
  • allow attribute values to be double-quoted, single-quoted, or unquoted
  • avoid many of the edge cases which frequently trip up regular expressions when pattern matching HTML

<a(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)revSAR\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=(['"]?)(.*?)\2(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(.*?)\s*<\/a>
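For anyone who wants to try the expression from Python, the following is a minimal sketch. It assumes the expression is typed as one continuous raw string with single backslashes and no literal spaces, and it uses shortened, made-up sample markup: one tag where id="revSAR" appears only inside an onmouseover value, and one tag with a real id attribute.

```python
import re

# The expression, kept on one line as a raw triple-quoted string
pattern = re.compile(
    r"""<a(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)revSAR\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=(['"]?)(.*?)\2(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(.*?)\s*<\/a>"""
)

# Shortened stand-ins: the first tag only mentions id="revSAR" inside a
# quoted onmouseover value, the second one carries the real id attribute
html = """<a onmouseover=' id="revSAR" ; href="http://www.NotYourURL.com" ; ' href='http://example.com/skip'>
  You shouldn't find me
</a>
<a id='revSAR' href='http://www.amazon.com/product-reviews/B000EDKP8U?ie=UTF8' class='txtsmall'>
  See all 136 customer reviews
</a>"""

# findall returns one tuple per match holding groups 1 to 4
matches = pattern.findall(html)
for id_quote, href_quote, url, text in matches:
    print(url, '|', text)
```

Only the second tag is reported, because the lookaheads step over quoted attribute values as whole units and so never see the id hidden inside onmouseover.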


Examples

Live Demo

Sample Text

Note that the first couple of anchor tags here have some really difficult edge cases.

<a onmouseover=' id="revSAR" ; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; '  href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  You shouldn't find me
</a>



<a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 111 customer reviews
</a>


<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>

Matches

Group 0 gets the entire anchor tag
Group 1 gets the quote surrounding the id attribute, which is used later to find the correct closing quote
Group 2 gets the quote surrounding the href attribute, which is used later to find the correct closing quote
Group 3 gets the href attribute value, not including any quotes
Group 4 gets the inner text, not including any surrounding whitespace

[0][0] = <a onmouseover=' img = 10; href="http://www.NotYourURL.com" ; if (3 <href&& href="http://www.NotYourURL.com" && 6>3) { funRotate(href) ; } ; ' id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 111 customer reviews
</a>
[0][1] = '
[0][2] = '
[0][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending
[0][4] = See all 111 customer reviews


[1][0] = <a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'>
  See all 136 customer reviews
</a>
[1][1] = '
[1][2] = '
[1][3] = http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending
[1][4] = See all 136 customer reviews
