
Extract URLs from specific tags in Python

Hi all. I have a huge HTML file that contains tags like these:

<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">

I need to extract all the URLs from this page in Python.

In a loop:

  1. Find occurrences of <h3 class="r"> one by one.

  2. Extract the URL.

I need to rewrite this script to extract all the links found on Google: http://xrayoptics.by.ru/database/misc/goog2text.py

How can I achieve that? Thanks.

from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

html = """<html>
...
<h3 class="r">
<a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class=l
   onmousedown="return clk(this.href,'','','','6','','0CDEQFjACOAM')">
text</a>
</h3>
...
<h3>Don't find me!</h3>
<h3 class="r"><a>Don't find me!</a></h3>
<h3 class="r"><a class="l">Don't error on missing href!</a></h3>
...
</html>
"""
# Parse with the HTML parser bundled with the standard library
soup = BeautifulSoup(html, "html.parser")

# Only <h3 class="r"> elements hold search results
for h3 in soup.find_all("h3", {"class": "r"}):
    # Require class "l" and an href attribute so anchors without a link are skipped
    for a in h3.find_all("a", {"class": "l", "href": True}):
        print(a["href"])
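
If installing BeautifulSoup is not an option, the same two steps (find each <h3 class="r">, then take the href of the link inside it) can probably be done with the standard library alone. The following is only a sketch: the names ResultLinkParser and extract_result_urls are made up for illustration, and it assumes the class attributes are exactly "r" and "l" with no additional class names.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect href values of <a class="l"> tags found inside <h3 class="r">."""

    def __init__(self):
        super().__init__()
        self.in_result_h3 = False  # True while inside an <h3 class="r">
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3":
            # Assumption: the class attribute is exactly "r"
            self.in_result_h3 = attrs.get("class") == "r"
        elif tag == "a" and self.in_result_h3:
            # Skip anchors that lack class "l" or have no href
            if attrs.get("class") == "l" and attrs.get("href"):
                self.urls.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_result_h3 = False


def extract_result_urls(html_text):
    """Return the result URLs in document order (hypothetical helper)."""
    parser = ResultLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.urls
```

Calling extract_result_urls(html) on the sample document above should return a list containing only the Wikipedia URL, since the other anchors either sit outside an <h3 class="r"> or have no href.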

I'd use XPath; see here for a question about which package would be appropriate in Python.
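
To give an idea of what the XPath approach looks like, here is a minimal sketch using xml.etree.ElementTree from the standard library, which supports a limited XPath subset. It assumes the page is well-formed XHTML; real Google result pages are usually not (note the unquoted class=l in the question), so in practice a lenient HTML parser with XPath support would be needed instead. The function name extract_result_urls_xpath is hypothetical.

```python
import xml.etree.ElementTree as ET

# Limited XPath: <a class="l"> with an href, directly inside <h3 class="r">
RESULT_LINK_PATH = ".//h3[@class='r']/a[@class='l'][@href]"


def extract_result_urls_xpath(xhtml_text):
    """Return href values of result links; input must be well-formed XML."""
    root = ET.fromstring(xhtml_text)
    return [a.get("href") for a in root.findall(RESULT_LINK_PATH)]


if __name__ == "__main__":
    page = """<html><body>
    <h3 class="r"><a href="http://en.wikipedia.org/wiki/Digital_Signature_Algorithm" class="l">text</a></h3>
    <h3>Not a result</h3>
    <h3 class="r"><a class="l">No href here</a></h3>
    </body></html>"""
    for url in extract_result_urls_xpath(page):
        print(url)
```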

You can use a regular expression (RegEx) for that. This RegEx matches any text that begins with http and runs up to the next double quote ( " ), which catches URLs stored in quoted attributes:

http([^\"]+)

And this is how it's done in Python:

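import re
# Match "http" followed by every character up to the next double quote
myRegEx = re.compile("http([^\"]+)")
# search() returns the first match only, or None when nothing matches
myResults = myRegEx.search('<source>')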

Replace '<source>' with the variable storing the source code you want to search for URLs.

myResults.start() and myResults.end() now give the start and end positions of the first URL found (search() stops at the first match). Use myResults.group() to get the string that matched the RegEx.
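
Since the question asks for all the URLs rather than just the first one, re.findall or re.finditer is probably a better fit than search(). The sketch below uses a slightly different pattern from the one above: it also requires the opening quote and captures the whole URL in a group. The names URL_PATTERN and find_quoted_urls are made up for this example.

```python
import re

# Variation on the pattern above: a double-quoted string starting with "http";
# the group captures the URL without the surrounding quotes.
URL_PATTERN = re.compile(r'"(http[^"]+)"')


def find_quoted_urls(source):
    """Return every double-quoted http(s) URL in source, in order of appearance."""
    # With a single capture group, findall() returns the group contents.
    return URL_PATTERN.findall(source)


if __name__ == "__main__":
    source = '<a href="http://example.com/a" class=l> <a href="https://example.org/b">'
    for match in URL_PATTERN.finditer(source):
        # start(1) is the position of the URL itself, after the opening quote
        print(match.start(1), match.group(1))
```

Note that this matches any quoted http string anywhere in the page, not only links inside <h3 class="r">, so it will likely return more than the search results.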

If anything isn't clear yet, just ask.
