How to extract a URL GET parameter from an <a> tag in the full HTML text
So I have an HTML page. It's full of various tags, and most of them have a sessionid GET parameter in their href attribute. Example:
...
<a href="struct_view_distrib.asp?sessionid=11692390">
...
<a href="SHOW_PARENT.asp?sessionid=11692390">
...
<a href="nakl_view.asp?sessionid=11692390">
...
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
...
So, as you can see, the sessionid is the same everywhere. I just need to get its value into a variable, no matter which link it comes from: x=11692390. I'm a newbie at regex, and Google wasn't helpful. Thanks a lot!
This does not use regexes, but anyway, this is what you would do in Python 2.6:
from BeautifulSoup import BeautifulSoup
import urlparse

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html)
# Only anchors that actually carry an href attribute
links = soup.findAll('a', href=True)
for link in links:
    href = link['href']
    url = urlparse.urlparse(href)
    params = urlparse.parse_qs(url.query)
    if 'sessionid' in params:
        print params['sessionid'][0]
Parse your HTML with a DOM parsing library, use getElementsByTagName('a') to grab the anchors, iterate through them, and use getAttribute('href') to extract the string. Then you can use a regex or split on ? to match/retrieve the session id.
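The steps described above can be sketched in Python using only the standard library. This is a minimal illustration, not the answerer's own code: the names SessionIdFinder and first_session_id are hypothetical, and html.parser's event-based HTMLParser stands in for a full DOM library.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class SessionIdFinder(HTMLParser):
    """Collects the sessionid value from every <a href=...> it sees."""

    def __init__(self):
        super().__init__()
        self.session_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Take the part after "?" and parse it into a dict of lists.
        params = parse_qs(urlparse(href).query)
        if "sessionid" in params:
            self.session_ids.append(params["sessionid"][0])


def first_session_id(html_text):
    """Return the first sessionid found in html_text, or None."""
    finder = SessionIdFinder()
    finder.feed(html_text)
    return finder.session_ids[0] if finder.session_ids else None


# Sample markup taken from the question.
sample = '''
<a href="struct_view_distrib.asp?sessionid=11692390">
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
'''
x = first_session_id(sample)
print(x)
```

Using parse_qs instead of a manual split on ? means extra parameters such as mode_id do not end up in the value.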
I would do this (I wrote it before I was told it was a Python question ;)
<script>
// Parse a query string into an object of name/value pairs.
// With no argument, the current page's query string is used.
function parseQString(loc) {
  var qs = {};
  loc = (loc == null) ? location.search.substring(1) : loc.split('?')[1];
  if (loc) {
    var parms = loc.split('&');
    for (var i = 0; i < parms.length; i++) {
      var nameValue = parms[i].split('=');
      qs[nameValue[0]] = (nameValue.length == 2) ? decodeURIComponent(nameValue[1].replace(/\+/g, ' ')) : null; // use null or ""
    }
  }
  return qs;
}
var ids = []; // will hold the IDs
window.onload = function() {
  var links = document.links;
  for (var i = 0, n = links.length; i < n; i++) {
    ids[i] = parseQString(links[i].href)["sessionid"];
  }
  alert(ids); // remove this when happy
  // here you can do
  alert(ids[3]);
  // to get the 4th link's sessionid
}
</script>
<a href="struct_view_distrib.asp?sessionid=11692390">
...</a>
<a href="SHOW_PARENT.asp?sessionid=11692390">
...</a>
<a href="nakl_view.asp?sessionid=11692390">
...</a>
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
...</a>
Here is a regular expression you can use to match the href and extract its value:
\b(?<=(href="))[^"]*?(?=")
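The expression above yields the whole href value, so the sessionid still has to be pulled out of it. As a rough sketch, a different pattern (not the one given in this answer) can capture the id directly. It assumes double-quoted href attributes and a purely numeric sessionid, as in the question's sample.

```python
import re

# Sample markup taken from the question.
html_text = '''
<a href="struct_view_distrib.asp?sessionid=11692390">
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
'''

# Inside an href value, find "sessionid=" preceded by ? or & and
# capture the run of digits that follows it.
pattern = re.compile(r'href="[^"]*?[?&]sessionid=(\d+)')

match = pattern.search(html_text)
x = match.group(1) if match else None
print(x)
```

Because search stops at the first hit, it does not matter which link the value comes from.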
Complete example for Python3, inspired by AbdealiJK:
response = """...
<a href="struct_view_distrib.asp?sessionid=11692390">
...
<a href="SHOW_PARENT.asp?sessionid=11692390">
...
<a href="nakl_view.asp?sessionid=11692390">
...
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
..."""
from bs4 import BeautifulSoup
import urllib.parse

soup = BeautifulSoup(response, "lxml")
for i in soup.find_all('a', href=True):
    try:
        # parse_qs returns a list of values for each parameter name
        print(urllib.parse.parse_qs(urllib.parse.urlparse(i['href']).query)["sessionid"])
    except KeyError:
        # this link has no sessionid parameter
        pass
bs4 4.7.1+ has all the functionality you need for this. Use CSS AND syntax combined with :not to select only URLs whose sole parameter is sessionid, and select_one to limit the result to the first match. Then split on that parameter and take the last element of the resulting list:
soup.select_one("[href*='asp?sessionid']:not([href*='&'])")['href'].split('sessionid=')[-1]
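To see why the selector excludes hrefs containing &, here is a small sketch of the split step on two of the question's sample URLs. The variable names are illustrative only, and the parse_qs line is an alternative to the split, not part of the answer above.

```python
from urllib.parse import urlparse, parse_qs

plain = "nakl_view.asp?sessionid=11692390"
with_extra = "move_sum_to_7300001.asp?sessionid=11692390&mode_id=0"

# Splitting on "sessionid=" keeps everything after it,
# including any parameters that follow.
tail_plain = plain.split("sessionid=")[-1]
tail_extra = with_extra.split("sessionid=")[-1]
print(tail_plain)   # 11692390
print(tail_extra)   # 11692390&mode_id=0

# Parsing the query string avoids the trailing parameters.
robust = parse_qs(urlparse(with_extra).query)["sessionid"][0]
print(robust)       # 11692390
```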