How to extract a URL GET parameter from <a> tags in the full HTML text

Question

So I have an HTML page. It's full of various tags, most of which have a sessionid GET parameter in their href attribute. Example:

...
<a href="struct_view_distrib.asp?sessionid=11692390">
...
<a href="SHOW_PARENT.asp?sessionid=11692390">
...
<a href="nakl_view.asp?sessionid=11692390">
...
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
...

So, as you can see, the sessionid is the same everywhere; I just need to get its value into a variable, no matter which link it comes from: x=11692390. I'm a newbie at regex, and Google wasn't helpful. Thanks a lot!

Answer 1

This does not use regexes, but anyway, this is what you would do in Python 2.6:

from BeautifulSoup import BeautifulSoup
import urlparse

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True)

for link in links:
  href = link['href']
  url = urlparse.urlparse(href)
  params = urlparse.parse_qs(url.query)
  if 'sessionid' in params:
    print params['sessionid'][0]

Answer 2

Parse your HTML with a DOM parsing library and use getElementsByTagName('a') to grab the anchors. Iterate through them, read each one's getAttribute('href'), and then extract the query string. From there you can use a regex or split on ? to match/retrieve the session id.
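A minimal sketch of that idea in Python, using only the standard library. Note that it swaps in html.parser.HTMLParser for a DOM library (so there is no getElementsByTagName call), and the names HrefCollector and first_sessionid are made up for this illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def first_sessionid(html):
    """Return the first sessionid found in any <a href>, or None if there is none."""
    collector = HrefCollector()
    collector.feed(html)
    for href in collector.hrefs:
        params = parse_qs(urlparse(href).query)
        if "sessionid" in params:
            return params["sessionid"][0]
    return None


sample = '''
<a href="struct_view_distrib.asp?sessionid=11692390">
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
'''
x = first_sessionid(sample)
```

Since every link carries the same sessionid, returning on the first hit is enough; parse_qs does the splitting on ? and & so no hand-written string handling is needed.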

Answer 3

I would have done it like this, before I was told it was a Python question ;)

<script>
function parseQString(loc) {
  var qs = {}; // plain object used as a name -> value map
  loc = (loc == null) ? location.search.substring(1) : loc.split('?')[1];
  if (loc) {
    var parms = loc.split('&');
    for (var i = 0; i < parms.length; i++) {
      var nameValue = parms[i].split('='); // declared locally so it does not leak as a global
      qs[nameValue[0]] = (nameValue.length == 2) ? decodeURIComponent(nameValue[1].replace(/\+/g, ' ')) : null; // use null or ""
    }
  }
  return qs;
}
var ids = []; // will hold the IDs
window.onload=function() {
  var links = document.links;
  var id;
  for (var i=0, n=links.length;i<n;i++) {
    ids[i] = parseQString(links[i].href)["sessionid"];
  }
  alert(ids); // remove this when happy
  // here you can do 
  alert(ids[3]); 
  //to get the 4th link's sessionid
}


</script>

<a href="struct_view_distrib.asp?sessionid=11692390">
...</a>
<a href="SHOW_PARENT.asp?sessionid=11692390">
...</a>
<a href="nakl_view.asp?sessionid=11692390">
...</a>
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
...</a>

Answer 4

Here is a regular expression you can use to match the href and extract its value:

\b(?<=(href="))[^"]*?(?=")
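For the Python case in the question, a regex-only version might look like the sketch below. It uses a simplified form of the pattern above (the leading \b and the capture group inside the lookbehind are dropped) and adds a second pattern for the parameter itself. The assumption that sessionid is purely numeric comes from the sample links, and the function name sessionid_from_html is hypothetical:

```python
import re

# Simplified form of the href pattern: everything between href=" and the closing quote.
HREF_RE = re.compile(r'(?<=href=")[^"]*?(?=")')

# Assumption: the sessionid value consists of digits only, as in the sample links.
SESSION_RE = re.compile(r'[?&]sessionid=(\d+)')


def sessionid_from_html(html):
    """Return the first sessionid found in a double-quoted href, or None."""
    for match in HREF_RE.finditer(html):
        found = SESSION_RE.search(match.group(0))
        if found:
            return found.group(1)
    return None


sample = '<a href="SHOW_PARENT.asp?sessionid=11692390">'
x = sessionid_from_html(sample)
```

This only sees href values written with double quotes, so it is more fragile than the parser-based answers if the markup varies.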

Answer 5

Complete example for Python 3, inspired by AbdealiJK:

response = """...
<a href="struct_view_distrib.asp?sessionid=11692390">
...
<a href="SHOW_PARENT.asp?sessionid=11692390">
...
<a href="nakl_view.asp?sessionid=11692390">
...
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
..."""

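from bs4 import BeautifulSoup
import urllib.parse

soup = BeautifulSoup(response, "lxml")  # the "lxml" parser must be installed separately
for link in soup.find_all('a', href=True):
    params = urllib.parse.parse_qs(urllib.parse.urlparse(link['href']).query)
    if "sessionid" in params:  # skip links without a sessionid instead of using a bare except
        print(params["sessionid"])  # parse_qs returns a list of values per parameter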

Answer 6

bs4 4.7.1+ has all the functionality you need for this. Use the CSS AND syntax combined with :not to match only URLs whose single parameter is sessionid, and select_one to limit the result to the first match. Then split on that parameter and take the last element of the resulting list:

soup.select_one("[href*='asp?sessionid']:not([href*='&'])")['href'].split('sessionid=')[-1]

Answer 1
10, accepted, 2010-08-17 09:24:45

Answer 2
5, 2010-08-17 09:09:49

Answer 3
2, 2010-08-17 09:17:53

Answer 4
1, 2010-08-17 09:12:00

Answer 5
1, 2019-11-07 07:27:18

Answer 6
1, 2019-11-07 08:09:06
