How can I select only the links from "#0-9" and "A-Z" in BeautifulSoup?

Question

My URL is this:

https://en.wikipedia.org/wiki/List_of_South_Korean_dramas

This works for selecting all the links under a letter heading from A to Z (shown here for A):

    link = s.get(url)
    link_soup = BeautifulSoup(link.text, 'lxml')
    links = (
        link_soup
        .select_one('#A')
        .parent
        .find_next_sibling("ul")
        .find_all("a", href=True)
    )

But when I try select_one with #0-9:

....

        link_soup
        .select_one('#0-9')
        .parent
        .find_next_sibling("ul")
        .find_all("a", href=True)
    )

I get this error:

SelectorSyntaxError: Malformed id selector at position 0
  line 1:
#0-9
^

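How can I select only the links from "#0-9" and "A-Z"? I know I could use a for loop, change the end of the URL with re, and scrape the links manually from there, but is there a way to get the same result using select or bs4?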

Thanks again for your help.

Answer 1

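To answer the immediate question: you can use an attribute = value CSS selector to specify the id attribute and its value. Because the value sits inside quotation marks, the leading digit does not cause a problem for the selector parser.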

    link_soup.select('[id="0-9"]')
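As a small illustration of why the quoted form works: an ID selector written with # must be a valid CSS identifier, which cannot begin with a digit, whereas a quoted attribute value is just a string. The sketch below is run against a made-up HTML fragment (not the real Wikipedia markup) and uses the standard-library html.parser so it does not depend on lxml. The helper name links_under and the example hrefs are hypothetical.

```python
import string
from bs4 import BeautifulSoup

# Made-up fragment imitating the heading/list layout assumed in the question.
HTML = """
<h2><span id="0-9">0-9</span></h2>
<ul><li><a href="/wiki/Example_1">Example 1</a></li></ul>
<h2><span id="A">A</span></h2>
<ul><li><a href="/wiki/Example_A">Example A</a></li></ul>
"""

soup = BeautifulSoup(HTML, "html.parser")


def links_under(soup, section_id):
    """Return the hrefs in the <ul> that follows the heading holding section_id."""
    # Quoted attribute value: a leading digit is fine here.
    anchor = soup.select_one(f'[id="{section_id}"]')
    if anchor is None:
        return []
    ul = anchor.parent.find_next_sibling("ul")
    if ul is None:
        return []
    return [a["href"] for a in ul.find_all("a", href=True)]


# "0-9" plus "A".."Z"; ids missing from the fragment simply yield no links.
section_ids = ["0-9", *string.ascii_uppercase]
hrefs = [href for sid in section_ids for href in links_under(soup, sid)]
print(hrefs)
```

This keeps the parent / find_next_sibling traversal from the question and only swaps the selector, so the same loop covers the numeric section and the letter sections.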

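Alternatively, escape the leading digit using its Unicode code point (the six-digit form is \000030; here it can be abbreviated to \30, and no trailing space is needed because the next character, -, is not a hex digit):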

    link_soup.select('#\\30-9')
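To check that the escaped and quoted forms address the same element, here is a minimal sketch on a made-up one-line fragment. It also calls soupsieve.escape, which, as far as I know, escapes an arbitrary string for use as a CSS identifier in the selector library that bs4 uses; treat that call as an assumption and verify it against your installed version.

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Made-up fragment containing only the element of interest.
soup = BeautifulSoup('<h2><span id="0-9">0-9</span></h2>', "html.parser")

by_attr = soup.select_one('[id="0-9"]')     # quoted attribute value
by_short = soup.select_one('#\\30-9')       # \30 is the code point of "0"
by_spaced = soup.select_one(r'#\30 -9')     # same escape with the optional terminating space
by_helper = soup.select_one('#' + sv.escape('0-9'))  # let soupsieve build the escape

# All four selectors should resolve to the very same Tag object.
same_element = by_attr is by_short is by_spaced is by_helper
print(by_attr, same_element)
```

Note the Python side of the escaping: '#\\30-9' and r'#\30 -9' both put a single backslash into the selector string.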

However, you can specify a pattern that extracts all the links in one go, without the extra traversal up and down the DOM.

    links = ['https://en.wikipedia.org' + i['href'] for i in link_soup.select('h2:not(:has(#See_also)) + ul a')]
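Here is a small sketch of how that selector behaves, run against a made-up fragment rather than the live page (the real markup may differ). h2:not(:has(#See_also)) keeps every h2 that does not contain the See_also anchor, + ul takes the list immediately following that heading, and a collects the links inside it. urljoin is swapped in for plain string concatenation; that is my choice here, not part of the one-liner above, and the example hrefs are hypothetical.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up fragment: two wanted sections plus a "See also" section to exclude.
HTML = """
<h2><span id="0-9">0-9</span></h2>
<ul><li><a href="/wiki/Example_1">Example 1</a></li></ul>
<h2><span id="A">A</span></h2>
<ul><li><a href="/wiki/Example_A">Example A</a></li></ul>
<h2><span id="See_also">See also</span></h2>
<ul><li><a href="/wiki/Example_other">Other list</a></li></ul>
"""

BASE = "https://en.wikipedia.org"
soup = BeautifulSoup(HTML, "html.parser")

# One selector: every link in a <ul> that directly follows a non-"See also" <h2>.
links = [
    urljoin(BASE, a["href"])
    for a in soup.select("h2:not(:has(#See_also)) + ul a")
]
print(links)
```

Because the section id never appears in the selector as a # identifier for the numeric heading, the leading-digit problem does not arise at all with this approach.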

Solution 1
1 accepted 2022-06-19 04:57:00
