
Extracting specific page links from an <a href> tag using BeautifulSoup

I am using BeautifulSoup to extract all the links from this page: http://kern.humdrum.org/search?s=t&keyword=Haydn

I am getting all these links this way:

# -*- coding: utf-8 -*-

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://kern.humdrum.org/search?s=t&keyword=Haydn'

# open the connection and grab the page
uClient = uReq(my_url)

# put all the content in a variable
page_html = uClient.read()

# close the connection
uClient.close()

# parse the HTML
page_soup = soup(page_html, "html.parser")

# grab all of the anchor tags that have an href attribute
containers = page_soup.find_all('a', href=True)

for container in containers:
    link = container
    print(link)
    print("---")

Part of my output is: [screenshot of the printed anchor tags]

Notice that it returns many links, but I really want only the ones ending in >Something (for example, ">Allegro", ">Allegro vivace", and so forth).

I am having a hard time getting the following type of output (from the example in the image): "Allegro - http://kern.ccarh.org/cgi-bin/ksdata?location=users/craig/classical/beethoven/piano/sonata&file=sonata01-1.krn&format=info"

In other words, at this point I have a bunch of anchor tags (roughly 1000). Among them are a bunch that are just "trash", plus roughly 350 tags that I would like to extract. These tags all look almost the same; the only difference is that the tags I need end with ">Somebody's name</a>". I would like to extract only the links of the anchor tags with this characteristic.

From what I can see in the image, the ones with info have an href attribute containing format=info, so you could use an attribute*=value CSS selector, [href*="format=info"], where the * indicates contains: it matches when the attribute value contains the quoted substring.

import bs4, requests

res = requests.get("http://kern.humdrum.org/search?s=t&keyword=Haydn")
soup = bs4.BeautifulSoup(res.text, "html.parser")
for link in soup.select('[href*="format=info"]'):
    print(link.getText(), link['href'])

The best and easiest way is to use the text attribute when printing the link, like this: print(link.text)
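As a minimal, self-contained sketch (the HTML snippet and URLs below are made up for illustration, not taken from the real page), this shows how the text attribute and the href attribute together give the "name - URL" output the question asks for:

```python
from bs4 import BeautifulSoup

# made-up snippet mimicking the anchor tags on the page
html = '''<a href="http://example.org/a?format=info">Allegro</a>
<a href="http://example.org/b?format=info">Allegro vivace</a>'''

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", href=True):
    # link.text gives the anchor text, link["href"] gives the URL
    print(link.text, "-", link["href"])
```

This prints lines of the form "Allegro - http://example.org/a?format=info".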

Assuming you already have a list of the substrings you need to search for, you can do something like:

for link in containers:
    text = link.get_text().lower()
    if any(text.endswith(substr) for substr in substring_list):
        print(link)
        print('---')
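For instance, with a hypothetical substring_list and a small made-up snippet standing in for the real page, the filter keeps only the anchors whose text ends with one of the substrings:

```python
from bs4 import BeautifulSoup

# made-up links for illustration
html = '''<a href="/x1">Allegro</a>
<a href="/x2">Allegro vivace</a>
<a href="/x3">search help</a>'''

containers = BeautifulSoup(html, "html.parser").find_all("a", href=True)
substring_list = ["allegro", "vivace"]  # hypothetical endings to keep

matches = [link for link in containers
           if any(link.get_text().lower().endswith(s) for s in substring_list)]
for link in matches:
    print(link.get_text(), link["href"])
```

Only the first two anchors survive; "search help" is filtered out.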

Do you want to extract the links with a specified anchor text?

for container in containers:
    link = container
    # exact match:
    #if 'Allegro di molto' == link.text:
    if 'Allegro' in link.text:  # substring match
        print(link)
        print("---")

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to republish, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.

粤ICP备18138465号 © 2020-2024 STACKOOM.COM