
How to extract URLs from HTML

I'm new to web scraping. This is what I tried:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar")
soup = BeautifulSoup(html, "html.parser")
res = soup.find_all('a', {'href': re.compile("r'\b?20\b'")})
print (res)

and get

[]

My goal is this fragment:

<script language="javascript" type="text/javascript">
cont = new Array();
count = new Array();
for (i=1979; i <=2015; i++){count[i]=0};
cont[1979] =    "<li><a href='?1979_1#24jan'>24 января</a>" +  

.............. ..............

cont[2016] =    "<li><a href='?2016/2016_spr#cur'>Весенняя серия</a>" +
        "<li><a href='?2016/2016_sum#cur'>Летняя серия</a>" +
        "<li><a href='?2016/2016_aut#cur'>Осенняя серия</a>" +
        "<li><a href='?2016/2016_win#cur'>Зимняя серия</a>";

And I am trying to get a result like this:

'?2016/2016_spr#cur' 
'?2016/2016_sum#cur'
'?2016/2016_aut#cur'
'?2016/2016_win#cur'

I need the links from 2000 up to now (that is why there is a '20' in "r'\b?20\b'"). Can you help me, please?

Preliminaries:

>>> import requests
>>> import bs4
>>> page = requests.get('http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')

Having done this, it might seem that the most straightforward way of identifying the script element is to use this:

>>> scripts = soup.findAll('script', text=bs4.re.compile('cont = new Array();'))

However, scripts proves to be an empty list. (I don't know why.)
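One likely cause of the empty list, independent of BeautifulSoup: in a regular expression, unescaped parentheses form a group rather than matching literal "(" and ")", so the pattern 'cont = new Array();' looks for the text "cont = new Array;" instead. The following is a minimal sketch using only the standard re module; the variable names are illustrative and not part of the original answer.

```python
import re

# The literal text we hope to find inside the script element.
script_text = "cont = new Array();"

# Unescaped: "()" is an empty group, so this pattern expects "Array;".
unescaped = re.compile('cont = new Array();')

# Escaped: re.escape() makes every special character literal.
escaped = re.compile(re.escape('cont = new Array();'))

no_match = unescaped.search(script_text)
match = escaped.search(script_text)

print(no_match)        # no match object is returned
print(match.group(0))  # the escaped pattern finds the literal text
```

If this is indeed the cause, passing the escaped pattern to the search should behave differently, although other factors on the real page cannot be ruled out from this sketch alone.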

The basic approach works if I choose a different target within the script, but it appears unsafe to depend on the exact formatting of the contents of a JavaScript script element.

>>> scripts = soup.find_all(string=bs4.re.compile('i=1979'))
>>> len(scripts)
1

Still, this might be good enough for you. Just note that the script ends with a change function, which needs to be discarded.

A safer approach might be to look for the containing table element, then the second td element within it, and finally the script within that.

>>> table = soup.find_all('table', class_='common_table')
>>> tds = table[0].findAll('td')[1]
>>> script = tds.find('script')

Again, you will need to discard the change function.
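Once the script element has been located by either approach, its text still has to be reduced to the href values the question asks for. A rough sketch of that last step follows, assuming the script text is available as a plain string (for example from script.string). The sample text is a shortened copy of the fragment in the question, and script_text, HREF_20XX and extract_hrefs are hypothetical names introduced here.

```python
import re

# Shortened sample of the script text, based on the fragment in the question.
script_text = '''
cont[1979] =    "<li><a href='?1979_1#24jan'>24 января</a>";
cont[2016] =    "<li><a href='?2016/2016_spr#cur'>Весенняя серия</a>" +
        "<li><a href='?2016/2016_sum#cur'>Летняя серия</a>" +
        "<li><a href='?2016/2016_aut#cur'>Осенняя серия</a>" +
        "<li><a href='?2016/2016_win#cur'>Зимняя серия</a>";
'''

# Capture every single-quoted href value that starts with "?20".
HREF_20XX = re.compile(r"href='(\?20[^']*)'")


def extract_hrefs(text):
    """Return all href values in text that begin with '?20'."""
    return HREF_20XX.findall(text)


links = extract_hrefs(script_text)
for link in links:
    print(link)
```

Because the pattern only matches href attributes, the rest of the script is skipped unless it happens to contain hrefs of the same form.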

You can use get('attribute') and then filter the results if needed:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar")
soup = BeautifulSoup(html, "html.parser")
res = [link.get('href') for link in soup.find_all('a')]
print (res)
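The filtering step mentioned above could look something like this. It is only a sketch: the list below is a hypothetical stand-in for the values link.get('href') may return, including None for an a tag without an href attribute.

```python
# Hypothetical href values of the kind collected in res above.
res = [
    '?1979_1#24jan',
    None,                      # <a> tag with no href attribute
    '?2016/2016_spr#cur',
    'http://example.com/',
    '?2016/2016_sum#cur',
]

# Keep only hrefs that exist and start with "?20" (years 2000 onwards).
filtered = [href for href in res if href and href.startswith('?20')]

print(filtered)
```

Checking href for truthiness first avoids calling startswith on None.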
