Scraping a URL using BeautifulSoup
Hello, I am a beginner in data scraping. In this case I want to get URLs like "https://...", but the result in the link variable ends up being a list of all links on the page. Here is the code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.detik.com/search/searchall?query=KPK'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
artikel = soup.findAll('div', {'class' : 'list media_rows list-berita'})
p = 1
link = []
for p in artikel:
    s = p.findAll('a', href=True)['href']
    link.append(s)
The code above raises the following error:
TypeError Traceback (most recent call last)
<ipython-input-141-469cb6eabf70> in <module>
3 link = []
4 for p in artikel:
5 s = p.findAll('a', href=True)['href']
6 link.append(s)
TypeError: list indices must be integers or slices, not str
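The error happens because findAll returns a ResultSet (a list of tags), and a list cannot be indexed with the string 'href' — only individual tags can. A minimal sketch of the difference, using an inline HTML snippet (hypothetical URLs) instead of the live page so it runs offline:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the real page, so the example is self-contained
html = """
<div class="list media_rows list-berita">
  <a href="https://news.detik.com/a">A</a>
  <a href="https://news.detik.com/b">B</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', {'class': 'list media_rows list-berita'})

anchors = div.find_all('a', href=True)  # a list of <a> tags
# anchors['href']                       # TypeError: list indices must be integers or slices, not str
links = [a['href'] for a in anchors]    # index each tag, not the list
print(links)  # → ['https://news.detik.com/a', 'https://news.detik.com/b']
```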
What I want is to get all the https://... links inside <div class='list media_rows list-berita'> as a list. Thank you in advance.
Code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.detik.com/search/searchall?query=KPK'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
articles = soup.findAll('div', {'class' : 'list media_rows list-berita'})
links = []
for article in articles:
    hrefs = article.find_all('a', href=True)
    for href in hrefs:
        links.append(href['href'])
print(links)
Output:
['https://news.detik.com/kolom/d-5609578/bahaya-laten-narasi-kpk-sudah-mati', 'https://news.detik.com/berita/d-5609585/penyuap-nurdin-abdullah-tawarkan-proyek-sulsel-ke-pengusaha-minta-rp-1-m', 'https://news.detik.com/berita/d-5609537/7-gebrakan-ahok-yang-bikin-geger', 'https://news.detik.com/berita/d-5609423/ppp-minta-bkn-jangan-asal-sebut-twk-kpk-dokumen-rahasia',
'https://news.detik.com/berita/d-5609382/mantan-sekjen-nasdem-gugat-pasal-suap-ke-mk-karena-dinilai-multitafsir', 'https://news.detik.com/berita/d-5609381/kpk-gali-informasi-soal-nurdin-abdullah-beli-tanah-pakai-uang-suap', 'https://news.detik.com/berita/d-5609378/hrs-bandingkan-kasus-dengan-pinangki-ary-askhara-tuntutan-ke-saya-gila', 'https://news.detik.com/detiktv/d-5609348/pimpinan-kpk-akhirnya-penuhi-panggilan-komnas-ham', 'https://news.detik.com/berita/d-5609286/wakil-ketua-kpk-nurul-ghufron-penuhi-panggilan-komnas-ham-soal-polemik-twk']
There is only one div with the class list media_rows list-berita, so you can use find instead of findAll.

Select the div with the class name list media_rows list-berita. Then select all the <a> tags from that div with findAll; this gives you a list of every <a> tag present inside the div. Finally, iterate over that list and extract the href from each <a>.

Here is a working code.
import requests
from bs4 import BeautifulSoup
url = 'https://www.detik.com/search/searchall?query=KPK'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
artikel = soup.find('div', {'class' : 'list media_rows list-berita'})
a_hrefs = artikel.findAll('a')
link = []
for k in a_hrefs:
    link.append(k['href'])
print(link)
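As an aside (not part of the answer above, just a common BeautifulSoup idiom), the same extraction can be written in one pass with a CSS selector via soup.select. Shown here on an inline snippet with hypothetical URLs so it runs without network access:

```python
from bs4 import BeautifulSoup

html = """
<div class="list media_rows list-berita">
  <a href="https://news.detik.com/x">X</a>
  <a href="https://news.detik.com/y">Y</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# 'div.list.media_rows.list-berita a[href]' matches every <a> carrying an
# href attribute inside the div that has all three classes
links = [a['href'] for a in soup.select('div.list.media_rows.list-berita a[href]')]
print(links)  # → ['https://news.detik.com/x', 'https://news.detik.com/y']
```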