Scraping a page for URLs using Beautifulsoup
I can scrape the page for the headlines, no problem. The URLs are another story. They are fragments that get appended to the end of a base URL, as I understand it. I need to extract the relevant URLs and store them in the format base_url.scraped_fragment
from urllib2 import urlopen
import requests
from bs4 import BeautifulSoup
import csv
import MySQLdb
import re
html = urlopen("http://advances.sciencemag.org/")
soup = BeautifulSoup(html.read().decode('utf-8'),"lxml")
#links = soup.findAll("a","href")
headlines = soup.findAll("div", "highwire-cite-title media__headline__title")
for headline in headlines:
    text = headline.get_text()
    print text
First of all, there should be a space between the class names:

highwire-cite-title media__headline__title
HERE^
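The class-matching behaviour behind that remark can be sketched with a tiny inline document (the markup below is hypothetical, and `html.parser` is used so no extra parser is needed):

```python
from bs4 import BeautifulSoup

html = """
<div class="highwire-cite-title media__headline__title">Headline A</div>
<div class="highwire-cite-title">Headline B</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A space-separated string is matched against the exact class attribute value,
# so only the first div (which carries both classes) is found
exact = soup.find_all("div", class_="highwire-cite-title media__headline__title")
print(len(exact))  # 1

# Searching for a single class matches any element whose class list contains it
single = soup.find_all("div", class_="highwire-cite-title")
print(len(single))  # 2

# A CSS selector matches elements carrying both classes, in any order
both = soup.select("div.highwire-cite-title.media__headline__title")
print(len(both))  # 1
```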
In any case, since you need the links, you should locate the a elements and use urljoin() to make absolute URLs:
from urlparse import urljoin
import requests
from bs4 import BeautifulSoup
base_url = "http://advances.sciencemag.org"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")
headlines = soup.find_all(class_="highwire-cite-linked-title")
for headline in headlines:
    print(urljoin(base_url, headline["href"]))
This prints:
http://advances.sciencemag.org/content/2/4/e1600069
http://advances.sciencemag.org/content/2/4/e1501914
http://advances.sciencemag.org/content/2/4/e1501737
...
http://advances.sciencemag.org/content/2/2
http://advances.sciencemag.org/content/2/1
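Note that the answer's `from urlparse import urljoin` is Python 2; on Python 3 the function lives in `urllib.parse`. The joining step itself can be checked with the standard library alone, using one of the fragments printed above:

```python
from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin

base_url = "http://advances.sciencemag.org"

# urljoin resolves a root-relative href against the scheme and host of the base URL
result = urljoin(base_url, "/content/2/4/e1600069")
print(result)  # http://advances.sciencemag.org/content/2/4/e1600069
```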