Web scraping, Python and BeautifulSoup

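I want to get one paragraph from a web page, but so far I have only managed the following. I get the page text with all HTML tags stripped, and I would like to know whether a specific paragraph can be extracted from all the text that is returned.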

Here is my code:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://en.wikipedia.org/wiki/Aras_(river)")
txt = response.content

soup = BeautifulSoup(txt,'lxml')
filtered = soup.get_text()
print(filtered)

Here is part of the text it prints:

>>>>Basin


    Main source
    Erzurum Province, Turkey


    River mouth
    Kura river


    Physical characteristics


    Length
    1,072 km (666 mi)


    The Aras or Araxes is a river in and along the countries of Turkey,     
    Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser 
    Caucasus Mountains and then joins the Kura River which drains the north 
    side of those mountains. Its total length is 1,072 kilometres (666 mi). 
    Given its length and a basin that covers an area of 102,000 square 
    kilometres (39,000 sq mi), it is one of the largest rivers of the 
    Caucasus.



    Contents


    1 Names
    2 Description
    3 Etymology and history
    4 Iğdır Aras Valley Bird Paradise
    5 Gallery
    6 See also
    7 Footnotes

I only want this paragraph:

    The Aras or Araxes is a river in and along the countries of Turkey,     
    Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser 
    Caucasus Mountains and then joins the Kura River which drains the north 
    side of those mountains. Its total length is 1,072 kilometres (666 mi). 
    Given its length and a basin that covers an area of 102,000 square 
    kilometres (39,000 sq mi), it is one of the largest rivers of the 
    Caucasus.

Is it possible to extract just this paragraph?

soup = BeautifulSoup(txt,'lxml')
filtered = soup.p.get_text() # get the first p tag.
print(filtered)

Output:

The Aras or Araxes is a river in and along the countries of Turkey, Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser Caucasus Mountains and then joins the Kura River which drains the north side of those mountains. Its total length is 1,072 kilometres (666 mi). Given its length and a basin that covers an area of 102,000 square kilometres (39,000 sq mi), it is one of the largest rivers of the Caucasus.
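`soup.p` returns whichever `<p>` comes first in the document, so it only works when that tag is the one you want. As a minimal sketch, assuming the first `<p>` on a page may be empty or may sit outside the main content, the helper below looks inside a container and returns the first paragraph that has visible text. The `SAMPLE_HTML` string and the `first_nonempty_paragraph` helper are hypothetical illustrations, not the real Wikipedia markup; the built-in `html.parser` is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not the real Wikipedia markup.
SAMPLE_HTML = """
<html><body>
<div id="mw-content-text">
  <table><tr><th>Length</th><td>1,072 km (666 mi)</td></tr></table>
  <p class="mw-empty-elt"></p>
  <p>The Aras or Araxes is a river in and along the countries of Turkey,
  Armenia, Azerbaijan, and Iran.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""


def first_nonempty_paragraph(html, container_id="mw-content-text"):
    """Return the text of the first <p> with visible text, or None."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        # Fall back to the whole document when the container id is absent.
        container = soup
    for p in container.find_all("p"):
        # Collapse runs of whitespace (including newlines) into single spaces.
        text = " ".join(p.get_text().split())
        if text:
            return text
    return None


print(first_nonempty_paragraph(SAMPLE_HTML))
```

With a live page, the same helper could be called on `response.content` from the question's code.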

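Use XPath instead! It is easier, more precise, and designed specifically for this kind of use case. Unfortunately, BeautifulSoup does not support XPath directly, so you need to use lxml instead: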

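from urllib.request import urlopen  # Python 3; urllib2 exists only in Python 2
from lxml import etree

response = urlopen("https://en.wikipedia.org/wiki/Aras_(river)")
parser = etree.HTMLParser()
tree = etree.parse(response, parser)
# string(...) returns the text of the first <p> directly under the container
print(tree.xpath('string(//*[@id="mw-content-text"]/p[1])'))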

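Notes on the XPath expression:

// searches the whole document, starting from the root, rather than only the root's direct children.

* matches an element with any tag name.

[@id="mw-content-text"] is a predicate: it keeps only elements whose id attribute equals mw-content-text.

p[1] selects the first p element that is a direct child of that container.

string(...) returns the text content of the selected element as a string.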
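The parts of the expression can be tried one at a time on a small inline document. This is only a sketch: `SAMPLE_HTML` is a hypothetical snippet, not the real Wikipedia markup, and it deliberately nests one `<p>` inside a table to show the difference between `/p` (direct children) and `//p` (all descendants).

```python
from lxml import etree

# Hypothetical snippet for illustration; not the real Wikipedia markup.
SAMPLE_HTML = """
<html><body>
<div id="mw-content-text">
<table><tr><td><p>Infobox text</p></td></tr></table>
<p>The Aras or Araxes is a river.</p>
<p>Second paragraph.</p>
</div>
</body></html>
"""

root = etree.HTML(SAMPLE_HTML)

# //*[@id=...] finds the element with that id anywhere in the document.
containers = root.xpath('//*[@id="mw-content-text"]')

# /p keeps only <p> elements that are direct children of the container.
direct = root.xpath('//*[@id="mw-content-text"]/p')

# //p also reaches the <p> nested inside the table.
nested = root.xpath('//*[@id="mw-content-text"]//p')

# p[1] picks the first direct-child <p>; string(...) turns it into text.
first = root.xpath('string(//*[@id="mw-content-text"]/p[1])')

# When nothing matches, string(...) yields an empty string, not an error.
missing = root.xpath('string(//*[@id="no-such-id"]/p[1])')

print(len(containers), len(direct), len(nested))
print(first)
print(repr(missing))
```

Because a non-matching path yields an empty string, an empty result from the full expression usually means the page structure differs from what the XPath assumes.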

By the way, if you are using Google Chrome or Firefox, you can test XPath expressions in DevTools with the $x function:

$x('string(//*[@id="mw-content-text"]/p[1])')
