
Web scraping with Beautiful Soup in Python

I am trying to scrape a product name from a website. The text is contained in a tag with attributes I have never seen before, so I do not know how to get the text out.

<h1 class="protect" data-category="Jackets" data-ino="SS18J42"
    data-rd="02/22/2018" data-rw="1SS18" itemprop="name">Gradient Puffy Jacket</h1>

You can use BeautifulSoup:

from bs4 import BeautifulSoup as soup

s = """
<h1 class="protect" data-category="Jackets" data-ino="SS18J42"
    data-rd="02/22/2018" data-rw="1SS18" itemprop="name">Gradient Puffy Jacket</h1>
"""
# Parse with lxml (requires the lxml package) and match the <h1> by its itemprop attribute
new_s = soup(s, 'lxml').find('h1', {'itemprop': 'name'}).text

Output:

'Gradient Puffy Jacket'
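
On a real page the heading may be missing, in which case `find` returns `None` and `.text` raises an `AttributeError`. The following is a minimal sketch of a guarded lookup. The function name `extract_product_name` is hypothetical, and `html.parser` is used here only to avoid the extra lxml dependency.

```python
from bs4 import BeautifulSoup


def extract_product_name(html):
    """Return the text of the <h1 itemprop="name"> element, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h1", {"itemprop": "name"})
    if tag is None:
        # find() returns None when nothing matches
        return None
    # strip=True trims leading and trailing whitespace from the text
    return tag.get_text(strip=True)


page = '<h1 class="protect" data-category="Jackets" itemprop="name">Gradient Puffy Jacket</h1>'
print(extract_product_name(page))
print(extract_product_name("<p>no heading here</p>"))
```

The first call prints the product name; the second prints `None` instead of raising.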

Adding to Ajax1234's answer: you can also search by the other HTML attributes:

from bs4 import BeautifulSoup

s = """
<h1 class="protect" data-category="Jackets" data-ino="SS18J42"
    data-rd="02/22/2018" data-rw="1SS18" itemprop="name">Gradient Puffy Jacket</h1>
"""
soup = BeautifulSoup(s, 'html.parser')

# Each lookup matches the same <h1>, using a different attribute
print(soup.find('h1', {'class': 'protect'}).text)
print(soup.find('h1', {'data-category': 'Jackets'}).text)
print(soup.find('h1', {'data-ino': 'SS18J42'}).text)

The same pattern works for any other attribute on the tag.
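
The same attribute lookups can also be written as CSS selectors, and the attribute values themselves can be read from the matched tag. This is a sketch, assuming a Beautiful Soup version that provides `select_one` (it relies on the soupsieve package that ships with recent bs4 releases); the shortened HTML string is for illustration only.

```python
from bs4 import BeautifulSoup

html = (
    '<h1 class="protect" data-category="Jackets" data-ino="SS18J42" '
    'itemprop="name">Gradient Puffy Jacket</h1>'
)
soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: an <h1> whose data-ino equals SS18J42
tag = soup.select_one('h1[data-ino="SS18J42"]')

name = tag.get_text(strip=True)
category = tag["data-category"]   # dictionary-style access to an attribute
release = tag.get("data-rd")      # .get() returns None for a missing attribute
classes = tag["class"]            # class is multi-valued, so this is a list

print(name, category, release, classes)
```

Using `tag.get(...)` rather than `tag[...]` avoids a `KeyError` when an attribute is not present on every product page.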

BeautifulSoup also lets you access a tag directly by name as an attribute of the soup object, so you could use the following approach:

from bs4 import BeautifulSoup

html = """<h1 class="protect" data-category="Jackets" data-ino="SS18J42"
    data-rd="02/22/2018" data-rw="1SS18" itemprop="name">Gradient Puffy Jacket</h1>"""

soup = BeautifulSoup(html, "html.parser")
# soup.h1 is shorthand for the first <h1> in the document
print(soup.h1.text)

Alternatively, using find in BeautifulSoup:

soup = BeautifulSoup(html, "html.parser")
print(soup.find('h1', {'class': 'protect'}).text)
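
Text scraped from real pages often contains line breaks or runs of spaces inside the tag, and `.text` returns them unchanged. The sketch below shows one common way to collapse such whitespace using plain string methods; the wrapped HTML string is a made-up example.

```python
from bs4 import BeautifulSoup

# The heading text is deliberately wrapped across two lines
html = """<h1 itemprop="name">Gradient Puffy
   Jacket</h1>"""

soup = BeautifulSoup(html, "html.parser")

raw = soup.h1.text              # keeps the embedded newline and indentation
clean = " ".join(raw.split())   # split on any whitespace, rejoin with single spaces

print(repr(raw))
print(repr(clean))
```

Note that `get_text(strip=True)` only trims the ends of each text node, so whitespace inside the text still needs the split-and-join step.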

Either of the two methods below will find the required data; both work well.

For more information, please read the documentation.

First method

from bs4 import BeautifulSoup as soup
html = """
<div itemscope>  <p itemprop="a">1</p>
"""
src = soup(html, 'lxml').find('p', {'itemprop':'a'}).text
print(src)

Output: 1

Second method

from bs4 import BeautifulSoup
s = """
<a class="doctor-name" itemprop="name" href="/doctors/gastroenterologists       /required-code-1689679557">Required-code-or-output</a>
"""
soup = BeautifulSoup(s, 'html.parser')

print(soup.find('a', {'class': 'doctor-name'}).text)
print(soup.find('a', {'itemprop': 'name'}).text)

Output (printed twice, once per lookup): Required-code-or-output
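
When a page carries several `itemprop` elements, they can be collected in one pass rather than looked up one by one. This is a sketch under assumptions: the combined HTML and the `/example` href are invented for illustration, and passing `True` as an attribute value is used to match any tag that has that attribute at all.

```python
from bs4 import BeautifulSoup

html = """
<div itemscope>
  <p itemprop="a">1</p>
  <a class="doctor-name" itemprop="name" href="/example">Required-code-or-output</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs={"itemprop": True} matches every tag that defines an itemprop attribute
items = {
    tag["itemprop"]: tag.get_text(strip=True)
    for tag in soup.find_all(attrs={"itemprop": True})
}

print(items)
```

If two elements share the same `itemprop` value, the later one overwrites the earlier one in this dictionary, so a list of pairs may be a better fit for such pages.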
