Basic Python web scraping with BeautifulSoup

Question

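I'm new to coding and recently started looking into web scraping. I have been following this tutorial and reading the BS4 documentation, but I can't see why my code isn't working.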

I'm trying to use a web scraper to extract the headline of this post, but it can't seem to find any tag matching ('div', class_='header').

My code:

import requests
from bs4 import BeautifulSoup

SOURCE = requests.get('http://coreyms.com/').text
SOUP = BeautifulSoup('SOURCE', 'lxml')

HEADER = SOUP.find('div', class_='header')
HEADLINE = HEADER.h2.a.href

print(HEADLINE)

Error message:

Traceback (most recent call last):
   File "WSCoreySchafer.py", line 10, in <module>
    HEADLINE = ARTICLE.h2.a.href
AttributeError: 'NoneType' object has no attribute 'h2'

Answer 1

This line:

SOUP = BeautifulSoup('SOURCE', 'lxml')

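tries to create the soup object from the literal string 'SOURCE' rather than from the value stored in the variable SOURCE. The parsed document therefore contains no tags at all, so find() returns None and the next line fails with the AttributeError shown above.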
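To make the difference concrete, here is a minimal sketch that parses both the quoted name and the variable. The HTML snippet is a made-up stand-in for the downloaded page (an assumption, not the real site markup), and 'html.parser' is used only so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page text returned by requests.get(...).text
SOURCE = '<h2 class="entry-title"><a href="/post">Post</a></h2>'

# Quoted: BeautifulSoup parses the six characters S-O-U-R-C-E, which contain no tags.
literal_soup = BeautifulSoup('SOURCE', 'html.parser')
# Unquoted: BeautifulSoup parses the HTML held in the variable.
variable_soup = BeautifulSoup(SOURCE, 'html.parser')

literal_match = literal_soup.find('h2', class_='entry-title')
variable_match = variable_soup.find('h2', class_='entry-title')

print(literal_match)    # nothing to find in plain text
print(variable_match)   # the <h2> tag from the sample markup
```

Only the second soup contains an <h2> to find; the first lookup comes back empty, which is the same situation that produces the 'NoneType' error in the question.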

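You are also looking for the wrong element in the HTML. There is no <div> with class="header"; what you actually want is a <header> element (there are several on this page). I would suggest looking for the <h2> element with class="entry-title" instead, which you can do like this: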

import requests
from bs4 import BeautifulSoup

SOURCE = requests.get('http://coreyms.com/').text
SOUP = BeautifulSoup(SOURCE, 'lxml')

HEADER = SOUP.find('h2', class_='entry-title')
headline_href = HEADER.a['href']
print(headline_href)

which prints:

http://coreyms.com/development/best-sublime-text-features-and-shortcuts
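Two details in the corrected code are worth spelling out: the link is read with a['href'] rather than a.href, and find() can still return None if the page layout changes. The sketch below illustrates both on a made-up HTML snippet; the markup, the example.com URL, and the helper name first_headline_link are assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely shaped like a blog post header; not the real page.
SAMPLE_HTML = (
    '<article><header><h2 class="entry-title">'
    '<a href="http://example.com/post">Title</a>'
    '</h2></header></article>'
)

def first_headline_link(html):
    """Return the href of the first entry-title link, or None if it is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find('h2', class_='entry-title')
    # find() returns None when nothing matches, so check before chaining.
    if heading is None or heading.a is None:
        return None
    # .get('href') reads the attribute and returns None if it is absent.
    return heading.a.get('href')

link = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('h2', class_='entry-title').a
# Dotted access searches for a child tag named <href>, not the href attribute.
dotted = link.href

print(first_headline_link(SAMPLE_HTML))
print(first_headline_link('<p>no headline here</p>'))
print(dotted)
```

Dotted access on a tag is shorthand for finding a child tag with that name, which is why HEADER.h2.a.href in the question would not have returned the URL even with the other two problems fixed.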

Solution 1
3, accepted, 2018-09-10 21:29:43