Python error: 'NoneType' object has no attribute 'find_all' using Beautiful Soup

Question

I'm having a problem with some web-scraping code that I'm trying to run. I want to scrape information from a series of links like the following:

http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument

I am trying to scrape certain elements from the table, but I received the following error:

Python Error: 'NoneType' object has no attribute 'find_all'

I know this happens because the code never actually finds the table. When I run the following simplified code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')


table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)

It prints None for the table, so the code cannot scrape any of the table's contents. I've been running similar code on similar pages and it finds the table just fine, so I'm not sure why this one isn't working. I'm new to web scraping, but I'd appreciate any help!

Answer 1

The soup doesn't parse the page content correctly because one tag is malformed and breaks the structure: a </script end tag is followed by a newline instead of >. You have to fix it before parsing:

from bs4 import BeautifulSoup
import requests

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
# Close the malformed "</script" end tag before handing the text to the parser
soup = BeautifulSoup(page.text.replace("</script\n", "</script>"), 'html.parser')

table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)
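The repair step can be checked offline without fetching the page. The following is a minimal sketch, not the answerer's code: it uses the standard library's HTMLParser instead of Beautiful Soup, and the names repair_script_close, StartTagCollector, and the broken sample string are hypothetical, made up to imitate the malformed tag described above.

```python
from html.parser import HTMLParser


def repair_script_close(html):
    """Close a '</script' end tag that is followed by a newline instead of '>'."""
    return html.replace("</script\n", "</script>")


class StartTagCollector(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Hypothetical sample imitating the page's malformed script end tag.
broken = (
    '<script>var x = 1;</script\n'
    '<table bordercolor="#6583A0"><tr><td>a</td></tr></table>'
)
fixed = repair_script_close(broken)

collector = StartTagCollector()
collector.feed(fixed)
collector.close()
print(collector.tags)
```

Once the end tag is closed, the table and its rows show up as ordinary start tags. Note that the replacement only covers this one malformation; a page with a different flaw would need a different fix or a more tolerant parser.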

Answer 2

I think the HTML contains some flaws that make html.parser fail to parse it properly. You can verify this by printing page.text and then printing soup: you will find that the parser has dropped parts of the document.

However, the lxml parser handles the page despite its flaws, because lxml is more tolerant of malformed HTML documents (the lxml package must be installed):

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')


table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)

That should find the table tag correctly.
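Whichever parser is used, the original error comes from calling find_all on the None that soup.find returns when nothing matches. A minimal sketch of guarding against that, while trying lxml first and falling back to html.parser, might look like the following. The helper name find_table_rows and the sample markup are hypothetical; FeatureNotFound is the exception Beautiful Soup raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_table_rows(html, attrs, parsers=("lxml", "html.parser")):
    """Return the <tr> tags of the first matching table, or [] if none is found."""
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
        table = soup.find("table", attrs)
        if table is not None:
            # Only call find_all once we know the table exists.
            return table.find_all("tr")
    return []


# Hypothetical, well-formed sample markup (not the real page).
sample = (
    '<html><body><table bordercolor="#6583A0">'
    "<tr><td>a</td><td>b</td></tr>"
    "</table></body></html>"
)

rows = find_table_rows(sample, {"bordercolor": "#6583A0"})
cells = [td.get_text() for td in rows[0].find_all("td")]
print(cells)

# No table has this attribute value, so the helper returns an empty list
# instead of raising AttributeError on None.
missing = find_table_rows(sample, {"bordercolor": "#000000"})
print(missing)
```

Returning an empty list keeps the calling loop simple when scraping many links, at the cost of hiding which pages failed; logging the URL in the fallback path would be a reasonable addition.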

Answer 3


import pandas as pd

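# read_html returns a list of DataFrames, one per <table> found; take the first
df = pd.read_html(
    "http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument")[0]

print(df)
# Write the rows without the index or a header row
df.to_csv("Data.csv", index=False, header=False)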

Output: view online


3 answers

Answer 1
1 2020-04-17 23:02:49

Answer 2
1 Accepted 2020-04-17 23:03:34

Answer 3
0 2020-04-17 23:17:22