
Python Error: 'NoneType' object has no attribute 'find_all' using Beautiful Soup

I'm having a problem with some web-scraping code. I want to scrape information from a series of links like the following:

http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument

I am trying to scrape certain elements from the table, but I received the following error:

Python Error: 'NoneType' object has no attribute 'find_all'

I know this has to do with the fact that it's not actually finding the table because when I run the following simplified code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')


table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)

It prints None for the table, so the code cannot scrape anything from it. I've been running similar code on similar pages and it finds the table just fine, so I'm not sure why this one isn't working. I'm new to web scraping, and I'd appreciate any help!

The soup doesn't parse the page content correctly because one closing tag is malformed (a </script that is missing its closing >), which breaks the structure. You have to fix it before parsing:

from bs4 import BeautifulSoup
import requests

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
soup = BeautifulSoup(page.text.replace("</script\n", "</script>"), 'html.parser')

table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)
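The exact-string replace above depends on the broken tag being followed by a single newline. A slightly more general repair can be sketched with a regular expression. Note that `repair_script_close` is a hypothetical helper name, not part of Beautiful Soup, and the pattern assumes the only defect is a `</script` that is missing its `>`:

```python
import re

# Matches "</script" that is NOT followed by optional whitespace and ">",
# and swallows the whitespace that stands where the ">" should be.
_BROKEN_SCRIPT_CLOSE = re.compile(r"</script\b(?!\s*>)\s*", re.IGNORECASE)


def repair_script_close(html: str) -> str:
    """Return html with unterminated </script tags closed (hypothetical helper)."""
    return _BROKEN_SCRIPT_CLOSE.sub("</script>", html)


# Small offline sample that imitates the defect; not the real page content.
broken = '<script>var x = 1;</script\n<table bordercolor="#6583A0"></table>'
print(repair_script_close(broken))
```

The repaired string would then be passed to BeautifulSoup in place of page.text.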

I think the HTML contains some flaws that make html.parser fail to parse it properly. You can verify this by printing page.text and then printing soup: you will find that the parser has dropped parts of the document.
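One rough way to make that comparison without reading both dumps by eye is to count `<table` occurrences in the raw text and compare that with the number of table tags in the soup. This is only a heuristic sketch: `count_tables` is a made-up helper, and the raw count can over-count if `<table` appears inside scripts or comments.

```python
from bs4 import BeautifulSoup


def count_tables(html, parser="html.parser"):
    """Return (occurrences of '<table' in the raw text, table tags the parser found)."""
    raw_count = html.lower().count("<table")
    parsed_count = len(BeautifulSoup(html, parser).find_all("table"))
    return raw_count, parsed_count


# Well-formed offline sample; with the real page you would pass page.text instead.
sample = '<html><body><table bordercolor="#6583A0"><tr><td>x</td></tr></table></body></html>'
raw, parsed = count_tables(sample)
# A parsed count lower than the raw count suggests the parser dropped markup.
print(raw, parsed)
```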

However, the lxml parser handles the flawed markup successfully, since lxml is more lenient with ill-formed HTML documents:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')


table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)

That should find the table tag correctly (the lxml package must be installed for this parser to be available).
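If it is unclear in advance which parser copes with a given page, one option is to try several in turn and fail with a clear message instead of the 'NoneType' error. A minimal sketch: `find_table` and its parameters are hypothetical names, and the bordercolor value is the one from the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_table(html, attrs, parsers=("lxml", "html.parser")):
    """Return the first matching <table> found by any parser, else raise ValueError."""
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
        table = soup.find("table", attrs)
        if table is not None:
            return table
    raise ValueError(f"no <table> matching {attrs!r} found with parsers {parsers!r}")


# Well-formed offline sample; with the real page you would pass page.text instead.
sample = '<table bordercolor="#6583A0"><tr><td>a</td><td>b</td></tr></table>'
table = find_table(sample, {"bordercolor": "#6583A0"})
print([td.get_text() for td in table.find_all("td")])
```

Checking for None before calling find_all turns a confusing AttributeError into an error that names the missing table.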


import pandas as pd

df = pd.read_html(
    "http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument")[0]

print(df)
df.to_csv("Data.csv", index=False, header=False)

