Python Error: 'NoneType' object has no attribute 'find_all' using Beautiful Soup

Question

I'm having a problem with some web-scraping code that I'm trying to run. I want to scrape information from a series of links like the following:

http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument

I am trying to scrape certain elements from the table, but I get the following error:

Python Error: 'NoneType' object has no attribute 'find_all'

I know this has to do with the fact that it's not actually finding the table because when I run the following simplified code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')


table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)

It prints None for the table, so the code cannot scrape any of the table's contents. I've been running similar code on similar pages and it finds the table just fine, so I'm not sure why this one doesn't work. I'm new to web scraping, but I'd appreciate any help!

Answer 1

The soup doesn't parse the page content correctly because one closing tag is malformed (a </script tag that is missing its closing >), which breaks the structure. You have to fix it before parsing:

from bs4 import BeautifulSoup
import requests

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
# Repair the unterminated "</script" closing tag before handing the HTML to the parser
soup = BeautifulSoup(page.text.replace("</script\n", "</script>"), 'html.parser')

table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)
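
As a rough sketch of the same repair step in isolation, the replacement can be wrapped in a small helper that also reports how many tags it fixed, which makes it easy to check whether a given page has this particular problem. The function name repair_script_tags and the sample string are made up for illustration; only the "</script\n" to "</script>" replacement comes from the answer above.

```python
def repair_script_tags(html):
    """Close '</script' tags that are followed by a newline instead of '>'.

    Returns the repaired HTML and the number of tags that were fixed.
    (Hypothetical helper; it only handles the exact pattern seen on this page.)
    """
    broken = "</script\n"
    fixed_count = html.count(broken)
    return html.replace(broken, "</script>"), fixed_count


# Made-up sample with the same flaw: the closing script tag has no '>'
sample = "<script>var a = 1;</script\n<table><tr><td>x</td></tr></table>"
repaired, fixed_count = repair_script_tags(sample)
print(fixed_count)
print(repaired)
```

The repaired string can then be passed to BeautifulSoup in place of page.text. If fixed_count is 0, the page likely has a different flaw and this fix will not help.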

Answer 2

I think the HTML contains some flaws that make html.parser fail to parse it properly. You can verify this by printing page.text and then printing soup; you will find that the parser has dropped parts of the document.

However, the lxml parser parses it successfully despite the flaw, as lxml is more tolerant of malformed HTML documents (lxml is a separate package, installed with pip install lxml):

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')


table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)

That should find the table tag correctly.
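
If you scrape many pages and don't know in advance which parser copes with each one, one possible approach is to try several parsers in turn and check for None before calling find_all, which is where the original error came from. This is only a sketch: the helper name find_table, the parser order, and the inline sample HTML are assumptions for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_table(html, attrs, parsers=("lxml", "html.parser")):
    """Return (table, parser_name) for the first parser that finds the table.

    Returns (None, None) if no parser finds a matching <table>.
    (Hypothetical helper; the parser order is an assumption.)
    """
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
        table = soup.find("table", attrs)
        if table is not None:
            return table, parser
    return None, None


# Made-up, well-formed sample standing in for page.text
sample = ('<html><body><table bordercolor="#6583A0">'
          '<tr><td>a</td></tr><tr><td>b</td></tr></table></body></html>')

table, parser = find_table(sample, {"bordercolor": "#6583A0"})
if table is None:
    print("No matching table found")
else:
    # Safe to call find_all now that we know table is not None
    rows = table.find_all("tr")
    print(parser, len(rows))
```

Checking for None explicitly turns the confusing 'NoneType' object has no attribute 'find_all' error into a clear message about which page failed.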

Answer 3


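import pandas as pd

# read_html returns a list of DataFrames, one per <table> it finds; take the first
df = pd.read_html(
    "http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument")[0]

print(df)
# Write the table to CSV without the index column or a header row
df.to_csv("Data.csv", index=False, header=False)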
