
Beautiful Soup is not selecting any element

This is the code I am using to iterate over all elements:

soup_top = bs4.BeautifulSoup(r_top.text, 'html.parser')

selector = '#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a'

for link in soup_top.select(selector):
    print(link)

The same selector gives a length of 57 when used in JavaScript:

document.querySelectorAll("#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a").length;

I thought that maybe I was not fetching the contents of the webpage correctly, so I saved a local copy of the page, but the selector still did not match anything in Beautiful Soup. What is going on here?

This is the website I am using the code on.

It seems that this is due to the parser you used (i.e. html.parser). If I try the same thing with lxml as the parser:

from bs4 import BeautifulSoup
import requests

url = 'http://www.swapnilpatni.com/law_charts_final.php'
r = requests.get(url)
r.raise_for_status()

soup = BeautifulSoup(r.text, 'lxml')

css_select = '#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a'
links = soup.select(css_select)
print('{} link(s) found'.format(len(links)))

>> 1 link(s) found

for link in links:
    print(link['href'])

>> spadmin/doc/Company Law amendment 1.1.png

With html.parser, the selector only matches as far as #ContentPlaceHolder1_gvDisplay table tr, and even then it only returns the first tr.
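To see how much the choice of parser matters, the same selector can be run against every parser that happens to be installed. This is only a minimal sketch: count_matches and the sample markup are hypothetical names and data made up for illustration, not taken from the page in question, and parsers that are not installed are reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_matches(html, selector, parsers=('html.parser', 'lxml', 'html5lib')):
    """Return {parser_name: number of elements matched by selector}.

    A parser that is not installed maps to None instead of a count.
    """
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            counts[parser] = None
            continue
        counts[parser] = len(soup.select(selector))
    return counts


# Hypothetical, well-formed sample; the real page is much larger and malformed.
sample = ('<div id="ContentPlaceHolder1_gvDisplay"><table>'
          '<tr><td>1</td><td>2</td><td><a href="x.png">x</a></td></tr>'
          '</table></div>')
selector = '#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a'

print(count_matches(sample, selector))
```

On well-formed markup like this sample the installed parsers should agree. Passing r.text from the real page instead of sample is where the counts would be expected to diverge.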

When running the URL through the W3C Markup Validation Service, this is the error that is returned:

Sorry, I am unable to validate this document because on line 1212 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication. The error was: utf8 "\xA0" does not map to Unicode

It's likely that html.parser chokes on this invalid byte as well, while lxml is more fault-tolerant.
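The validator's complaint can be reproduced in isolation with the standard library. The sketch below assumes the page was really written in a single-byte encoding such as Windows-1252, which the validator output does not confirm. The sample bytes and the decode_report helper are made up for illustration.

```python
def decode_report(data, encodings=('utf-8', 'cp1252')):
    """Try each encoding and record either the decoded text or the error reason."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError as exc:
            results[enc] = 'error: {}'.format(exc.reason)
    return results


# Hypothetical sample: a lone 0xA0 byte between two words, as a
# Windows-1252 editor would write a non-breaking space.
raw = b'Company\xa0Law'

report = decode_report(raw)
for encoding, outcome in report.items():
    print('{}: {!r}'.format(encoding, outcome))
```

A lone 0xA0 byte is not valid UTF-8, but it decodes to a non-breaking space (U+00A0) in Windows-1252. If the page does turn out to be in such an encoding, one thing to try is passing the raw bytes (r.content) to BeautifulSoup with the from_encoding argument instead of r.text.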
