
BeautifulSoup can't find all tags

My goal is to get the number of specific tags from the links I want to scrape. I have manually inspected the number of tags on each page, but my code can't find all of them.

I've tried different parsers like "html.parser", "html5lib" and "lxml", but the problem occurs every time.

My code:

from bs4 import BeautifulSoup
from selenium import webdriver
urls = ["http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4#mbt:2-400$t&0=1",
"http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4#mbt:2-400$t&0=1",
"http://www.basket.fi/sarjat/ottelu/?game_id=3502491&season_id=93783&league_id=4#mbt:2-400$t&0=1",
"http://www.basket.fi/sarjat/ottelu/?game_id=3502451&season_id=93783&league_id=4#mbt:2-400$t&0=1",
"http://www.basket.fi/sarjat/ottelu/?game_id=3502395&season_id=93783&league_id=4#mbt:2-400$t&0=1",
"http://www.basket.fi/sarjat/ottelu/?game_id=3502407&season_id=93783&league_id=4#mbt:2-400$t&0=1"]

for url in urls:
    browser = webdriver.PhantomJS()
    browser.get(url)
    table = BeautifulSoup(browser.page_source, 'lxml')
    print(len(table.find_all("tr", {"class": ["row1", "row2"]})))
    browser.quit()  # close the browser so each iteration doesn't leak a process

Output:

88
87
86
66
86
59

Goal output:

88
86
87
87
86
83

I basically just added a delay to your code. This makes the program wait until the webpage is fully loaded and ready for parsing with BS4.

Also note that my output is different from your goal output. But I double-checked the number of "tr" elements containing "row1" and "row2" on each URL, and my output appears to be accurate (perhaps the results on the website changed a bit after you posted the question).

Code:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

urls = ["http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4#mbt:2-400$t&0=1",
"http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4#mbt:2-400$t&0=1",
"http://www.basket.fi/sarjat/ottelu/?game_id=3502491&season_id=93783&league_id=4#mbt:2-400$t&0=1",
"http://www.basket.fi/sarjat/ottelu/?game_id=3502451&season_id=93783&league_id=4#mbt:2-400$t&0=1",
"http://www.basket.fi/sarjat/ottelu/?game_id=3502395&season_id=93783&league_id=4#mbt:2-400$t&0=1",
"http://www.basket.fi/sarjat/ottelu/?game_id=3502407&season_id=93783&league_id=4#mbt:2-400$t&0=1"]

for url in urls:
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(10)  # wait for the JavaScript-rendered table to finish loading
    table = BeautifulSoup(driver.page_source, 'lxml')
    print(len(table.find_all("tr", {"class": ["row1", "row2"]})))
    driver.quit()  # close the browser so each iteration doesn't leak a process

Output:

88
87
86
87
86
83
