I'm trying to scrape rank data from BGG.
The basic structure of the HTML looks like this:
<table class="collection_table">
  <tbody>
    <tr></tr>
    <tr id="row_"></tr>
    <tr id="row_"></tr>
    <tr id="row_"></tr>
    <tr id="row_"></tr>
    <!--snip-->
    <tr id="row_"></tr>
    <tr id="row_"></tr>
    <tr id="row_"></tr>
  </tbody>
</table>
Note that every row except the first (a header) has the same id, and no extra data to mark it as a unique row.
My (current) code is as follows:
def bgg_scrape_rank_page(browser, bgg_data):
    time.sleep(1)
    table = browser.find_element_by_xpath("//table[@class='collection_table']/tbody")
    row = table.find_element_by_xpath("//tr[@id='row_']")
    while row:
        rank = row.find_element_by_xpath("//td[1]").text
        game_name = row.find_element_by_xpath("//td[3]/div[2]/a").text
        game_page = row.find_element_by_xpath("//td[3]/div[2]/a").get_attribute("href")
        print rank, game_name, game_page
        row = row.find_element_by_xpath("//following-sibling::tr")
I have also tried iterating using
rows = browser.find_elements_by_xpath("/tr[@id='row_']")
for row in rows:
    rank = row.find_element_by_xpath("//td[1]").text
    game_name = row.find_element_by_xpath("//td[3]/div[2]/a").text
    game_page = row.find_element_by_xpath("//td[3]/div[2]/a").get_attribute("href")
    print rank, game_name, game_page
The problem is that, no matter what I try, only the first row ever gets printed. I just get row after row of:
1 Pandemic Legacy: Season 1 https://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1
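The problem is in your XPath: you need to start each expression with a dot, as in .//, so that it is evaluated relative to the element you call it on. An expression that starts with just // always searches from the document root (<html>), regardless of which element you call it on, so every lookup returns the first match in the whole page. So try: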
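import time
from selenium.common.exceptions import NoSuchElementException

def bgg_scrape_rank_page(browser, bgg_data):
    time.sleep(1)
    table = browser.find_element_by_xpath("//table[@class='collection_table']/tbody")
    # The leading dot makes each search relative to the element it is called on
    row = table.find_element_by_xpath(".//tr[@id='row_']")
    while row:
        rank = row.find_element_by_xpath(".//td[1]").text
        game_name = row.find_element_by_xpath(".//td[3]/div[2]/a").text
        game_page = row.find_element_by_xpath(".//td[3]/div[2]/a").get_attribute("href")
        print rank, game_name, game_page
        try:
            # Take only the immediately following row; stop after the last one
            row = row.find_element_by_xpath("./following-sibling::tr[1]")
        except NoSuchElementException:
            row = None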
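If you want to check the relative-path idea without launching a browser, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree instead of Selenium. The sample markup, the game names, and the scrape_rows helper are all made up for illustration; only the table layout follows the question. It is written for Python 3, unlike the Python 2 snippets above. Note that ElementTree supports only a subset of XPath and does not accept expressions starting with // on an element, so this shows only the working, context-relative form.

```python
import xml.etree.ElementTree as ET

# Made-up sample data that mimics the table structure from the question.
SAMPLE = """
<table class="collection_table">
  <tbody>
    <tr><th>Rank</th><th>Thumb</th><th>Title</th></tr>
    <tr id="row_">
      <td>1</td><td></td>
      <td><div></div><div><a href="/boardgame/1/game-a">Game A</a></div></td>
    </tr>
    <tr id="row_">
      <td>2</td><td></td>
      <td><div></div><div><a href="/boardgame/2/game-b">Game B</a></div></td>
    </tr>
  </tbody>
</table>
"""


def scrape_rows(markup):
    """Return (rank, name, href) for every data row (hypothetical helper)."""
    root = ET.fromstring(markup)
    results = []
    # Select every data row; the header row has no id, so it is skipped.
    for row in root.findall(".//tr[@id='row_']"):
        # Paths starting with "./" are evaluated relative to this row only.
        rank = row.find("./td[1]").text
        link = row.find("./td[3]/div[2]/a")
        results.append((rank, link.text, link.get("href")))
    return results


if __name__ == "__main__":
    for rank, name, href in scrape_rows(SAMPLE):
        print(rank, name, href)
```

The same shape should carry over to Selenium: collect the rows once with find_elements_by_xpath(".//tr[@id='row_']") on the table element, then loop over them with a plain for loop and use dot-prefixed paths inside each row. That also avoids having to step through following-sibling at all.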