
How can I extract URL links from the IGN website?

I am trying to extract the URLs of the reviews on this webpage, http://uk.ign.com/games/reviews, and then open the top 5 in separate tabs.

Right now, I have attempted different selections to try to pick up the right data, but none of them returns anything. I can't get as far as extracting the URLs of each review in the list, let alone opening the first 5 in separate tabs.

I am using Python 3 with Python IDE

Here is my code:

import webbrowser, bs4, requests, re

webPage = requests.get("http://uk.ign.com/games/reviews", headers={'User-Agent': 'Mozilla/5.0'})

webPage.raise_for_status()

webPage = bs4.BeautifulSoup(webPage.text, "html.parser")

# Trying different selections to extract the right part of the page
webLinks = webPage.select(".item-title")
webLinks2 = webPage.select("h3")
webLinks3 = webPage.select("div item-title")

print(type(webLinks))
print(type(webLinks2))
print(type(webLinks3))
# I think this is where I've gone wrong. These are all returning empty lists.
# What am I doing wrong?


lenLinks = min(5, len(webLinks))
for i in range(lenLinks):
    webbrowser.open('http://uk.ign.com/' + webLinks[i].get('href'))

Using bs4's BeautifulSoup and the soup object it returns (which you have as webPage), you can call:

webLinks = webPage.find_all('a')
# Returns a list of <a> tags, for example:
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

find_all returns a list of elements matching a tag name (in your case, a). These are the HTML elements; to get the links you need to go a step further. You can access an HTML element's attributes (in your case, the href) like you would a dict:

# href=True keeps only the <a> tags that actually have an href attribute
for a in webPage.find_all('a', href=True):
    print("Found the URL:", a['href'])
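To narrow this down to the review links only, one option is a CSS selector that targets the anchors inside the title elements rather than the title elements themselves. The following is a minimal sketch, not a confirmed fix for the live page: the SAMPLE_HTML markup, the ".item-title a[href]" selector, and the extract_links helper are all assumptions made for illustration, and the real IGN page may use different markup or load its list with JavaScript.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the review list; the real page may differ.
SAMPLE_HTML = """
<div class="item-title"><h3><a href="/games/alpha/review">Alpha</a></h3></div>
<div class="item-title"><h3><a href="/games/beta/review">Beta</a></h3></div>
<div class="item-title"><h3>No link here</h3></div>
"""

BASE_URL = "http://uk.ign.com/"


def extract_links(html, selector=".item-title a[href]", base=BASE_URL):
    """Return absolute URLs for every anchor matched by the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns the <a> tags themselves, and the [href] part of the
    # selector guarantees each one has an href attribute.
    return [urljoin(base, a["href"]) for a in soup.select(selector)]


links = extract_links(SAMPLE_HTML)
print(links)
```

Selecting the anchor matters because calling .get('href') on a container such as a div or h3 returns None; the href lives on the nested a tag.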

See "BeautifulSoup getting href" for more details, or, of course, the docs.
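Once you have a list of href values, opening the first five in tabs could look roughly like the sketch below. It uses only the standard library; the open_top_links helper and its opener parameter are hypothetical names introduced here, and the base URL is taken from the question. urljoin is used instead of string concatenation so that a leading slash in the href does not produce a double slash.

```python
import webbrowser
from urllib.parse import urljoin


def open_top_links(hrefs, base="http://uk.ign.com/", limit=5,
                   opener=webbrowser.open_new_tab):
    """Open up to `limit` unique links in browser tabs; return the URLs opened."""
    opened = []
    for href in hrefs:
        if len(opened) >= limit:
            break
        # Resolve relative paths such as "/games/foo" against the site root.
        url = urljoin(base, href)
        if url not in opened:  # skip links that appear more than once
            opener(url)
            opened.append(url)
    return opened


# Example call (opens real browser tabs, so it is left commented out):
# open_top_links([a["href"] for a in webPage.find_all("a", href=True)])
```

Passing the opener as a parameter is a design choice that lets you swap in something like a list's append method to check which URLs would be opened without launching a browser.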

P.S. Python variable names are typically written in snake_case rather than camelCase :)
