I'm trying to practice using BeautifulSoup. I am trying to pull the image address of football player images from this website: https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652
When I 'inspect' the code, the section that has the img src is below:
<div class="dataBild">
<img src="https://tmssl.akamaized.net//images/portrait/header/195652-1456301478.jpg?lm=1456301501" title="Jordon Ibe" alt="Jordon Ibe" class="">
<div class="bildquelle"><span title="imago">imago</span></div>
</div>
So I was thinking that I could just use BeautifulSoup to find the div with class = "dataBild", as this is unique.
# Import the Libraries that I need
import urllib3
import certifi
from bs4 import BeautifulSoup
# Specify the URL
url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url)
#Parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(response.data, "html.parser")
player_img = soup.find_all('div', {'class':'dataBild'})
print(player_img)
This runs, but it doesn't output anything. So I checked by just running print(soup):
# Import the Libraries that I need
import urllib3
import certifi
from bs4 import BeautifulSoup
# Specify the URL
url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url)
#Parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(response.data, "html.parser")
print(soup)
This outputs
<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr/><center>nginx</center>
</body>
</html>
So it is obviously not pulling the HTML from the webpage. Why is this happening? And is my logic of looking for the div with class = "dataBild" sound?
The site seems to check whether the User-Agent header of the request is valid.
So you need to add the header like this:
import urllib3
import certifi
url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url, headers={'User-Agent': 'Mozilla/5.0'})
print(response.status)
This prints 200. If you remove the headers, you get 404.
Any non-empty User-Agent value (after trimming whitespace) seems to work.
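As for the second part of the question, looking up the div by its class seems sound, provided the case matches the markup exactly ("dataBild", not "DataBild"). Below is a minimal sketch that runs the lookup against the HTML fragment quoted in the question rather than the live page, so it assumes the site still serves that structure. The helper names extract_portrait_url and fetch_html are hypothetical and only there for illustration; fetch_html reuses the User-Agent header from the snippet above.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question (the player's portrait block).
SAMPLE_HTML = """
<div class="dataBild">
<img src="https://tmssl.akamaized.net//images/portrait/header/195652-1456301478.jpg?lm=1456301501" title="Jordon Ibe" alt="Jordon Ibe" class="">
<div class="bildquelle"><span title="imago">imago</span></div>
</div>
"""


def extract_portrait_url(html):
    """Return the src of the first <img> inside div.dataBild, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Class values are matched case-sensitively, so "DataBild" would not match.
    container = soup.find("div", class_="dataBild")
    if container is None:
        return None
    img = container.find("img")
    return img.get("src") if img is not None else None


def fetch_html(url):
    """Fetch a page with a User-Agent header set (hypothetical helper)."""
    # Imported here so the parsing helper can be used without these packages.
    import certifi
    import urllib3

    http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())
    response = http.request("GET", url, headers={"User-Agent": "Mozilla/5.0"})
    if response.status != 200:
        # Fail loudly instead of parsing an error page, as happened in the question.
        raise RuntimeError(f"Unexpected HTTP status: {response.status}")
    return response.data


if __name__ == "__main__":
    print(extract_portrait_url(SAMPLE_HTML))
```

To try it against the real page, you would pass the result of fetch_html(url) to extract_portrait_url. Checking response.status before parsing avoids silently handing the 404 page to BeautifulSoup.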