
urlopen not getting all the data from web (python)

I am trying to download pictures from a site. I figured out that the reason I can't find the picture URL lies right at the beginning of the code.

My problem is that urlopen downloads different HTML from what I get in the browser.

The site is here. When I look at the HTML in the browser, I can see this part:

HTML in browser

<a href="#" data-trigger="cmg-rotate-big">
            <img src="/image/product/eca412b9-9484-4046-8bee-8400fde1d5fe/?width=400" alt="" data-cm-index="0" style="width: 400px; height: 400px; margin-left: 0px; opacity: 1;">
            <img src="/image/product/014a128e-fa7b-4817-9d76-7bdf296de8de/?width=400" alt="" data-cm-index="1" style="width: 0px; height: 400px; margin-left: 200px; opacity: 0.5;">
          </a>

But with this code:

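import urllib2  # Python 2 module; Python 3 uses urllib.request instead
from bs4 import BeautifulSoup

text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text, "html.parser")
print(soup)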

the same part is only:

<a data-trigger="cmg-rotate-big" href="#">
<img alt="" data-cm-index="0" src=""/>
<img alt="" data-cm-index="1" src=""/>
</a>

So I can't extract the SRC of the image because it's missing. Where is the problem, please?

Thank you!

The src value is in there. No need to simulate JavaScript.

import requests
import bs4

url = 'https://ceskamincovna.cz/stribrna-mince-na-kolech---skoda-felicia-proof-1493-11549-d/'

# Fetch the page HTML.
response = requests.get(url)

soup = bs4.BeautifulSoup(response.text, 'html.parser')
for img in soup.find_all('img'):
    # Use .get() so an <img> tag without a src attribute does not raise KeyError.
    src = img.get('src', '')
    if '/image/product/' in src:
        print(src)

Output:

/image/product/eca412b9-9484-4046-8bee-8400fde1d5fe/?width=250
/image/product/014a128e-fa7b-4817-9d76-7bdf296de8de/?width=250
/image/product/0ec5b392-0f8a-4013-a448-a1b82578c008/?width=250
/image/product/9bc26462-5f11-4994-be6e-fcde1d97c5f3/?width=250
/image/product/7da1f235-f322-4a57-b0ca-07964f0a7d37/?width=250
/image/product/bd781b17-8482-4a4f-80f3-5fa55b9bc4c1/?width=250
/image/product/f5d4ade9-cac0-4c15-a935-da125b408da1/?width=250
/image/product/f4d6fb41-af72-4510-a70c-0a9893656e93/?width=250
/image/product/6136afe7-7444-42cd-858b-af66ca4ca6de/?width=140
/image/product/a459eb25-dd12-446a-9517-341d128c9571/?width=140
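
The paths printed above are relative to the site root, so they have to be joined with the page URL before the pictures can be downloaded. A minimal sketch, assuming the standard-library `urllib.parse.urljoin` is acceptable here; the helper name `to_absolute` is hypothetical and not part of the original answer:

```python
from urllib.parse import urljoin

PAGE_URL = 'https://ceskamincovna.cz/stribrna-mince-na-kolech---skoda-felicia-proof-1493-11549-d/'


def to_absolute(page_url, srcs):
    """Resolve each (possibly relative) src value against the page URL."""
    return [urljoin(page_url, src) for src in srcs]


# Two of the paths from the output above, used as sample input.
srcs = [
    '/image/product/eca412b9-9484-4046-8bee-8400fde1d5fe/?width=250',
    '/image/product/014a128e-fa7b-4817-9d76-7bdf296de8de/?width=250',
]
for full_url in to_absolute(PAGE_URL, srcs):
    print(full_url)
```

Each resulting URL could then be fetched with `requests.get(full_url).content` and written to a file opened in binary mode.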

If you want width=400:

import requests
import bs4

url = 'https://ceskamincovna.cz/stribrna-mince-na-kolech---skoda-felicia-proof-1493-11549-d/'

# Fetch the page HTML.
response = requests.get(url)

soup = bs4.BeautifulSoup(response.text, 'html.parser')
for img in soup.find_all('img'):
    # Use .get() so an <img> tag without a src attribute does not raise KeyError.
    src = img.get('src', '')
    if '/image/product/' in src:
        # Drop the existing width value and request width=400 instead.
        print(src.split('?width=')[0] + '?width=400')

Output:

/image/product/eca412b9-9484-4046-8bee-8400fde1d5fe/?width=400
/image/product/014a128e-fa7b-4817-9d76-7bdf296de8de/?width=400
/image/product/0ec5b392-0f8a-4013-a448-a1b82578c008/?width=400
/image/product/9bc26462-5f11-4994-be6e-fcde1d97c5f3/?width=400
/image/product/7da1f235-f322-4a57-b0ca-07964f0a7d37/?width=400
/image/product/bd781b17-8482-4a4f-80f3-5fa55b9bc4c1/?width=400
/image/product/f5d4ade9-cac0-4c15-a935-da125b408da1/?width=400
/image/product/f4d6fb41-af72-4510-a70c-0a9893656e93/?width=400
/image/product/6136afe7-7444-42cd-858b-af66ca4ca6de/?width=400
/image/product/a459eb25-dd12-446a-9517-341d128c9571/?width=400
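
Splitting on `'?width='` works for the paths shown, but it assumes `width` is the first and only query parameter. As a hedged alternative, the sketch below rewrites the parameter with the standard-library `urllib.parse` functions; the helper name `with_width` is hypothetical and not part of the original answer:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_width(src, width):
    """Return src with its width query parameter set to the given value."""
    parts = urlsplit(src)
    # Keep every other query parameter and drop any existing width.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != 'width']
    query.append(('width', str(width)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_width('/image/product/eca412b9-9484-4046-8bee-8400fde1d5fe/?width=250', 400))
```

This also handles a path that has no query string at all, in which case the `width` parameter is simply appended.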
