
BeautifulSoup doesn't crawl all images on page

I want to crawl Google Images in Google Colab to train a TensorFlow model.

The script:

import requests
import matplotlib.pyplot as plt
import tensorflow as tf
from bs4 import BeautifulSoup
from tensorflow.keras.preprocessing import image

doc = BeautifulSoup(requests.get("https://www.google.com/search?q=dog&tbm=isch").text, "html.parser")
all_imgs = [[image.load_img(tf.keras.utils.get_file("images",e.attrs["src"]),target_size=[90,90]),e.attrs["src"][-9:]] for e in doc.select("img")[1:]]
for e in all_imgs:
    plt.figure()
    plt.imshow(e[0])
    plt.title(e[1])
    plt.show()

Explanation:
doc is the parsed HTML.
all_imgs is a list of the form [[img, end_of_img_link], [img, end_of_img_link], ...].

The problem is that the output is the same image over and over again.
Even if I change the URL to crawl images of cats (search?q=cat), it still shows the same image of a dog!

What is the problem?

EDIT: I figured out that the list itself consists of many copies of the same image, so the problem is not in matplotlib; I suspected BeautifulSoup at this point.

I found the solution:
All the pictures were saved under the same name, "images".
tf.keras.utils.get_file caches downloads by file name, so once the first picture had been saved it was never overwritten, and every later call returned that same cached file. I had to run !rm -rf /root/.keras/datasets/ in the notebook to delete the folder with the saved image.

From now on I am going to give each saved picture a different name.
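
Here is a minimal sketch of that naming fix, not a definitive implementation. The helpers cache_name and load_images are hypothetical names that are not part of the original script, and the sketch assumes a TensorFlow version that provides tf.keras.utils.load_img. Each URL is hashed into its own file name, so get_file stores every picture separately instead of reusing the first one.

```python
import hashlib


def cache_name(url: str) -> str:
    """Return a file name that is stable for one URL and distinct across URLs."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"img_{digest[:16]}"


def load_images(urls, size=(90, 90)):
    """Download each URL under its own cache name and load it as an image."""
    # Imported lazily so cache_name can be used without TensorFlow installed.
    import tensorflow as tf

    loaded = []
    for url in urls:
        # A distinct fname per URL stops get_file from returning an
        # earlier download that happens to share the same name.
        path = tf.keras.utils.get_file(fname=cache_name(url), origin=url)
        img = tf.keras.utils.load_img(path, target_size=size)
        loaded.append([img, url[-9:]])
    return loaded
```

With the script above, this could be called as load_images([e.attrs["src"] for e in doc.select("img")[1:]]) in place of the all_imgs list comprehension. Files cached earlier under the name "images" stay in /root/.keras/datasets/ but are no longer picked up.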
