BeautifulSoup does not scrape all images on the page

Question

I want to crawl Google Images in Google Colab to train a TensorFlow model.

The script:

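# Imports assumed; the original post omitted them
import requests
import matplotlib.pyplot as plt
import tensorflow as tf
from bs4 import BeautifulSoup
from tensorflow.keras.preprocessing import image

# Fetch and parse the Google Images result page
doc = BeautifulSoup(requests.get("https://www.google.com/search?q=dog&tbm=isch").text, "html.parser")
# Download every <img> except the first and keep the last 9 characters of its URL as a label
all_imgs = [[image.load_img(tf.keras.utils.get_file("images",e.attrs["src"]),target_size=[90,90]),e.attrs["src"][-9:]] for e in doc.select("img")[1:]]
for e in all_imgs:
    plt.figure()
    plt.imshow(e[0])
    plt.title(e[1])
    plt.show()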

Explanation:
doc is the parsed HTML.
all_imgs is a list of the form [[img, end_of_img_link], [img, end_of_img_link], ...].

The problem is that the output is the same image over and over again.
Even if I change the URL to crawl images of cats, e.g. search?q=cat, it still shows the same image of a dog!

What is the problem?

EDIT: I figured out that the list consists of many copies of the same image, so the problem lies with BeautifulSoup, not matplotlib.

Answer 1

I found the solution:
All the pictures were saved under the same name, "images".
Because of that, the first downloaded picture was never overwritten: get_file found the existing file in its cache and reused it for every later URL. I had to run rm -rf /root/.keras/datasets/ in the notebook to delete the folder with the saved image.
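
The same cleanup can be done from Python instead of the shell. This is only a minimal sketch: clear_keras_cache is a hypothetical helper name, and it assumes get_file is using its default cache location (~/.keras/datasets, which is /root/.keras/datasets on Colab).

```python
import shutil
from pathlib import Path


def clear_keras_cache(cache_dir=None):
    """Delete the folder where get_file keeps its downloads.

    Returns True if a folder was removed, False if there was nothing to delete.
    """
    # Assumption: the default get_file cache location is in use.
    if cache_dir is None:
        cache_dir = Path.home() / ".keras" / "datasets"
    cache_dir = Path(cache_dir)
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False
```

Calling clear_keras_cache() before re-running the scraping cell forces the next get_file call to download again rather than reuse the stale "images" file.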

Now I am going to give each saved picture a different name.
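
One way to do that, as a rough sketch rather than the exact code I ended up with: derive the cache name from the image URL, so every distinct URL gets its own file. cache_name_for and load_images are hypothetical helper names; the 90x90 target size is taken from the question.

```python
import hashlib


def cache_name_for(url):
    """Return a cache file name that differs for every distinct URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return "img_" + digest


def load_images(urls, target_size=(90, 90)):
    """Download each URL under its own cache name and load it as an image."""
    # Imported here so the naming helper works even without TensorFlow installed.
    import tensorflow as tf

    images = []
    for url in urls:
        # A unique fname means get_file cannot reuse another image's file.
        path = tf.keras.utils.get_file(cache_name_for(url), origin=url)
        images.append(
            tf.keras.preprocessing.image.load_img(path, target_size=target_size)
        )
    return images
```

With the question's variables this would be called as load_images([e.attrs["src"] for e in doc.select("img")[1:]]). Hashing the URL, rather than numbering the files, also keeps images from a later search (cats instead of dogs) from landing on file names already used by an earlier one.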

Solution 1
0, accepted, 2020-07-09 13:23:49
