
BeautifulSoup doesn't crawl all images on page

I want to crawl Google Images in Google Colab to train a TF model.

import requests
from bs4 import BeautifulSoup
import tensorflow as tf
from tensorflow.keras.preprocessing import image
import matplotlib.pyplot as plt

doc = BeautifulSoup(requests.get("https://www.google.com/search?q=dog&tbm=isch").text, "html.parser")
# Skip the first <img> (the Google logo); load each thumbnail and keep
# the last 9 characters of its URL as a label.
all_imgs = [[image.load_img(tf.keras.utils.get_file("images", e.attrs["src"]), target_size=[90, 90]), e.attrs["src"][-9:]]
            for e in doc.select("img")[1:]]
for e in all_imgs:
    plt.figure()
    plt.imshow(e[0])
    plt.title(e[1])
    plt.show()

Explanation:
doc is the parsed HTML.
all_imgs is a list of pairs with the format [[img, end_of_img_link], [img, end_of_img_link], ...].

The problem is that the output is the same image over and over again.
Even if I change the URL to crawl images of cats (search?q=cat), it still shows the same image of a dog!

What is the problem?

EDIT: I figured out that the list consists of many copies of the same image, so the problem lies with the scraping/downloading, not with matplotlib.

I found the solution:
All the pictures were saved under the same name, "images".
Because tf.keras.utils.get_file caches by filename and returns the existing file if the name is already present, the first downloaded picture never got overwritten. I had to run !rm -rf /root/.keras/datasets/ in the notebook to delete the folder with the cached image.

Now I am going to give each saved picture a different name.
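The collision can be reproduced without TensorFlow or a network connection. The sketch below uses a hypothetical helper, get_file_like (an assumption for illustration, not the real tf.keras.utils.get_file), that mimics the relevant caching behavior: if a file with the given name already exists in the cache directory, it is returned as-is and nothing new is downloaded.

```python
import os
import tempfile

def get_file_like(fname, fetch, cache_dir):
    """Mimic the caching behavior described above: a file that already
    exists under this name in cache_dir is returned unchanged, and
    fetch() is never called again for it."""
    path = os.path.join(cache_dir, fname)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(fetch())
    return path

cache = tempfile.mkdtemp()

# First call stores the "dog" bytes; the second call reuses the cached
# file, so the "cat" bytes are silently discarded -- the same collision
# as in the question, where every image was named "images".
p1 = get_file_like("images", lambda: b"dog", cache)
p2 = get_file_like("images", lambda: b"cat", cache)
assert open(p2, "rb").read() == b"dog"

# A unique name per image avoids the collision entirely:
p3 = get_file_like("images_1", lambda: b"cat", cache)
assert open(p3, "rb").read() == b"cat"
```

This is why giving each saved picture a different name (for example, numbering them with enumerate) fixes the bug without having to delete the cache folder each run.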
