How can I crawl content in Python that the website seems to block from being crawled?

I am a complete beginner in Python and I am trying to crawl a website with BeautifulSoup to collect product information.

pr_url = soup.find_all("li", {"class": "_3FUicfNemK"})
pr_url

Everything is the same as in other BeautifulSoup crawling code I have seen. The problem is that nothing is returned, even though I used the right tag and class name.

So my guess is that the host blocks the product area from being crawled, because every element outside that area can be crawled.

Do you know how to crawl this blocked area? The site url is: https://shopping.naver.com/living/homeliving/category?menu=10004487&sort=POPULARITY

Thank you for your comments in advance!

Notice how when you first load the page the outline of the site loads but the products take a while to load up? This is because the site is requesting the rest of the content to load in the background. This content isn't blocked, it's simply loaded later :)

2 options here imo..

1) Figure out the background request and pass that into beautifulsoup. Using the Chrome dev tools network tab I can see that the request for the products is...

https://shopping.naver.com/v1/products?_nc_=1583366400000&subVertical=HOME_LIVING&page=1&pageSize=10&sort=POPULARITY&filter=ALL&displayType=CATEGORY_HOME&includeZzim=true&includeViewCount=true&includeStoreCardInfo=true&includeStockQuantity=false&includeBrandInfo=false&includeBrandLogoImage=false&includeRepresentativeReview=false&includeListCardAttribute=false&includeRanking=false&includeRankingByMenus=false&includeStoreCategoryName=false&menuId=10004487&standardSizeKeys=&standardColorKeys=&attributeValueIds=&attributeValueIdsAll=&certifications=&menuIds=&includeStoreInfoWithHighRatingReview=false

You should be able to work out which query string parameters to tweak (page, pageSize, sort and so on) and request that URL directly.
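A rough sketch of option 1, in case it helps. It rebuilds the request above from a handful of its query parameters and fetches it with the standard library. The assumptions are mine and untested against the live site: that this subset of parameters is enough, that the endpoint answers with JSON rather than HTML, and that a browser-like User-Agent header is accepted. `build_products_url` and `fetch_products` are just names I made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the request seen in the dev tools network tab.
BASE_URL = "https://shopping.naver.com/v1/products"


def build_products_url(page=1, page_size=10):
    """Build the background request URL for one page of products.

    Only a subset of the captured parameters is sent (assumption: the
    remaining ones are optional and fall back to server defaults).
    """
    params = {
        "subVertical": "HOME_LIVING",
        "page": page,
        "pageSize": page_size,
        "sort": "POPULARITY",
        "filter": "ALL",
        "displayType": "CATEGORY_HOME",
        "menuId": "10004487",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_products(page=1, page_size=10):
    """Fetch one page and decode it, assuming the response body is JSON."""
    request = Request(
        build_products_url(page, page_size),
        # Assumption: a browser-like User-Agent is enough for the server.
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the URL only; call fetch_products() to hit the network.
    print(build_products_url(page=2))
```

If the response really is JSON, you would not need BeautifulSoup for this route at all; print the decoded result once and look at its keys to find the product fields.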

2) Use a tool like Selenium which interacts with the browser and will execute any JavaScript for you so you don't have to figure out that side of things. If you're new to this stuff, might be less of a learning curve into web tech here.
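For option 2, something along these lines might work. It is only a sketch: it assumes Selenium 4 and a local Chrome install, and it reuses the `_3FUicfNemK` class from the question, which looks auto-generated and may well have changed since. `css_selector_for` and `scrape_product_texts` are hypothetical helper names, not part of Selenium.

```python
PAGE_URL = (
    "https://shopping.naver.com/living/homeliving/category"
    "?menu=10004487&sort=POPULARITY"
)


def css_selector_for(tag, class_name):
    """Turn a tag and class name into a CSS selector such as 'li.some-class'."""
    return f"{tag}.{class_name}"


def scrape_product_texts(url=PAGE_URL, class_name="_3FUicfNemK", timeout=15):
    """Open the page in Chrome, wait for the product items, return their text."""
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        selector = css_selector_for("li", class_name)
        # Wait until the JavaScript-loaded product items are in the DOM.
        items = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
        )
        return [item.text for item in items]
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


if __name__ == "__main__":
    for text in scrape_product_texts():
        print(text)
```

The explicit wait is the important part: it gives the background request time to finish before you read the elements. If you would rather keep your BeautifulSoup code, you can pass `driver.page_source` to it after the wait instead of reading `item.text`.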
