
Scraping information from multiple URLs that are different in structure

I would like to scrape multiple URLs, but they are of a different nature, such as different company websites with different HTML structure. Is there a way to do it without coming up with customised code for each URL?

I understand that I can put multiple URLs into a list and loop over them.

I fear not, but I am not an expert :-)

I could imagine that it depends on the complexity of the structures. If you want to find the text "Test" on every website, I could imagine that soup.body.findAll(text='Test') would be a starting point. Note that a plain string only matches text nodes that are exactly "Test"; to get every occurrence, including "Test" inside a longer sentence, pass a compiled regular expression such as re.compile('Test') instead.
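To make the difference between the exact match and the substring match concrete, here is a minimal sketch. The HTML snippet is made up for illustration, and it uses find_all with the string argument, which is the newer spelling of findAll(text=...) in BeautifulSoup 4.

```python
import re

from bs4 import BeautifulSoup

# Made-up page, only for illustration.
html = """
<html><body>
  <h1>Test</h1>
  <p>This is a Test page.</p>
  <a href="/login">Login</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# A plain string matches only text nodes that are exactly "Test".
exact = soup.body.find_all(string="Test")

# A compiled regex matches every text node that contains "Test".
partial = soup.body.find_all(string=re.compile("Test"))

# Each hit is a text node; .parent gives the tag that holds it.
parent_tags = [node.parent.name for node in partial]
```

Because the search works on text rather than on tags or classes, the same lines can be run against pages whose markup has nothing in common.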

I assume you know how to loop through a list, so you would loop through the list of URLs and check for each one whether the searched string occurs (or maybe you are looking for something else, e.g. an "apply" button or a "login" link?).
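The loop described above could be sketched roughly like this. The helper names fetch_html, page_mentions and check_urls are hypothetical, and I am assuming a case-insensitive keyword search is what you want; treat it as a starting point, not a finished scraper.

```python
import re
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download one page with the standard library and return it as text."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def page_mentions(html, keyword):
    """Return True if any text node contains keyword (case-insensitive)."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return soup.find(string=pattern) is not None


def check_urls(urls, keyword, fetch=fetch_html):
    """Map each URL to True/False, or None if the page could not be loaded."""
    results = {}
    for url in urls:
        try:
            results[url] = page_mentions(fetch(url), keyword)
        except (OSError, ValueError):
            # Network errors and malformed URLs should not stop the loop.
            results[url] = None
    return results
```

You would call it as check_urls(list_of_urls, "login"), where list_of_urls is your own list. The fetch parameter is only there so the download step can be swapped out, for example for a cached copy or for a browser-driven fetch on pages that build their content with JavaScript.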

All the best,

