
How does a web crawler work?

Using some basic website scraping, I am trying to prepare a database for price comparison which will make searching easier for users. Now, I have several questions:

Should I use file_get_contents() or curl to get the contents of the required web page?

$link = "http://xyz.com";
$res55 = curl_init($link);
curl_setopt($res55, CURLOPT_RETURNTRANSFER, 1); // return the page body instead of printing it
curl_setopt($res55, CURLOPT_FOLLOWLOCATION, true); // follow any HTTP redirects
$result = curl_exec($res55);
curl_close($res55);
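
For comparison, file_get_contents() can also fetch a remote page when allow_url_fopen is enabled in php.ini; a minimal sketch (the URL and the 30-second timeout are just placeholders):

// Requires allow_url_fopen = On in php.ini; otherwise stick with cURL.
$context = stream_context_create([
    'http' => [
        'follow_location' => 1,  // follow redirects, like CURLOPT_FOLLOWLOCATION
        'timeout'         => 30, // give up reading after 30 seconds
    ],
]);
$result = file_get_contents("http://xyz.com", false, $context);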

Further, every time I crawl a web page, I fetch a lot of links to visit next. This may take a long time (days, if you crawl big websites like eBay). In that case, my PHP code will time out. What would be an automated way to do this? Is there a way to prevent PHP from timing out by making changes on the server, or is there another solution?

So, in that case my PHP code will time out and it won't keep running that long.

Are you doing this in the code that's driving your web page? That is, when someone makes a request, are you crawling right then and there to build the response? If so, then yes, there is definitely a better way.

If you have a list of the sites you need to crawl, you can set up a scheduled job (using cron, for example) to run a command-line application (not a web page) to crawl the sites. At that point you should parse out the data you're looking for and store it in a database. Your site would then just need to point to that database.
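
A rough sketch of that setup, assuming cron and a MySQL database (the DSN, table name, URL list and parse_product() helper are all placeholders):

#!/usr/bin/env php
<?php
// Run by cron, e.g. every night at 02:00:
//   0 2 * * * /usr/bin/php /path/to/crawl.php

$pdo  = new PDO('mysql:host=localhost;dbname=prices', 'user', 'pass');
$stmt = $pdo->prepare('REPLACE INTO products (url, name, price, checked_at) VALUES (?, ?, ?, NOW())');

$sites = ['http://example.com/item/1', 'http://example.com/item/2'];

foreach ($sites as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        continue; // skip pages that failed to download
    }

    list($name, $price) = parse_product($html); // your scraping logic goes here
    $stmt->execute([$url, $name, $price]);
}

function parse_product($html)
{
    // Placeholder: extract the product name and price with a regex or DOM parser.
    return ['unknown', 0.0];
}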

This is an improvement for two reasons:

  1. Performance
  2. Code Design

Performance: In a request/response system like a website, you want to minimize I/O bottlenecks. The response should take as little time as possible. So you want to avoid in-line work wherever possible. By offloading this process to something outside the context of the website and using a local database, you turn a series of external service calls (slow) into a single local database call (much faster).
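
In other words, the request path only touches the local database. A minimal sketch of what the page itself might do (the table, column names and $searchTerm variable are assumptions):

// The web page no longer scrapes anything; it just queries the local copy.
$pdo  = new PDO('mysql:host=localhost;dbname=prices', 'user', 'pass');
$stmt = $pdo->prepare('SELECT url, price FROM products WHERE name LIKE ? ORDER BY price');
$stmt->execute(['%' . $searchTerm . '%']); // $searchTerm: the user's query (hypothetical)
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);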

Code Design: Separation of concerns. This setup modularizes your code a little bit more. You have one module which is in charge of fetching the data and another which is in charge of displaying the data. Neither of them should ever need to know or care about how the other accomplishes its tasks. So if you ever need to replace one (such as finding a better scraping method) you won't also need to change the other.

curl is the better option; file_get_contents is for reading files on your server.

You can set the timeout in curl to 0 in order to have an unlimited timeout. You have to set the timeout in Apache too.
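
For example, a sketch using a cURL handle $ch:

// A timeout of 0 tells cURL never to time the transfer out (this is also its default).
curl_setopt($ch, CURLOPT_TIMEOUT, 0);
// Apache's own Timeout directive still applies; see the httpd.conf note further down.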

I recommend curl for reading website contents.

To avoid the PHP script timing out, you can use set_time_limit. The advantage of this is that you can reset the timeout before every server connection, since calling the function restarts the countdown. No time limit is applied if 0 is passed as the parameter.

Alternatively, you can set the timeout via the php configuration setting max_execution_time, but note that this will apply to all PHP scripts rather than just the crawler.
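
A sketch of both approaches, assuming $urls holds the pages to visit (the 30-second figure is just an example):

// Option 1: reset the limit inside the crawl loop; each call restarts the countdown.
foreach ($urls as $url) {
    set_time_limit(30); // give each page fetch up to 30 more seconds
    // ... fetch and parse $url ...
}

// Option 1b: disable the limit for this script only.
set_time_limit(0);

// Option 2: raise it globally in php.ini (affects every PHP script):
//   max_execution_time = 0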

http://php.net/manual/en/function.set-time-limit.php

I'd opt for cURL since you get much more flexibility, and you can enable compression and HTTP keep-alive with cURL.
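
For instance, a sketch of enabling those two features on a cURL handle:

$ch = curl_init("http://example.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// An empty string asks for every encoding cURL supports (gzip, deflate, ...)
// and has the response decompressed automatically.
curl_setopt($ch, CURLOPT_ENCODING, "");
$page1 = curl_exec($ch);

// Reusing the same handle keeps the HTTP connection alive between requests.
curl_setopt($ch, CURLOPT_URL, "http://example.com/next-page");
$page2 = curl_exec($ch);
curl_close($ch);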

But why re-invent the wheel? Check out PHPCrawl. It uses sockets (fsockopen) to download URLs, but supports multiple crawlers at once (on Linux) and has a lot of crawling options that probably meet all of your needs. It takes care of timeouts for you as well, and good examples are available for basic crawlers.
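
A rough sketch modelled on PHPCrawl's quickstart pattern (treat the exact class and method names as assumptions and check the project's own documentation):

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Called once per downloaded page; $DocInfo->url and $DocInfo->source
        // hold the URL and the page content.
        echo $DocInfo->url . "\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("www.example.com");
$crawler->addContentTypeReceiveRule("#text/html#"); // only hand HTML pages to the callback
$crawler->go();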

You could reinvent the wheel here, but why not look at a framework like PHPCrawl or Sphider? (Although the latter may not be exactly what you're looking for.)

Per the documentation, file_get_contents works best for reading files on the server, so I strongly suggest you use curl instead. As for fixing any timeout issues, set_time_limit is the option you want. set_time_limit(0) should prevent your script from timing out.

You'll want to set the timeout in Apache as well, however. Edit your httpd.conf and change the line that reads Timeout to Timeout 0 for an infinite timeout.
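
That is, in httpd.conf (followed by an Apache restart):

# httpd.conf: setting Timeout to 0 removes Apache's request timeout
Timeout 0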
