
Will a crawler work on this server configuration?

I am building a small crawler as a hobby project. All I want to do is crawl around a million pages and store them in a database. (Yes, it will be updated from time to time, but the number of entries at any particular time will be 1 million only.) This is mainly to learn how these things work.

I want to code it in PHP/MySQL. I don't want any search capabilities, as I don't have the server resources to provide that. All I want is to be able to run a few SQL queries against the database myself.

In the database I won't be storing any page text (I want that stored in separate txt files - I don't know if that will be feasible). Only the title, link and some other information will be stored. So basically, if I run a query and it gives me some results, I can pull the text data from those files. Something along these lines is what I have in mind (see the sketch below).
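A minimal sketch of that split, assuming a `pages` table and one text file per page named after a hash of its URL (shown in Python just to illustrate the idea; the connection details, table name and columns are placeholders, and a PHP version would follow the same shape):

```python
import hashlib
import pathlib

import mysql.connector  # pip install mysql-connector-python

# Placeholder connection details and table layout -- adjust to your own setup.
conn = mysql.connector.connect(
    host="localhost", user="crawler", password="secret", database="crawl"
)
cur = conn.cursor()
cur.execute(
    """CREATE TABLE IF NOT EXISTS pages (
           id INT AUTO_INCREMENT PRIMARY KEY,
           url VARCHAR(2048) NOT NULL,
           title VARCHAR(512),
           fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
       )"""
)

def store_page(url: str, title: str, body_text: str) -> None:
    """Keep only metadata in MySQL; write the page text to its own .txt file."""
    cur.execute("INSERT INTO pages (url, title) VALUES (%s, %s)", (url, title))
    conn.commit()
    # Name the file after a hash of the URL so a query result maps back to a file.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    pathlib.Path("pages").mkdir(exist_ok=True)
    pathlib.Path("pages", f"{digest}.txt").write_text(body_text, encoding="utf-8")
```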

I would like to know if this design will be feasible in the following environment.

I will be purchasing a VPS from Linode (512 MB RAM). (I can't go for a dedicated server, and shared hosts won't let me do this.)

My question: will it be able to sustain this big database (1 million rows), with the ability to run queries in batch mode when required?

Any kind of suggestion is welcome. Any other hosting options will also be appreciated.

Writing a web crawler from scratch is a considerable undertaking, at least if you wish to crawl millions of pages. I know this from personal experience working on the Heritrix web crawler.

You may benefit from reading the "Overview of the crawler" chapter from the Heritrix developer guide. That chapter covers the high-level design and should help you figure out the basic components of a crawler.

Simply put, this boils down to 'crawl state' and 'processing'. The crawl state is the URLs you've seen, the URLs you've crawled, and so on, while processing covers fetching a URL and the subsequent work of extracting links, saving the downloaded data, etc. Multiple processing threads are typically run in parallel.
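As a rough illustration of that split (a deliberately single-threaded sketch, not how Heritrix itself is structured; the seed URL and the link-extraction regex are just placeholders, and a real crawler would add politeness delays, robots.txt checks and proper HTML parsing):

```python
import collections
import re
import urllib.parse
import urllib.request

seen = set()                    # crawl state: every URL discovered so far
frontier = collections.deque()  # crawl state: URLs still waiting to be fetched

def enqueue(url: str) -> None:
    if url not in seen:
        seen.add(url)
        frontier.append(url)

def process(url: str) -> None:
    """Processing: fetch one URL, extract links, feed them back into the frontier."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return  # a real crawler would record the failure and possibly retry
    for href in re.findall(r'href="([^"]+)"', html):
        enqueue(urllib.parse.urljoin(url, href))

enqueue("https://example.com/")       # placeholder seed URL
while frontier and len(seen) < 100:   # small cap so the sketch terminates
    process(frontier.popleft())
```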

You could also try Scrapy. It's fast, and it'll work fine on a Linode 512M server, but it's written in Python.
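For example, a minimal Scrapy spider that records the URL and title of every page it reaches could look something like this (the spider name, start URL and selectors are placeholders, not anything specific to your project):

```python
import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["https://example.com/"]  # placeholder seed

    def parse(self, response):
        # Yield only metadata; the page body could be written to a .txt file here.
        yield {"url": response.url, "title": response.css("title::text").get()}
        # Follow every link found on the page.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

You could run a sketch like this with `scrapy runspider spider.py -o pages.jl` and then load the resulting metadata into MySQL separately.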

