
Best architecture for crawling websites in an application

I am working on a product that needs a feature to crawl a user-given URL and publish a separate mobile site for that user. During the crawl we want to capture the site's content, CSS, images and scripts. The product also handles other activities, such as scheduling marketing campaigns. What I want to ask is:

What is the best practice, and which open-source frameworks are suited to this task?

Should we do the crawling in the application itself, or should there be a separate server for this activity (given the load it generates)? Keep in mind that we have about 100,000 (1 lakh) users visiting every month and publishing their mobile sites from the website, and around 1-2k concurrent users.

The application is built in Java on the Java EE platform, using Spring and Hibernate as the server-side technologies.

We used Berkeley DB Java Edition for managing an off-heap queue of links to crawl and for distinguishing links pending download from ones already downloaded.
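A minimal sketch of such a frontier store, assuming a simple layout where the URL is the key and a PENDING/DONE status string is the value (the class name and status values are illustrative, not from the original project):

```java
import com.sleepycat.je.*;
import java.io.File;
import java.nio.charset.StandardCharsets;

public class CrawlFrontier {
    private final Environment env;
    private final Database db;

    public CrawlFrontier(File dir) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        env = new Environment(dir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        db = env.openDatabase(null, "frontier", dbConfig);
    }

    /** Add a URL only if it has never been seen; "PENDING" marks it as not yet downloaded. */
    public boolean addIfUnseen(String url) {
        DatabaseEntry key = new DatabaseEntry(url.getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry("PENDING".getBytes(StandardCharsets.UTF_8));
        return db.putNoOverwrite(null, key, value) == OperationStatus.SUCCESS;
    }

    /** Mark a URL as downloaded. */
    public void markDone(String url) {
        DatabaseEntry key = new DatabaseEntry(url.getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry("DONE".getBytes(StandardCharsets.UTF_8));
        db.put(null, key, value);
    }

    public void close() {
        db.close();
        env.close();
    }
}
```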

For parsing HTML from the wild internet, TagSoup is the best choice.
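Since TagSoup exposes a standard SAX XMLReader, link extraction can be done with a plain ContentHandler. The helper below is a sketch under that assumption; the class name and the choice of attributes to collect are illustrative:

```java
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {
    /** Parse possibly malformed HTML with TagSoup and collect href/src attribute values. */
    public static List<String> extractLinks(String html) throws Exception {
        List<String> links = new ArrayList<>();
        Parser parser = new Parser();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                String href = atts.getValue("href");
                String src = atts.getValue("src");
                if (href != null) links.add(href);
                if (src != null) links.add(src);
            }
        });
        parser.parse(new InputSource(new StringReader(html)));
        return links;
    }
}
```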

Batik is the choice for parsing CSS and SVG.
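As a rough sketch, Batik can load an SVG resource into a DOM tree via its SAXSVGDocumentFactory, after which embedded links and image references can be walked; note that the factory's package differs between Batik versions (older releases use org.apache.batik.dom.svg):

```java
import org.apache.batik.anim.dom.SAXSVGDocumentFactory;
import org.apache.batik.util.XMLResourceDescriptor;
import org.w3c.dom.svg.SVGDocument;

import java.io.IOException;

public class SvgLoader {
    /** Load an SVG document so the crawler can traverse it for referenced resources. */
    public static SVGDocument load(String uri) throws IOException {
        String parserClass = XMLResourceDescriptor.getXMLParserClassName();
        SAXSVGDocumentFactory factory = new SAXSVGDocumentFactory(parserClass);
        return factory.createSVGDocument(uri);
    }
}
```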

PDFBox is awesome and allows you to extract links from PDFs.
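A sketch using the PDFBox 2.x API, under the assumption that the links of interest are link annotations whose action is a URI action (the class name is illustrative):

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PdfLinkExtractor {
    /** Collect the URI targets of link annotations on every page of a PDF. */
    public static List<String> extractLinks(File pdf) throws IOException {
        List<String> links = new ArrayList<>();
        try (PDDocument doc = PDDocument.load(pdf)) {
            for (PDPage page : doc.getPages()) {
                for (PDAnnotation annotation : page.getAnnotations()) {
                    if (annotation instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annotation;
                        if (link.getAction() instanceof PDActionURI) {
                            links.add(((PDActionURI) link.getAction()).getURI());
                        }
                    }
                }
            }
        }
        return links;
    }
}
```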

The Quartz scheduler is an industry-proven choice for event scheduling.
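A small sketch of scheduling a recurring crawl with Quartz 2.x; the job class, identifiers, target URL and nightly cron timing are all illustrative assumptions, not part of the original setup:

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class CrawlScheduling {
    /** Hypothetical job that kicks off a crawl; real work would call into the crawler service. */
    public static class CrawlJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            String siteUrl = context.getMergedJobDataMap().getString("siteUrl");
            System.out.println("Starting crawl of " + siteUrl);
        }
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(CrawlJob.class)
                .withIdentity("crawl-example", "crawlers")
                .usingJobData("siteUrl", "http://example.com")
                .build();

        // Re-crawl every night at 02:00 server time (the schedule is an illustrative choice).
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("nightly-crawl", "crawlers")
                .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(2, 0))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```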

And yes, you will need one or more servers for crawling, one server for aggregating results and scheduling tasks, and perhaps another server for the web front end and back end.

This approach worked well for http://linktiger.com and http://pagefreezer.com.

I'm implementing a crawling project based on the Selenium HtmlUnit Driver. I think it's really the best Java framework for automating a headless browser.
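As a rough sketch, fetching a page and reading its anchors with HtmlUnitDriver might look like this (the target URL and the JavaScript setting are illustrative):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

import java.util.List;

public class HeadlessCrawl {
    public static void main(String[] args) {
        // Enable JavaScript so dynamically inserted links are visible to the crawler.
        WebDriver driver = new HtmlUnitDriver(true);
        try {
            driver.get("http://example.com");
            List<WebElement> anchors = driver.findElements(By.tagName("a"));
            for (WebElement a : anchors) {
                System.out.println(a.getAttribute("href"));
            }
        } finally {
            driver.quit();
        }
    }
}
```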
