
Best architecture for crawling websites in an application

I am working on a product that needs a feature to crawl a user-given URL and publish a separate mobile site for that user. During the crawl we want to capture the site's content, CSS, images and scripts. The product also handles other activities, such as scheduling marketing campaigns. What I want to ask is:

What is the best practice, and which open-source frameworks are suited to this task?

Should we do the crawling in the application itself, or should there be a separate server for this activity (given the load it generates)? Keep in mind that we have about 100,000 (1 lakh) users visiting every month and publishing their mobile sites from the website, and around 1-2k concurrent users.

The application is built in Java on the Java EE platform, using Spring and Hibernate as the server-side technologies.

We used Berkeley DB Java Edition for managing an off-heap queue of links to crawl and for distinguishing links pending download from ones already downloaded.
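A minimal sketch of such a frontier store, assuming a simple layout where the URL is the key and a PENDING/DONE status string is the value (the class name and status values are illustrative, not from the original project):

```java
import com.sleepycat.je.*;
import java.io.File;
import java.nio.charset.StandardCharsets;

public class CrawlFrontier {
    private final Environment env;
    private final Database db;

    public CrawlFrontier(File dir) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        env = new Environment(dir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        db = env.openDatabase(null, "frontier", dbConfig);
    }

    /** Add a URL only if it has never been seen; "PENDING" marks it as not yet downloaded. */
    public boolean addIfUnseen(String url) {
        DatabaseEntry key = new DatabaseEntry(url.getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry("PENDING".getBytes(StandardCharsets.UTF_8));
        return db.putNoOverwrite(null, key, value) == OperationStatus.SUCCESS;
    }

    /** Mark a URL as downloaded. */
    public void markDone(String url) {
        DatabaseEntry key = new DatabaseEntry(url.getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry("DONE".getBytes(StandardCharsets.UTF_8));
        db.put(null, key, value);
    }

    public void close() {
        db.close();
        env.close();
    }
}
```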

For parsing HTML from the wild internet, TagSoup is the best choice.
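Since TagSoup exposes a standard SAX XMLReader, link extraction can be done with a plain ContentHandler. The helper below is a sketch under that assumption; the class name and the choice of attributes to collect are illustrative:

```java
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {
    /** Parse possibly malformed HTML with TagSoup and collect href/src attribute values. */
    public static List<String> extractLinks(String html) throws Exception {
        List<String> links = new ArrayList<>();
        Parser parser = new Parser();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                String href = atts.getValue("href");
                String src = atts.getValue("src");
                if (href != null) links.add(href);
                if (src != null) links.add(src);
            }
        });
        parser.parse(new InputSource(new StringReader(html)));
        return links;
    }
}
```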

Batik is the choice for parsing CSS and SVG.
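As a rough sketch, Batik can load an SVG resource into a DOM tree via its SAXSVGDocumentFactory, after which embedded links and image references can be walked; note that the factory's package differs between Batik versions (older releases use org.apache.batik.dom.svg):

```java
import org.apache.batik.anim.dom.SAXSVGDocumentFactory;
import org.apache.batik.util.XMLResourceDescriptor;
import org.w3c.dom.svg.SVGDocument;

import java.io.IOException;

public class SvgLoader {
    /** Load an SVG document so the crawler can traverse it for referenced resources. */
    public static SVGDocument load(String uri) throws IOException {
        String parserClass = XMLResourceDescriptor.getXMLParserClassName();
        SAXSVGDocumentFactory factory = new SAXSVGDocumentFactory(parserClass);
        return factory.createSVGDocument(uri);
    }
}
```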

PDFBox is awesome and allows you to extract links from PDFs.
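A sketch using the PDFBox 2.x API, under the assumption that the links of interest are link annotations whose action is a URI action (the class name is illustrative):

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PdfLinkExtractor {
    /** Collect the URI targets of link annotations on every page of a PDF. */
    public static List<String> extractLinks(File pdf) throws IOException {
        List<String> links = new ArrayList<>();
        try (PDDocument doc = PDDocument.load(pdf)) {
            for (PDPage page : doc.getPages()) {
                for (PDAnnotation annotation : page.getAnnotations()) {
                    if (annotation instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annotation;
                        if (link.getAction() instanceof PDActionURI) {
                            links.add(((PDActionURI) link.getAction()).getURI());
                        }
                    }
                }
            }
        }
        return links;
    }
}
```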

The Quartz scheduler is an industry-proven choice for event scheduling.
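A small sketch of scheduling a recurring crawl with Quartz 2.x; the job class, identifiers, target URL and nightly cron timing are all illustrative assumptions, not part of the original setup:

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class CrawlScheduling {
    /** Hypothetical job that kicks off a crawl; real work would call into the crawler service. */
    public static class CrawlJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            String siteUrl = context.getMergedJobDataMap().getString("siteUrl");
            System.out.println("Starting crawl of " + siteUrl);
        }
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(CrawlJob.class)
                .withIdentity("crawl-example", "crawlers")
                .usingJobData("siteUrl", "http://example.com")
                .build();

        // Re-crawl every night at 02:00 server time (the schedule is an illustrative choice).
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("nightly-crawl", "crawlers")
                .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(2, 0))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```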

And yes, you will need one or more servers for crawling, one server for aggregating results and scheduling tasks, and perhaps another server for the web front end and back end.

This approach worked well for http://linktiger.com and http://pagefreezer.com.

I'm implementing a crawling project based on the Selenium HtmlUnit Driver. I think it's really the best Java framework for automating a headless browser.
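As a rough sketch, fetching a page and reading its anchors with HtmlUnitDriver might look like this (the target URL and the JavaScript setting are illustrative):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

import java.util.List;

public class HeadlessCrawl {
    public static void main(String[] args) {
        // Enable JavaScript so dynamically inserted links are visible to the crawler.
        WebDriver driver = new HtmlUnitDriver(true);
        try {
            driver.get("http://example.com");
            List<WebElement> anchors = driver.findElements(By.tagName("a"));
            for (WebElement a : anchors) {
                System.out.println(a.getAttribute("href"));
            }
        } finally {
            driver.quit();
        }
    }
}
```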
