I am trying to crawl an authenticated web page with Apache Nutch 2.3. To support POST-based (form) authentication, I made the following changes in the Nutch 2.3 source code:
$NUTCH_HOME/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
HttpFormAuthConfigurer.java
HttpFormAuthentication.java
I then set the "plugin.includes" property in $NUTCH_HOME/conf/nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>
My credentials are configured in $NUTCH_HOME/conf/httpclient-auth.xml, where I have commented out the "additionalPostHeaders" and "removedFormFields" tags.
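For reference, the overall shape of my httpclient-auth.xml follows the form-auth patch's configuration format. The field names below are taken from the POST parameters visible in the log; the loginUrl and loginFormId values are illustrative placeholders, not my exact settings:

```xml
<auth-configuration>
  <credentials authMethod="formAuth"
               loginUrl="http://example.com:8080/abc/login.jsp"
               loginFormId="loginForm">
    <loginPostData>
      <!-- form fields sent with the login POST (values from the log) -->
      <field name="j_username" value="admin"/>
      <field name="j_password" value="admin1"/>
      <field name="Login"      value="Login"/>
    </loginPostData>
    <!-- <additionalPostHeaders> ... </additionalPostHeaders> -->
    <!-- <removedFormFields> ... </removedFormFields> -->
  </credentials>
</auth-configuration>
```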
Along with Nutch 2.3 I am using HBase 0.94.14 and Elasticsearch 1.5.1. I used several existing links and discussions to do the configuration.
When I try to crawl the authenticated page, Nutch hits only the login URL, and that login URL is all that appears in the Elasticsearch result. The content of the authenticated page is not crawled.
hadoop.log:
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Sending 'POST' request to URL : example.com:8080/abc/login.jsp
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Post parameters : [name=Login, value=Login, name=j_password, value=admin1, name=j_username, value=admin]
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response Code : 200
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response headers : User-Agent: My Nutch Spider/Nutch-2.3
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response headers : Connection: keep-alive
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response headers : Content-Type: application/x-www-form-urlencoded
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Cookie:
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : User-Agent: My Nutch Spider/Nutch-2.3
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept-Charset: utf-8,ISO-8859-1;q=0.7,*;q=0.7
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept: text/html,application/xml;q=0.9,application/xhtmlxml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept-Encoding: x-gzip, gzip, deflate
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Host: example.com:8080
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Cookie: $Version=0; JSESSIONID=j0ojgF0cRVImcD8sYco75F60jr7ooESeVotAGYXLsv-4CqP8!-231988951; $Path=/abc
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Content-Length: 51
2015-05-05 02:55:26,399 DEBUG httpclient.HttpFormAuthentication - login post result: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2015-05-05 02:55:26,534 INFO regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2015-05-05 02:55:26,540 INFO fetcher.FetcherJob - -finishing thread FetcherThread0, activeThreads=0
2015-05-05 02:55:29,449 INFO fetcher.FetcherJob - 0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
2015-05-05 02:55:29,450 INFO fetcher.FetcherJob - -activeThreads=0
2015-05-05 02:55:29,456 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2015-05-05 02:55:30,004 INFO fetcher.FetcherJob - FetcherJob: finished at 2015-05-05 02:55:30, time elapsed: 00:00:07
2015-05-05 02:55:31,655 INFO parse.ParserJob - ParserJob: starting at 2015-05-05 02:55:31
2015-05-05 02:55:31,658 INFO parse.ParserJob - ParserJob: resuming: false
2015-05-05 02:55:31,658 INFO parse.ParserJob - ParserJob: forced reparse: false
2015-05-05 02:55:31,658 INFO parse.ParserJob - ParserJob: parsing all
2015-05-05 02:55:32,501 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2015-05-05 02:55:33,268 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-05-05 02:55:33,680 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2015-05-05 02:55:33,760 INFO parse.ParserJob - Parsing example.com:8080/abc/AUTHENTICATED_PAGE
2015-05-05 02:55:33,767 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2015-05-05 02:55:34,517 INFO parse.ParserJob - ParserJob: success
2015-05-05 02:55:34,518 INFO parse.ParserJob - ParserJob: finished at 2015-05-05 02:55:34, time elapsed: 00:00:02
2015-05-05 02:55:36,148 INFO crawl.DbUpdaterJob - DbUpdaterJob: starting at 2015-05-05 02:55:36
2015-05-05 02:55:36,148 INFO crawl.DbUpdaterJob - DbUpdaterJob: updatinging all
2015-05-05 02:55:37,375 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-05-05 02:55:37,995 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2015-05-05 02:55:37,995 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2015-05-05 02:55:37,995 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2015-05-05 02:55:38,016 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2015-05-05 02:55:38,599 INFO crawl.DbUpdaterJob - DbUpdaterJob: finished at 2015-05-05 02:55:38, time elapsed: 00:00:02
2015-05-05 02:55:40,376 INFO indexer.IndexingJob - IndexingJob: starting
2015-05-05 02:55:40,602 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2015-05-05 02:55:40,602 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2015-05-05 02:55:40,605 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2015-05-05 02:55:40,605 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2015-05-05 02:55:41,475 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-05-05 02:55:41,912 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2015-05-05 02:55:42,044 INFO elasticsearch.plugins - [Sigmar] loaded [], sites []
2015-05-05 02:55:43,284 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2015-05-05 02:55:43,284 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2015-05-05 02:55:43,284 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2015-05-05 02:55:43,284 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2015-05-05 02:55:43,354 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
2015-05-05 02:55:43,354 INFO elastic.ElasticIndexWriter - Processing to finalize last execute
2015-05-05 02:55:43,388 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2015-05-05 02:55:43,746 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2015-05-05 02:55:43,746 INFO indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9300)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
2015-05-05 02:55:43,752 INFO elasticsearch.plugins - [Valinor] loaded [], sites []
2015-05-05 02:55:43,804 INFO indexer.IndexingJob - IndexingJob: done.
How can I crawl the authenticated page? I suspect there is a problem with the session creation. Please help. Thanks in advance.
I solved the problem. All you have to do is make a slight change in httpclient-auth.xml:
<credentials authMethod="formAuth"
loginUrl="http://localhost:44444/Account/Login.aspx"
loginFormId="ctl01"
loginRedirect="true">
<loginPostData>
In loginUrl, enter the URL that is responsible for handling the POST request, NOT the login page URL itself. You can find it by analyzing the Network tab in Chrome's developer tools (Ctrl+Shift+I opens them). Don't forget to tick 'Preserve log'.
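To illustrate with hypothetical markup: suppose the login page contains a form like this (the j_username/j_password fields in the question's log suggest standard Java EE container form authentication, where j_security_check is the conventional action target):

```html
<!-- login.jsp as rendered in the browser (hypothetical example) -->
<form action="j_security_check" method="post">
  <input type="text"     name="j_username"/>
  <input type="password" name="j_password"/>
  <input type="submit"   name="Login" value="Login"/>
</form>
```

Then loginUrl must be the form's action attribute resolved against the page, e.g. http://example.com:8080/abc/j_security_check, not http://example.com:8080/abc/login.jsp. Posting to the page URL just returns the login form again with HTTP 200, which is why only the login URL showed up in the index.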
Happy Coding. :)