I am trying to crawl an authenticated web page with Apache Nutch 2.3. To support POST-based (form) authentication, I made the following changes in the Nutch 2.3 source code:
$NUTCH_HOME/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
HttpFormAuthConfigurer.java
HttpFormAuthentication.java
I then set the "plugin.includes" property in $NUTCH_HOME/conf/nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>
My credentials are configured in $NUTCH_HOME/conf/httpclient-auth.xml, where I have commented out the "additionalPostHeaders" and "removedFormFields" tags.
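For reference, the overall shape of my httpclient-auth.xml follows the form-auth patch's configuration format. The field names below are taken from the POST parameters visible in the log; the loginUrl and loginFormId values are illustrative placeholders, not my exact settings:

```xml
<auth-configuration>
  <credentials authMethod="formAuth"
               loginUrl="http://example.com:8080/abc/login.jsp"
               loginFormId="loginForm">
    <loginPostData>
      <!-- form fields sent with the login POST (values from the log) -->
      <field name="j_username" value="admin"/>
      <field name="j_password" value="admin1"/>
      <field name="Login"      value="Login"/>
    </loginPostData>
    <!-- <additionalPostHeaders> ... </additionalPostHeaders> -->
    <!-- <removedFormFields> ... </removedFormFields> -->
  </credentials>
</auth-configuration>
```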
Along with Nutch 2.3 I am using HBase 0.94.14 and Elasticsearch 1.5.1. I used several existing links and discussions to do the configuration.
When I try to crawl the authenticated page, Nutch hits only the login URL, and that login URL is all that appears in the Elasticsearch result. The content of the authenticated page is not crawled.
hadoop.log:
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Sending 'POST' request to URL : example.com:8080/abc/login.jsp
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Post parameters : [name=Login, value=Login, name=j_password, value=admin1, name=j_username, value=admin]
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response Code : 200
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response headers : User-Agent: My Nutch Spider/Nutch-2.3
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response headers : Connection: keep-alive
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-05-05 02:55:26,250 DEBUG httpclient.HttpFormAuthentication - Response headers : Content-Type: application/x-www-form-urlencoded
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Cookie:
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : User-Agent: My Nutch Spider/Nutch-2.3
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept-Charset: utf-8,ISO-8859-1;q=0.7,*;q=0.7
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept: text/html,application/xml;q=0.9,application/xhtmlxml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Accept-Encoding: x-gzip, gzip, deflate
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Host: example.com:8080
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Cookie: $Version=0; JSESSIONID=j0ojgF0cRVImcD8sYco75F60jr7ooESeVotAGYXLsv-4CqP8!-231988951; $Path=/abc
2015-05-05 02:55:26,251 DEBUG httpclient.HttpFormAuthentication - Response headers : Content-Length: 51
2015-05-05 02:55:26,399 DEBUG httpclient.HttpFormAuthentication - login post result: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2015-05-05 02:55:26,534 INFO regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2015-05-05 02:55:26,540 INFO fetcher.FetcherJob - -finishing thread FetcherThread0, activeThreads=0
2015-05-05 02:55:29,449 INFO fetcher.FetcherJob - 0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
2015-05-05 02:55:29,450 INFO fetcher.FetcherJob - -activeThreads=0
2015-05-05 02:55:29,456 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2015-05-05 02:55:30,004 INFO fetcher.FetcherJob - FetcherJob: finished at 2015-05-05 02:55:30, time elapsed: 00:00:07
2015-05-05 02:55:31,655 INFO parse.ParserJob - ParserJob: starting at 2015-05-05 02:55:31
2015-05-05 02:55:31,658 INFO parse.ParserJob - ParserJob: resuming: false
2015-05-05 02:55:31,658 INFO parse.ParserJob - ParserJob: forced reparse: false
2015-05-05 02:55:31,658 INFO parse.ParserJob - ParserJob: parsing all
2015-05-05 02:55:32,501 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2015-05-05 02:55:33,268 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-05-05 02:55:33,680 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2015-05-05 02:55:33,760 INFO parse.ParserJob - Parsing example.com:8080/abc/AUTHENTICATED_PAGE
2015-05-05 02:55:33,767 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2015-05-05 02:55:34,517 INFO parse.ParserJob - ParserJob: success
2015-05-05 02:55:34,518 INFO parse.ParserJob - ParserJob: finished at 2015-05-05 02:55:34, time elapsed: 00:00:02
2015-05-05 02:55:36,148 INFO crawl.DbUpdaterJob - DbUpdaterJob: starting at 2015-05-05 02:55:36
2015-05-05 02:55:36,148 INFO crawl.DbUpdaterJob - DbUpdaterJob: updatinging all
2015-05-05 02:55:37,375 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-05-05 02:55:37,995 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2015-05-05 02:55:37,995 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2015-05-05 02:55:37,995 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2015-05-05 02:55:38,016 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2015-05-05 02:55:38,599 INFO crawl.DbUpdaterJob - DbUpdaterJob: finished at 2015-05-05 02:55:38, time elapsed: 00:00:02
2015-05-05 02:55:40,376 INFO indexer.IndexingJob - IndexingJob: starting
2015-05-05 02:55:40,602 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2015-05-05 02:55:40,602 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2015-05-05 02:55:40,605 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2015-05-05 02:55:40,605 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2015-05-05 02:55:41,475 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-05-05 02:55:41,912 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2015-05-05 02:55:42,044 INFO elasticsearch.plugins - [Sigmar] loaded [], sites []
2015-05-05 02:55:43,284 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2015-05-05 02:55:43,284 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2015-05-05 02:55:43,284 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2015-05-05 02:55:43,284 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2015-05-05 02:55:43,354 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
2015-05-05 02:55:43,354 INFO elastic.ElasticIndexWriter - Processing to finalize last execute
2015-05-05 02:55:43,388 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2015-05-05 02:55:43,746 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2015-05-05 02:55:43,746 INFO indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9300)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
2015-05-05 02:55:43,752 INFO elasticsearch.plugins - [Valinor] loaded [], sites []
2015-05-05 02:55:43,804 INFO indexer.IndexingJob - IndexingJob: done.
How can I crawl the authenticated page? I suspect there is a problem with the session creation. Please help. Thanks in advance.
I solved the problem. All you have to do is make a slight change in httpclient-auth.xml:
<credentials authMethod="formAuth"
loginUrl="http://localhost:44444/Account/Login.aspx"
loginFormId="ctl01"
loginRedirect="true">
<loginPostData>
In loginUrl, enter the URL that is responsible for handling the POST request, NOT the login page URL itself. You can find it by analyzing the Network tab in Chrome's developer tools (Ctrl+Shift+I opens them). Don't forget to tick 'Preserve log'.
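To illustrate with hypothetical markup: suppose the login page contains a form like this (the j_username/j_password fields in the question's log suggest standard Java EE container form authentication, where j_security_check is the conventional action target):

```html
<!-- login.jsp as rendered in the browser (hypothetical example) -->
<form action="j_security_check" method="post">
  <input type="text"     name="j_username"/>
  <input type="password" name="j_password"/>
  <input type="submit"   name="Login" value="Login"/>
</form>
```

Then loginUrl must be the form's action attribute resolved against the page, e.g. http://example.com:8080/abc/j_security_check, not http://example.com:8080/abc/login.jsp. Posting to the page URL just returns the login form again with HTTP 200, which is why only the login URL showed up in the index.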
Happy Coding. :)