![](/img/trans.png)
[英]Crawler4j, Some urls are crawled without issue while others are not crawled at all
[英]In Heritrix crawler tool how to extract the contents from crawled urls
Heritrix工具的新手,現在我可以從www抓取網頁,現在要提取抓取的URL的內容。
請任何人幫助我。謝謝。
1.first download the file wget http://python.org/ftp/python/3.3.0/Python-3.3.0.tgz or higher version as root user.
2. change the directory to installed python
3. example /opt/python3.3/;
4. configure the files ./configure --prefix=/opt/python3.3
5.make
6. sudo make install
7. /opt/python3.3/bin/python3
8.opt/python3.3/bin/pyvenv ~/py33
9.source ~/py33/bin/activate
10. wget http://python-distribute.org/distribute_setup.py
11.python distribute_setup.py
12. easy_install pip
13. pip install bottle
14. pip install warcat
15. if successfully installed warcat then check whether your warcat is installed or not.
16. python3 -m warcat --help after enter then we can see some help commands like, list,concat,extract etc..
17.python3 -m warcat list example/at.warc.gz
this is worked for me ..enjoy
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.