I have a web page here that I need to crawl. It looks like this: www.abc.com/a/b/, and I know that under the /b directory there are some files with .html extensions that I need. I know that I have access to those .html files, but I have no access to www.abc.com/a/b/ itself. So, without knowing the .html file names, how can I crawl those .html pages?
You can't crawl webpages if you don't know how to get to them.
If I understand correctly, you want to reach pages inside a directory whose own index page is not accessible (because you get a 403).
Before you give up, you can try the following:
the link: operator in a search engine: link:www.abc.com/a/b/the_file_you_know_exists
the site: operator in a search engine: site:www.abc.com/a/b/
the Wayback Machine: http://web.archive.org/web/*/www.abc.com/a/b/
Memento Time Travel: http://timetravel.mementoweb.org/reconstruct/*/www.abc.com/a/b/
These may turn up pages from that website that still exist or that existed in the past.
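The Wayback Machine lookup can also be scripted. Below is a minimal sketch, assuming the Internet Archive's CDX search endpoint is reachable and has captures for the directory; the function names (build_cdx_query, extract_html_urls, list_archived_html) are made up for illustration, and www.abc.com/a/b/ is just the placeholder from the question.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Wayback Machine CDX search endpoint (lists captures matching a URL pattern).
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(prefix):
    """Build a CDX query URL for every capture whose URL starts with `prefix`."""
    params = {
        "url": prefix,
        "matchType": "prefix",  # match everything under the prefix
        "output": "json",       # JSON array of rows; the first row is a header
        "fl": "original",       # only return the originally captured URL
        "collapse": "urlkey",   # one row per distinct URL
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def extract_html_urls(rows):
    """Keep distinct URLs whose path ends in .html, skipping the header row."""
    urls = []
    for row in rows[1:]:
        url = row[0]
        path = url.split("?", 1)[0]  # ignore any query string
        if path.lower().endswith(".html") and url not in urls:
            urls.append(url)
    return urls


def list_archived_html(prefix, timeout=30):
    """Query the archive (network access required) and return .html URLs."""
    with urlopen(build_cdx_query(prefix), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    return extract_html_urls(rows)
```

Calling list_archived_html("www.abc.com/a/b/") would return whatever .html URLs the archive happens to have recorded under that directory, which you could then feed to your crawler. It finds nothing if the pages were never archived.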