
How do I use a crawler if I know the target web page and the file extension, but not the file names?

I have a web-page here that I need to crawl. It looks like this:

www.abc.com/a/b/

and I know that under the /b directory there are some files with the .html extension that I need. I have access to those .html files, but not to www.abc.com/a/b/ itself. So, without knowing the .html file names, how can I crawl those pages?

You can't crawl webpages if you don't know how to get to them.

If I understand correctly, you want to reach pages inside a directory whose index page is itself not accessible (because you get a 403).

Before you give up, you can try the following:

  • check if the main search engines link to the pages inside the directory that you seem to know about (if you know you have access to those .html files, you probably know at least one of them). A page that links to that file may also link to other files inside the same directory. For instance, in Google, use the link: operator (Google has since retired this operator, so it may return little or nothing):

link:www.abc.com/a/b/the_file_you_know_exists

  • check if the website is indexed in the main search engines. For instance, in Google, use the site: operator:

site:www.abc.com/a/b/

  • check if the website is archived in archive.org:

http://web.archive.org/web/*/www.abc.com/a/b/

  • check if you can find it in other web archives using memento:

http://timetravel.mementoweb.org/reconstruct/*/www.abc.com/a/b/

  • try to find other possible filenames such as index1.html, index_old.html, index.html_old, contact.html and so on. You could create a long list of the possible filenames to try but this also depends on what you know about the website.
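The filename-guessing idea from the last bullet can be sketched roughly as follows. This is a minimal illustration, not a complete tool: the word lists, the function names (`candidate_urls`, `exists`, `probe`), and the one-second delay are all assumptions made for the example, and it assumes the server answers HEAD requests honestly (some return 200 for everything, or reject HEAD entirely).

```python
import time
from itertools import product
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Hypothetical word lists; adapt them to what you know about the site.
STEMS = ["index", "index1", "index_old", "contact", "about"]
SUFFIXES = [".html", ".html_old"]


def candidate_urls(base_url, stems=STEMS, suffixes=SUFFIXES):
    """Return every base_url + stem + suffix combination, in order."""
    return [urljoin(base_url, stem + suffix)
            for stem, suffix in product(stems, suffixes)]


def exists(url, timeout=10):
    """Return True if a HEAD request for url ends in a 2xx response."""
    request = Request(url, method="HEAD",
                      headers={"User-Agent": "filename-probe-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers HTTP errors (403, 404, ...), DNS failures and timeouts.
        return False


def probe(base_url, delay=1.0):
    """Try each candidate in turn and return the ones that respond."""
    found = []
    for url in candidate_urls(base_url):
        if exists(url):
            found.append(url)
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return found
```

Calling `probe("http://www.abc.com/a/b/")` would return the guessed URLs that responded. Whether this is worthwhile depends on how predictable the site's naming is, and it is worth checking the site's robots.txt and terms before running it.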

This may give you possible pages from that website that still exist or existed in the past.
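For the archive.org check, the Wayback Machine also exposes a CDX query endpoint that can list captured URLs under a prefix, which saves clicking through the calendar view. The sketch below assumes that endpoint and its `matchType`, `output`, `fl`, and `collapse` parameters behave as documented by the Internet Archive; the helper names (`cdx_query_url`, `html_urls`, `archived_html_urls`) are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(prefix):
    """Build a CDX query listing captured URLs that start with prefix."""
    params = {
        "url": prefix,
        "matchType": "prefix",   # match everything under the prefix
        "output": "json",        # JSON array of rows, header row first
        "fl": "original",        # only return the original URL column
        "collapse": "urlkey",    # one row per distinct URL
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def html_urls(cdx_rows, extension=".html"):
    """Keep the distinct URLs whose path ends with extension.

    cdx_rows is the parsed JSON response; its first row is the header.
    """
    urls = []
    for row in cdx_rows[1:]:
        original = row[0]
        path = original.split("?", 1)[0]  # ignore any query string
        if path.endswith(extension) and original not in urls:
            urls.append(original)
    return urls


def archived_html_urls(prefix):
    """Fetch the CDX listing for prefix and return the .html URLs in it."""
    with urlopen(cdx_query_url(prefix), timeout=30) as response:
        body = response.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    return html_urls(rows)
```

`archived_html_urls("www.abc.com/a/b/")` would then give a list of .html URLs that were captured at some point. Some of them may no longer exist on the live site, so each one still needs to be requested to see whether it is reachable today.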

