
How do I use a crawler if I know the target web page and file extension but not the file name?

Original post: 2017-01-04 14:54:09. Tags: python, html, nginx, web-crawler

I have a web page that I need to crawl. Its address looks like this:

www.abc.com/a/b/

I know that under the /b directory there are some files with the .html extension that I need. I have access to those .html files, but I have no access to www.abc.com/a/b/ itself. So, without knowing the .html file names, how can I crawl those pages?

1 answer

You can't crawl webpages if you don't know how to get to them.

If I understand correctly, you want to reach pages inside a directory whose index page is itself inaccessible (because it returns a 403).

Before you give up, you can try the following:

Check whether the main search engines know of pages that link to a file inside the directory. If you know you have access to those .html files, you probably know the name of at least one of them, and a page that links to it may link to other files in the same directory as well. For instance, in Google, use the link: operator (Google has since deprecated this operator, so it may return few or no results):

link:www.abc.com/a/b/the_file_you_know_exists

Check whether the directory is indexed by the main search engines. For instance, in Google, use the site: operator:

site:www.abc.com/a/b/

Check whether the website is archived at archive.org:

http://web.archive.org/web/*/www.abc.com/a/b/

Check whether you can find it in other web archives using Memento:

http://timetravel.mementoweb.org/reconstruct/*/www.abc.com/a/b/

Try to guess other likely filenames, such as index1.html, index_old.html, index.html_old, contact.html and so on. You could build a long list of candidate filenames to try, but how useful that is depends on what you know about the website.
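The filename-guessing idea can be sketched in Python along these lines. This is only an illustration using the standard library: the stems, the suffixes and the helper names (candidate_urls, url_exists, find_existing) are assumptions made up for the example, and www.abc.com is the placeholder from the question.

```python
import itertools
import time
import urllib.request

BASE_URL = "http://www.abc.com/a/b/"  # placeholder directory from the question


def candidate_urls(base_url, stems, suffixes=("", "1", "_old"), extension=".html"):
    """Build guessed URLs such as index.html, index1.html, index_old.html."""
    return [
        f"{base_url}{stem}{suffix}{extension}"
        for stem, suffix in itertools.product(stems, suffixes)
    ]


def url_exists(url, timeout=10):
    """Return True if a HEAD request ends with a 2xx status (after redirects)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers HTTP errors (403, 404, ...), DNS failures and timeouts.
        return False


def find_existing(urls, delay_seconds=1.0):
    """Probe each guessed URL in turn, pausing between requests."""
    found = []
    for url in urls:
        if url_exists(url):
            found.append(url)
        time.sleep(delay_seconds)  # be polite to the server
    return found
```

Calling find_existing(candidate_urls(BASE_URL, ["index", "contact"])) would probe six guessed URLs and return the ones that answer successfully. Some servers reject HEAD requests, in which case a GET would be needed instead, and it is worth keeping the candidate list small so the probing does not look like an attack.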

These steps may turn up pages from that website that still exist, or that existed in the past.
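The archive.org check can also be automated. The sketch below assumes the Wayback Machine's CDX server endpoint (not mentioned in the answer itself) and asks it for every captured URL under the directory prefix, then keeps the ones whose path ends in .html. The function names are made up for the example, and www.abc.com is again the placeholder from the question.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Wayback Machine CDX server, queried by URL prefix.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(prefix, limit=1000):
    """Build a CDX query URL listing captures whose URL starts with prefix."""
    params = {
        "url": prefix,
        "matchType": "prefix",
        "output": "json",
        "fl": "original",      # only return the originally captured URL
        "collapse": "urlkey",  # one row per distinct URL
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def html_urls(rows, extension=".html"):
    """Filter decoded CDX JSON rows (header row first) by path extension."""
    found = []
    for row in rows[1:]:  # skip the header row
        original = row[0]
        path = urllib.parse.urlsplit(original).path
        if path.lower().endswith(extension) and original not in found:
            found.append(original)
    return found


def archived_html_urls(prefix):
    """Fetch the capture list for prefix and return the archived .html URLs."""
    with urllib.request.urlopen(build_cdx_query(prefix), timeout=30) as response:
        body = response.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    return html_urls(rows)
```

Calling archived_html_urls("www.abc.com/a/b/") would return the .html addresses the archive has seen under that directory, which can then be fed to an ordinary crawler. Whether anything comes back depends entirely on whether the site was ever archived.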




solution1 0 2017-01-14 18:08:50
