How to determine the underlying URL of text file download

Original 2011-02-20 22:47:51 2 2 javascript/ python/ url/ web-scraping

The page below lets you download a txt file. I'm interested in the first file in the txt section.

How do I get its URL? Once I have the URL I can pull the file myself. How can I get that URL with Python, without going through the JavaScript?

Today it's: volume.20110218.txt.

http://www.optionsclearing.com/webapps/trade-volume-download

2 answers

Your question is a bit vague. It sounds like you'd like to do something with the urllib2 and BeautifulSoup modules.

Fetch the HTML from the base URL with urllib2's functions, parse it with BeautifulSoup, and use the target (the value of the href attribute) of the (first TXT?) anchor tag in the table to open another connection and pull those contents. Then open your local file (or subprocess) and write the contents of the second fetch to it.
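The fetch, parse, follow steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python 3's standard library (urllib.request and html.parser) as a stand-in for urllib2 and BeautifulSoup, and it assumes the page exposes a plain anchor whose href ends in .txt, which may not hold if the link is generated by JavaScript. The names LinkCollector, first_txt_link and download_first_txt are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def first_txt_link(html, base_url):
    """Return the absolute URL of the first link ending in .txt, or None."""
    parser = LinkCollector()
    parser.feed(html)
    for href in parser.hrefs:
        if href.lower().endswith(".txt"):
            # Resolve relative links against the page they came from.
            return urljoin(base_url, href)
    return None


def download_first_txt(page_url, dest_path):
    """Fetch page_url, follow its first .txt link, save the body to dest_path."""
    with urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    target = first_txt_link(html, page_url)
    if target is None:
        raise ValueError("no plain .txt link found; it may be built by JavaScript")
    with urlopen(target) as resp, open(dest_path, "wb") as out:
        out.write(resp.read())
    return target
```

Calling download_first_txt(page_url, "volume.txt") would perform both fetches; first_txt_link can be tried on its own against saved HTML to check whether a usable link exists at all.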

The toughest part of using BeautifulSoup is to find the characteristics which uniquely identify the part of the content that you want to extract. Modern HTML is pretty ugly and tends to have lots of extraneous garbage embedded in it by the various tools and libraries which were used to generate it. (One tip: the word "class" is a Python reserved keyword as well as a common attribute in HTML. Thus you'll find it easiest to pass "class" attribute/pattern pairs to BeautifulSoup functions by wrapping them in a dictionary: {'class': some_pattern} rather than in the more common keyword=pattern form that's used for most other arguments).
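The keyword problem in the tip above is a property of Python itself, so it can be shown without BeautifulSoup. In this small sketch, find_tags is a hypothetical stand-in for a search function that accepts either an attribute dictionary or keyword arguments; it is not BeautifulSoup's API, and the tag data is invented.

```python
import keyword

# "class" is reserved, so it cannot be written as a keyword argument name.
print(keyword.iskeyword("class"))


def find_tags(tags, attrs=None, **kwargs):
    """Hypothetical search helper: return the tags whose attributes all match."""
    wanted = dict(attrs or {}, **kwargs)
    return [t for t in tags if all(t.get(k) == v for k, v in wanted.items())]


tags = [
    {"name": "a", "class": "txt", "href": "a.txt"},
    {"name": "a", "class": "csv", "href": "a.csv"},
]

# find_tags(tags, class="txt") would be a SyntaxError.
# Passing a dictionary avoids the reserved word entirely:
by_dict = find_tags(tags, {"class": "txt"})
# Unpacking a dictionary into keyword arguments also works:
by_unpack = find_tags(tags, **{"class": "txt"})
print(by_dict == by_unpack)
```

Later BeautifulSoup releases (bs4) also accept a class_ keyword for the same purpose, which sidesteps the reserved word in a different way.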

To handle the javascript you might want to read:

What's a good tool to screen-scrape with Javascript support?

It sounds like your best bet, currently, may be to set up the Java-based HTMLUnit package to serve as a gateway, then write your Python to connect to and control that. You might also try Selenium to control a real browser session and extract information from it via inter-process communication mechanisms.

The page uses JavaScript links to submit a hidden form in order to download the file. The form's hidden fields also seem to be filled in by JavaScript.

It seems they do this to make automated downloads harder. If they don't mind automated downloads, ask them for an easier interface; otherwise, stop trying to do it.

UPDATE: as Jeremiah commented, they do indeed have a batch interface:
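To make the mechanism concrete, here is a rough sketch of how such a hidden-form submission could be reproduced by hand. It is built on assumptions, not on the real page: the field names in the usage note are invented, the form is assumed to use POST, and because the real fields are apparently filled in by JavaScript, their values would have to be worked out by reading the script and supplied through overrides. HiddenFormParser and build_form_request are names made up for this example, and nothing is sent over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request


class HiddenFormParser(HTMLParser):
    """Record the first form's action and its hidden input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = a.get("action") or ""
        elif tag == "input" and self._in_form:
            is_hidden = (a.get("type") or "").lower() == "hidden"
            if is_hidden and a.get("name"):
                self.fields[a["name"]] = a.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # only the first form is of interest here


def build_form_request(html, page_url, overrides=None):
    """Build (but do not send) a POST request mimicking the form submission."""
    parser = HiddenFormParser()
    parser.feed(html)
    if parser.action is None:
        raise ValueError("no form found in the page")
    fields = dict(parser.fields)
    fields.update(overrides or {})  # values JavaScript would normally fill in
    data = urlencode(fields).encode("ascii")
    return Request(urljoin(page_url, parser.action), data=data)
```

For example, build_form_request(html, page_url, {"format": "txt"}) returns a Request that could be passed to urllib.request.urlopen. Whether the site permits this kind of automated access is a separate question from whether it is technically possible.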

http://www.optionsclearing.com/market-data/batch-processing.jsp




solution1
1 ACCEPTED 2011-02-20 23:06:21

solution2
1 2011-02-20 23:29:38
