
How to determine the underlying URL of text file download

On the page below there is an option to download a txt file. I'm interested in the first file in the txt section.

How do I get its URL? I can pull the file once I have the URL. How do I get, with Python, a URL that does not involve JavaScript?

Today it's: volume.20110218.txt.

http://www.optionsclearing.com/webapps/trade-volume-download

Your question is a bit vague. It sounds like you'd like to do something with the urllib2 and BeautifulSoup modules.

Fetch the HTML from the base URL with urllib2's functions, parse it with BeautifulSoup, and use the target (the value of the href attribute) of the (first TXT?) anchor tag in the table to open another connection and pull those contents. Then open your local file (or subprocess) and feed the contents of the second fetch to it.
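The steps above could be sketched roughly as follows. This is a minimal sketch, not a tested scraper for this page: it uses Python 3's urllib.request (the successor to urllib2) and the standard library's html.parser in place of BeautifulSoup so that it runs without third-party packages. The helper names (TxtLinkFinder, first_txt_link, download), the example.com URL, and the sample HTML are made up for illustration. It only helps when the anchor's href is a plain URL, which may not be the case on the page in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class TxtLinkFinder(HTMLParser):
    """Collect the href of every <a> tag whose target ends in .txt."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".txt"):
            self.links.append(href)


def first_txt_link(html, base_url):
    """Return the absolute URL of the first .txt link, or None if there is none."""
    finder = TxtLinkFinder()
    finder.feed(html)
    return urljoin(base_url, finder.links[0]) if finder.links else None


def download(page_url, dest_path):
    """Fetch the page, find the first .txt link, and save that file locally."""
    with urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    target = first_txt_link(html, page_url)
    if target is None:
        raise ValueError("no plain .txt link found on the page")
    with urlopen(target) as resp, open(dest_path, "wb") as out:
        out.write(resp.read())
    return target


# Offline check with made-up HTML (an assumption, not the real page's markup).
sample = '<table><tr><td><a href="/files/volume.20110218.txt">TXT</a></td></tr></table>'
print(first_txt_link(sample, "http://www.example.com/downloads"))
```

Calling download(page_url, "volume.txt") would perform the two fetches; it raises ValueError when the page exposes no plain .txt link, for instance when every link is a javascript: link.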

The toughest part of using BeautifulSoup is finding the characteristics that uniquely identify the part of the content you want to extract. Modern HTML is pretty ugly and tends to have lots of extraneous garbage embedded in it by the various tools and libraries that were used to generate it. (One tip: the word "class" is a Python reserved keyword as well as a common attribute in HTML. Thus you'll find it easiest to pass "class" attribute/pattern pairs to BeautifulSoup functions by wrapping them in a dictionary, {'class': some_pattern}, rather than in the more common keyword=pattern form that's used for most other arguments.)

To handle the JavaScript you might want to read:
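As a small illustration of the "class" tip, here is a sketch that assumes the third-party bs4 package (BeautifulSoup 4) is installed. The HTML and the class names txt-link and csv-link are invented for the example and are not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Made-up markup; the real page's structure and class names are unknown here.
html = """
<table>
  <tr><td><a class="txt-link" href="a.txt">TXT</a></td></tr>
  <tr><td><a class="csv-link" href="a.csv">CSV</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# `class` is a reserved word, so find_all("a", class="txt-link") would be a
# syntax error; pass the attribute filter as a dictionary instead.
txt_anchors = soup.find_all("a", {"class": "txt-link"})
hrefs = [a["href"] for a in txt_anchors]
print(hrefs)
```

BeautifulSoup 4 also accepts the keyword form class_="txt-link", with a trailing underscore, which avoids the reserved word.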

What's a good tool to screen-scrape with Javascript support?

It sounds like your best bet, currently, may be to set up the Java-based HTMLUnit package to serve as a gateway, then write your Python to connect to and control that. You might also try Selenium to control a real browser session and extract information from it via inter-process communication mechanisms.

The page uses JavaScript links to submit a hidden form in order to download the file. The form's hidden fields also seem to be filled in by JavaScript.

Seems like they do this in order to make automated downloads harder to accomplish. If they don't mind automated downloads, ask them for an easier interface; otherwise, stop trying to do it.

UPDATE: as commented by Jeremiah, they indeed have a batch interface:
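To illustrate the mechanism described here, the following sketch collects a form's hidden inputs and builds the POST request that a script would otherwise submit. It is only a sketch under assumptions: the field names (format, reportDate), their values, and the example.com URL are hypothetical, and the real page's field names and the values its JavaScript fills in would have to be observed separately. It uses only the Python 3 standard library and builds the request without sending it.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


class HiddenFieldCollector(HTMLParser):
    """Record the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def build_form_post(html, action_url, overrides):
    """Build (but do not send) a POST request carrying the form's hidden fields.

    `overrides` supplies the values that the page's JavaScript would normally
    fill in before submitting the form.
    """
    collector = HiddenFieldCollector()
    collector.feed(html)
    fields = dict(collector.fields)
    fields.update(overrides)
    body = urlencode(fields).encode("ascii")
    return Request(action_url, data=body)


# Hypothetical form: these field names are assumptions, not the real page's.
sample = (
    '<form action="/download" method="post">'
    '<input type="hidden" name="format" value="">'
    '<input type="hidden" name="reportDate" value="">'
    '</form>'
)
req = build_form_post(sample, "http://www.example.com/download",
                      {"format": "txt", "reportDate": "20110218"})
print(req.get_method(), req.data)
```

Passing the resulting request to urllib.request.urlopen would submit it.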

http://www.optionsclearing.com/market-data/batch-processing.jsp
