
How can you access the info on a website via a program?

Suppose I want to write a program to read movie info from IMDb, music info from last.fm, or weather info from weather.com, etc. Just reading the webpage and parsing it is quite tedious. Often websites have an XML feed (such as last.fm's) set up exactly for this.

Is there a particular link/standard that websites follow for this feed? Like robots.txt, is there a similar standard for information feeds, or does each website have its own?

Websites provide different ways to access this data, such as web services, feeds, and endpoints for querying their data.

There are also programs that collect data from pages without using such standard techniques; these are called bots. They use various methods to extract data from websites (note: be careful, the data may be protected by copyright).

The most common such standards are RSS and the related Atom. Both are formats for XML syndication of web content. Most software libraries include components for parsing these formats, as they are widespread.
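For instance, here is a minimal Python sketch using the third-party feedparser library, which handles both RSS and Atom transparently (the library and the Stack Overflow feed URL are my choices for illustration, not anything prescribed by the formats):

# Minimal sketch: parse an RSS/Atom feed with the third-party
# feedparser library (pip install feedparser).
import feedparser

d = feedparser.parse("https://stackoverflow.com/feeds")  # Atom feed of new questions

print(d.feed.title)              # title of the feed itself
for entry in d.entries[:5]:      # first five items
    print(entry.title, "->", entry.link)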

This is the kind of problem RSS and Atom feeds were designed for, so look for a link to an RSS feed if there is one. They're both designed to be simple to parse, too. They normally appear on sites with regularly updated content, like news sites or blogs. If you're lucky, a site will provide several different RSS feeds for different aspects of the site (the way Stack Overflow does for questions, for instance).

Otherwise, the site may have an API you can use to get the data (as Facebook, Twitter, Google services, etc. do). Failing that, you'll have to resort to screen scraping, with the possible copyright and legal implications that involves.
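As a sketch of what using such an API looks like (GitHub's public REST API is used here purely as an example because it allows unauthenticated reads; most of the others require an API key or OAuth token):

import json
import urllib.request

# Query a JSON HTTP API with only the Python standard library.
req = urllib.request.Request(
    "https://api.github.com/repos/python/cpython",
    headers={"Accept": "application/vnd.github+json"},
)
with urllib.request.urlopen(req) as resp:
    repo = json.load(resp)

print(repo["full_name"], "-", repo["stargazers_count"], "stars")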

Sounds to me like you're referring to RSS or Atom feeds. These are specified for a given page in the source; for instance, open the source HTML for this very page and go to line 22.
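Feed readers locate that link automatically ("feed autodiscovery"). Here is a rough Python sketch of the same lookup using only the standard library (the example URL and User-Agent string are arbitrary):

from html.parser import HTMLParser
import urllib.request

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    # Collect href values from <link rel="alternate" type="...rss/atom..."> tags.
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" and a.get("type") in FEED_TYPES:
            self.feeds.append(a.get("href"))

req = urllib.request.Request(
    "https://stackoverflow.com/",
    headers={"User-Agent": "feed-finder/0.1"},  # some sites reject the default UA
)
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")

finder = FeedLinkFinder()
finder.feed(html)
print(finder.feeds)  # relative or absolute feed URLs found in the page head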

Both Atom and RSS are standards. They are both XML-based, and there are many parsers for each.
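Because they are plain XML, you don't even need a dedicated feed library; here is a sketch using only Python's standard ElementTree parser against an Atom feed (the namespace URI is fixed by the Atom spec; the feed URL is just an example):

import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # namespace defined by the Atom spec

req = urllib.request.Request(
    "https://stackoverflow.com/feeds",
    headers={"User-Agent": "atom-demo/0.1"},
)
with urllib.request.urlopen(req) as resp:
    root = ET.parse(resp).getroot()

for entry in root.iter(ATOM + "entry"):
    title = entry.find(ATOM + "title").text
    link = entry.find(ATOM + "link").get("href")  # first <link> of the entry
    print(title, "->", link)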

You mentioned screen scraping as the "tedious" option; it is also normally against a website's terms of service, and doing it may get you blocked. Feed reading is, by definition, allowed.

There are a number of standards websites use for this, depending on what they are doing, and what they want to do.

RSS is a protocol for sending out formatted chunks of data in machine-parsable form. It stands for "Really Simple Syndication" and is usually used for news feeds, blogs, and other things where new content appears on a periodic or sporadic basis. There are dozens of RSS readers which allow one to subscribe to multiple RSS sources and periodically check them for new data. It is intended to be lightweight.

AJAX is a technique for sending requests from a web page to the web server and getting results back in machine-parsable form (typically JSON or XML). It is designed to work with JavaScript in the web client. The requests and replies are ordinary HTTP, but there is no single standard for their shape, so it tends to be up to the developers to know what commands are available via AJAX.
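Because those requests are ordinary HTTP, a program can replay them without a browser. A sketch (the endpoint below is hypothetical; in practice you discover the real one by watching the network tab of your browser's developer tools):

import json
import urllib.request

# Hypothetical AJAX endpoint, found via the browser's developer tools.
req = urllib.request.Request(
    "https://example.com/ajax/search?q=casablanca",
    headers={"X-Requested-With": "XMLHttpRequest"},  # header many AJAX backends check
)
with urllib.request.urlopen(req) as resp:
    results = json.load(resp)

print(results)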

SOAP is another protocol like AJAX, but its uses tend to be more program-to-program rather than web client to server. SOAP allows for auto-discovery of the available commands through a machine-readable file in WSDL format, which essentially specifies in XML the method signatures and types used by a particular SOAP interface.
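A sketch using the third-party Python library zeep, which reads the WSDL and turns each advertised operation into a callable (the calculator service below is a public test endpoint and may not stay available; substitute the WSDL you actually need):

from zeep import Client  # pip install zeep

# zeep downloads the WSDL and generates operations from it.
client = Client("http://www.dneonline.com/calculator.asmx?WSDL")

# Every operation listed in the WSDL becomes a method on client.service.
print(client.service.Add(intA=2, intB=3))  # -> 5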

Not all sites use RSS, AJAX, or SOAP. Last.fm, one of the examples you listed, does not seem to support RSS and uses its own web-based API for getting information from the site. In those cases, you have to find out what the API is (Last.fm's appears to be well documented, however).
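For example, a call to Last.fm's documented web API is just an HTTP GET returning JSON (YOUR_API_KEY is a placeholder; you get a real key by registering at last.fm/api, and the response keys below follow their documentation):

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "method": "artist.getinfo",
    "artist": "Radiohead",
    "api_key": "YOUR_API_KEY",  # placeholder; register at last.fm/api for a real one
    "format": "json",
})
with urllib.request.urlopen("https://ws.audioscrobbler.com/2.0/?" + params) as resp:
    info = json.load(resp)

print(info["artist"]["name"], "-", info["artist"]["stats"]["listeners"], "listeners")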

Choosing the method of obtaining data depends on the application. If it's a public or commercial application, screen scraping won't be an option. (E.g., if you want to use IMDb information commercially, their usage policy requires a contract costing $15,000 or more.)

I think your problem isn't not knowing the standard procedure for obtaining website information, but rather not realizing that your inability to obtain data is due to websites not wanting to provide it.

If a website wants you to be able to use its information, there will almost certainly be a well-documented API with various standard protocols for queries.

A list of APIs can be found here.

Data formats listed at that particular site are: CSV, GeoRSS, HTML, JSON, KML, OPML, OpenSearch, PHP, RDF, RSS, Text, XML, XSPF, and YAML.
