
How can I download the raw html of an arbitrary web page into a Javascript string?

I have been using a Perl program to download and scrape Yahoo stock pages and convert the desired info to a JSON file, which I read into an HTML/JavaScript file for further processing and display.

I would like to avoid the Perl step and download the raw HTML directly into my JavaScript.

I understand that XMLHttpRequest will only download from the server that served the HTML file, not from an arbitrary web page.

How can I download the raw HTML of an arbitrary web page into a JavaScript string?

I'd prefer to do it with plain vanilla JavaScript if possible (well, jQuery would be OK).

Short answer: you can't do that unless all the pages are on the same domain, and they are not, since you would be going cross-domain.

JavaScript has its limitations, such as the same-origin policy. That is why you can't go cross-domain with your JavaScript. As you might expect, this is for security reasons.

What you can do:

  • Use XMLHttpRequest (XHR) if the combination scheme://domain:port of the target is the same as that of the page hosting the JavaScript that fetches the HTML.

  • I do happen to know that Firefox extensions are NOT limited by the cross-origin restriction, but that's about it.
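The scheme://domain:port rule from the first bullet can be checked in code. This is only a minimal sketch: `isSameOrigin` is a hypothetical helper name, the example.com addresses are placeholders, and it assumes an environment with the standard `URL` class (modern browsers or Node.js).

```javascript
// Hypothetical helper: true when both absolute URLs share scheme, host, and port.
// URL.origin normalizes default ports, so http://host and http://host:80 match.
function isSameOrigin(pageUrl, targetUrl) {
  return new URL(pageUrl).origin === new URL(targetUrl).origin;
}

// Placeholder addresses, for illustration only.
console.log(isSameOrigin("http://example.com/app.html", "http://example.com/data.json")); // true
console.log(isSameOrigin("http://example.com/app.html", "http://finance.yahoo.com/"));    // false
```

When the check returns false, a plain XHR from the page to the target is the cross-domain case described above.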

Okay, I have done some looking around. What you could do is use YQL, the Yahoo Query Language.

The YQL Web Service enables applications to query, filter, and combine data from different sources across the Internet. YQL statements have a SQL-like syntax, familiar to any developer with database experience. The following YQL statement, for example, retrieves geo data for Sunnyvale, CA:

select * from geo.places where text="sunnyvale, ca"

To access the YQL Web Service, a Web application can call HTTP GET, passing the YQL statement as a URL parameter, for example:

So let's say we were to scrape craigslist.com:

http://query.yahooapis.com/v1/public/yql?q=select * from html where url="http://craigslist.com"
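The URL above contains spaces and quotes, so in practice the YQL statement should be percent-encoded before it goes into the `q` parameter. Here is a rough sketch of that step. `buildYqlUrl` is a hypothetical helper, the `format=json` parameter is an assumption that does not appear in the URL above, and the endpoint is copied from this answer and may no longer be available.

```javascript
// Endpoint as quoted in this answer; it may no longer be reachable.
const YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: wraps a page URL in a YQL "html" query and
// percent-encodes the statement so it is safe inside a query string.
function buildYqlUrl(pageUrl, format = "json") {
  const statement = 'select * from html where url="' + pageUrl + '"';
  // format=json is an assumption, not part of the URL shown above.
  return YQL_ENDPOINT + "?q=" + encodeURIComponent(statement) + "&format=" + format;
}

console.log(buildYqlUrl("http://craigslist.com"));
```

The resulting string is what you would hand to XMLHttpRequest or jQuery in place of the target page's own address.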

You can see my query here: CraigslistQuery

This will give you JSON that looks like this (I have cut some parts out since it's huge):

       {
        "href": "#ASIA",
        "content": "Asia/Pacific/Middle East"
       },
       {
        "href": "#OCEANIA",
        "content": "Oceania"
       },
       {
        "href": "#LATAM",
        "content": "Latin America"
       },
       {
        "href": "#AF",
        "content": "Africa"
       }
      ]
     },
     {
      "id": "map",
      "style": "border: 1px solid #551A8B; background-color: #71A4CD;"
     },
     {
      "class": "colmask",
      "div": [
       {
        "class": "box box_1",
        "h4": [
         "Alabama",
         "Alaska",
         "Arizona",
         "Arkansas",
         "California",
         "Colorado",
         "Connecticut",
         "Delaware",
         "District of Columbia",
         "Florida",
         "Georgia",
         "Hawaii",
         "Idaho"
        ],
        "ul": [
         {
          "li": [
           {
            "a": {
             "href": "http://auburn.craigslist.org",
             "content": "auburn"
            }
           },

Should you then want just a specific part of that page, you can add conditions to the WHERE clause; in this case you will be using XPath.

Then it looks something like this:

select * from html where url="http://craigslist.com" and xpath="/div/div"

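This will give you just that portion of the page. Here's a result. Note that in this particular response "count" is 0 and "results" is null, so the XPath /div/div matched nothing on the page reached after the redirects: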

{
 "query": {
  "count": 0,
  "created": "2014-01-27T10:25:00Z",
  "lang": "en-US",
  "diagnostics": {
   "publiclyCallable": "true",
   "redirect": [
    {
     "from": "http://craigslist.com/",
     "status": "302",
     "content": "http://craigslist.org/"
    },
    {
     "from": "http://craigslist.org/",
     "status": "302",
     "content": "http://www.craigslist.org/"
    },
    {
     "from": "http://www.craigslist.org/",
     "status": "302",
     "content": "http://geo.craigslist.org/"
    },
    {
     "from": "http://geo.craigslist.org/",
     "status": "302",
     "content": "http://www.craigslist.org/about/sites"
    }
   ],
   "url": [
    {
     "execution-start-time": "0",
     "execution-stop-time": "1401",
     "execution-time": "1401",
     "content": "http://craigslist.com"
    },
    {
     "execution-start-time": "0",
     "execution-stop-time": "1401",
     "execution-time": "1401",
     "content": "http://craigslist.com"
    }
   ],
   "user-time": "1406",
   "service-time": "2783",
   "build-version": "0.2.2157"
  },
  "results": null
 }
}
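Since a response like the one above can come back with "results": null, it is worth checking for that before using the data. The following is only a sketch based on the field names visible in the sample ("query", "count", "results", "diagnostics.redirect"); `summarizeYqlResponse` is a hypothetical helper, and the sample string is a trimmed copy of the response above.

```javascript
// Hypothetical helper: parses a YQL JSON response and reports whether
// anything matched, plus the last redirect target if there was one.
function summarizeYqlResponse(responseText) {
  const query = JSON.parse(responseText).query;
  // Normalize to an array in case a single redirect is not wrapped in one.
  const redirects = [].concat((query.diagnostics && query.diagnostics.redirect) || []);
  return {
    matched: query.count > 0 && query.results !== null,
    results: query.results,
    finalUrl: redirects.length ? redirects[redirects.length - 1].content : null,
  };
}

// Trimmed copy of the response shown above.
const sampleResponse = JSON.stringify({
  query: {
    count: 0,
    diagnostics: {
      redirect: [
        { from: "http://craigslist.com/", status: "302", content: "http://craigslist.org/" },
        { from: "http://geo.craigslist.org/", status: "302", content: "http://www.craigslist.org/about/sites" },
      ],
    },
    results: null,
  },
});

const summary = summarizeYqlResponse(sampleResponse);
console.log(summary.matched);  // false
console.log(summary.finalUrl); // http://www.craigslist.org/about/sites
```

Here the request was redirected to a different page, so an XPath written against the original page may need adjusting.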


 