Can JavaScript read the source of any web page?

I am working on screen scraping and want to retrieve the source code of a particular page.

How can I achieve this with JavaScript? Please help me.

A simple way to start is to try jQuery:

$("#links").load("/Main_Page #jq-p-Getting-Started li");

More in the jQuery docs.

Another way to do screen scraping in a much more structured way is to use YQL (Yahoo Query Language). It returns the scraped data structured as JSON or XML.
For example, let's scrape stackoverflow.com:

select * from html where url="http://stackoverflow.com"

will give you JSON (I chose that option) like this:

 "results": {
   "body": {
    "noscript": [
     {
      "div": {
       "id": "noscript-padding"
      }
     },
     {
      "div": {
       "id": "noscript-warning",
       "p": "Stack Overflow works best with JavaScript enabled"
      }
     }
    ],
    "div": [
     {
      "id": "notify-container"
     },
     {
      "div": [
       {
        "id": "header",
        "div": [
         {
          "id": "hlogo",
          "a": {
           "href": "/",
           "img": {
            "alt": "logo homepage",
            "height": "70",
            "src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png",
            "width": "250"
           }
……..

The beauty of this is that you can use projections and where clauses, which ultimately get you structured scraped data containing only what you need (and much less bandwidth over the wire). For example:

select * from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

will get you

 "results": {
   "a": [
    {
     "href": "/questions/414690/iphone-simulator-port-for-windows-closed",
     "title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ",
     "content": "iphone\n                simulator port for windows [closed]"
    },
    {
     "href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application",
     "title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ",
     "content": "How\n                to redirect the web page in flex application ?"
    },
…..

Now, to get only the questions, we do a

select title from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

Note the title in the projection:

 "results": {
   "a": [
    {
     "title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … "
    },
    {
     "title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … "
    },
    {
     "title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … "
    },
    {
     "title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … "
    },
    {
……

Once you write your query, it generates a URL for you:

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.
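The generated URL is just the query text, percent-encoded and appended to the public YQL endpoint. Here is a minimal sketch of building it by hand, assuming the endpoint and the format and callback parameters seen in the URL above. buildYqlUrl is a hypothetical helper name, not part of jQuery or YQL, and the query is written on one line, so the result lacks the encoded line breaks of the generated URL.

```javascript
// Hypothetical helper: builds a YQL request URL like the generated one above.
// The endpoint and the format/callback parameters are copied from that URL.
function buildYqlUrl(query, callbackName) {
  var endpoint = 'http://query.yahooapis.com/v1/public/yql';
  return endpoint +
    '?q=' + encodeURIComponent(query) +
    '&format=json' +
    '&callback=' + encodeURIComponent(callbackName);
}

var yqlUrl = buildYqlUrl(
  'select title from html where url="http://stackoverflow.com" and xpath=\'//div/h3/a\'',
  'cbfunc'
);
console.log(yqlUrl);
```

Building the URL this way keeps the query readable in your source, and encodeURIComponent handles the quotes, slashes, and spaces.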

So ultimately you end up doing something like this:

$.getJSON(theAboveUrl, function (titleList) { console.log(titleList); });

and play with it.

Beautiful, isn't it?
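Once the response is parsed, pulling the titles out is a plain array walk. Here is a rough sketch, assuming the parsed object has the "results" shape shown in the projection output above. extractTitles is a made-up name, and the sample object is trimmed illustrative data, not real output.

```javascript
// Hypothetical helper: collects the "title" of every link in a parsed result
// shaped like { results: { a: [ { title: "..." }, ... ] } }.
function extractTitles(data) {
  var links = (data && data.results && data.results.a) || [];
  return links.map(function (link) {
    return link.title;
  });
}

// Illustrative sample in the same shape as the projection output above.
var sample = {
  results: {
    a: [
      { title: 'First question' },
      { title: 'Second question' }
    ]
  }
};

console.log(extractTitles(sample));
```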

JavaScript can be used, as long as you grab whatever page you're after via a proxy on your domain:

<html>
<head>
<script src="/js/jquery-1.3.2.js"></script>
</head>
<body>
<script>
$.get("http://www.mydomain.com/?url=www.google.com", function(response) { 
    alert(response) 
});
</script>
</body>
</html>
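On the server side, the proxy has to pull the target address out of the url parameter before it can fetch anything. Here is a sketch of just that step, written as JavaScript that could run under Node.js. readTargetUrl is a hypothetical helper, and treating a bare host such as www.google.com as http is my assumption, not something the answer specifies.

```javascript
// Hypothetical helper for the proxy: extracts the page to fetch from a
// request such as "/?url=www.google.com". Returns null when it is missing.
function readTargetUrl(requestUrl) {
  // The base is only needed so that relative request paths can be parsed.
  var parsed = new URL(requestUrl, 'http://www.mydomain.com');
  var target = parsed.searchParams.get('url');
  if (!target) {
    return null;
  }
  // Assumption: a bare host without a scheme is meant to be fetched over http.
  if (!/^https?:\/\//i.test(target)) {
    target = 'http://' + target;
  }
  try {
    return new URL(target).href;
  } catch (e) {
    return null; // not a parseable URL
  }
}

console.log(readTargetUrl('/?url=www.google.com'));
```

The proxy would then request the returned address itself and send the body back to the page, which makes the browser's request same-origin.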

You can use fetch:

const URL = 'https://www.sap.com/belgique/index.html';
fetch(URL)
.then(res => res.text())
.then(text => {
    console.log(text);
})
.catch(err => console.log(err));

You could simply use XMLHttpRequest (AJAX) to hit the required URL; the HTML response from the URL will be available in the responseText property. If it's not the same domain, your users will receive a browser alert saying something like "This page is trying to access a different domain. Do you want to allow this?"

As a security measure, JavaScript can't read files from different domains. Though there might be some strange workaround for it, I'd consider a different language for this task.
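For reference, "different domain" here means a different origin: the scheme, host, and port all have to match. Here is a small sketch of that comparison using the standard URL class. isSameOrigin is a hypothetical name and the addresses are placeholders.

```javascript
// Hypothetical helper: two URLs share an origin only when scheme, host,
// and port are all equal.
function isSameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

console.log(isSameOrigin('http://example.com/a', 'http://example.com/b'));    // true
console.log(isSameOrigin('http://example.com/', 'https://example.com/'));     // false
console.log(isSameOrigin('http://example.com/', 'http://example.com:8080/')); // false
```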

I used ImportIO. They let you request the HTML from any website if you set up an account with them (which is free). They let you make up to 50k requests per year. I didn't take the time to find an alternative, but I'm sure there are some.

In your JavaScript, you'll basically just make a GET request like this:

var request = new XMLHttpRequest();
request.onreadystatechange = function() {
    // readyState 4 means the response has fully arrived
    if (request.readyState === 4) {
        var jsontext = request.responseText;
        alert(jsontext);
    }
};
request.open("GET", "https://extraction.import.io/query/extractor/THE_PUBLIC_LINK_THEY_GIVE_YOU?_apikey=YOUR_KEY&url=YOUR_URL", true);
request.send();

Sidenote: I found this question while researching what I felt was the same question, so others might find my solution helpful.

UPDATE: I created a new one, which they allowed me to use for less than 48 hours before they said I had to pay for the service. It seems that they shut down your project pretty quickly now if you aren't paying. I made my own similar service with NodeJS and a library called NightmareJS. You can see their tutorial here and create your own web scraping tool. It's relatively easy. I haven't tried to set it up as an API that I could make requests to or anything.

If you absolutely need to use JavaScript, you could load the page source with an Ajax request.

Note that with JavaScript, you can only retrieve pages that are located under the same domain as the requesting page.

Using jQuery:

<html>
<head>
<script src="http://jqueryjs.googlecode.com/files/jquery-1.3.2.js" ></script>
</head>
<body>
<script>
$.get("http://www.google.com", function(response) { alert(response) });
</script>
</body>
</html>

You can bypass the same-origin policy by either creating a browser extension or even saving the file as .hta on Windows (HTML Application).

Despite many comments to the contrary, I believe that it is possible to overcome the same-origin requirement with simple JavaScript.

I am not claiming that the following is original, because I believe I saw something similar elsewhere a while ago.

I have only tested this with Safari on a Mac.

The following demonstration fetches the page in the base tag and moves its innerHTML to a new window. My script adds html tags, but with most modern browsers this could be avoided by using outerHTML.

<html>
<head>
<base href='http://apod.nasa.gov/apod/'>
<title>test</title>
<style>
body { margin: 0 }
textarea { outline: none; padding: 2em; width: 100%; height: 100% }
</style>
</head>
<body onload="w=window.open('#'); x=document.getElementById('t'); a='<html>\n'; b='\n</html>'; setTimeout('x.innerHTML=a+w.document.documentElement.innerHTML+b; w.close()',2000)">
<textarea id=t></textarea>
</body>
</html>

javascript:alert("Inspect Element On");
javascript:document.body.contentEditable = 'true';
document.designMode='on'; 
void 0;
javascript:alert(document.documentElement.innerHTML); 

Highlight this and drag it to your bookmarks bar, then click it when you want to edit and view the current site's source code.

You can create an XMLHttpRequest, request the page, and then read the responseText property to get the content.

You can use the FileReader API to get a file, and when selecting a file, put the URL of your web page into the selection box. Use this code:

function readFile() {
    // Take the first file chosen in the file input
    var f = document.getElementById("yourfileinput").files[0];
    if (f) {
        var r = new FileReader();
        r.onload = function(e) {
            alert(r.result);
        };
        r.readAsText(f);
    } else {
        alert("file could not be found");
    }
}

jQuery is not the way of doing things. Do it in pure JavaScript:

var r = new XMLHttpRequest();
r.open('GET', 'http://yahoo.com/', false); // false makes the request synchronous
r.send(null);
if (r.status == 200) { alert(r.responseText); }

<script>
    $.getJSON('http://www.whateverorigin.org/get?url=' + encodeURIComponent('https://example.com/') + '&callback=?', function (data) {
        alert(data.contents);
    });
</script>

Include jQuery and use this code to get the HTML of another website. Replace example.com with your website.

This method involves an external server fetching the site's HTML and sending it to you. :)

On Linux:

  1. download slimerjs (slimerjs.org)

  2. download firefox version 59

  3. add this environment variable: export SLIMERJSLAUNCHER=/home/en/Letöltések/firefox59/firefox/firefox

  4. in the slimerjs download directory, run this .js program (./slimerjs program.js):

     var page = require('webpage').create();
     page.open('http://www.google.com/search?q=görény', function() {
         page.render('goo2.pdf');
         phantom.exit();
     });

Use pdftotext to get the text on the page.


  
  
    const URL = 'https://www.w3schools.com';
    fetch(URL)
    .then(res => res.text())
    .then(text => {
        console.log(text);
    })
    .catch(err => console.log(err));
