
How to use file_get_contents in PHP

I used to scrape a website for information using file_get_contents in PHP. Now, every time I try to scrape the page, it only returns:

<html><head><meta http-equiv="Refresh" content="0; URL=http://website.com/latest.php?ckattempt=1"></head><body></body></html>

This is the code I was using, which used to work:

$opts = array(
    'http'=>array(
        'method'=>"GET",
        // Each header line must end with \r\n, otherwise Referer and Cookie merge into one header
        'header'=>"Accept-language: en\r\n".
                  "Referer: ".$url."/index.php\r\n".
                  "Cookie: id=<id token>; auth=<auth token>\r\n"
    )
);
$context = stream_context_create($opts);
$html = file_get_contents($url.'/latest.php?ckattempt=0', false, $context);

I assume it has something to do with the refresh meta tag. Does anyone know a way around this so that I can scrape the page again?

If I understand your question correctly, the problem is that the page you used to load from the target server has changed. Instead of serving the old content, it now uses a meta tag (called meta refresh) to forward the client to another page (http://website.com/latest.php?ckattempt=1 in this particular example).

Read about meta refresh here.

What you need to do to get to the data you want is probably to follow that link: load the URL given in the meta tag and read the data from there.

cURL can follow HTTP redirects, but I doubt it will follow a meta tag, since this is a rather outdated method of forwarding and, as far as I remember, cURL does not parse the incoming HTML at all.

Use of meta refresh is discouraged by the World Wide Web Consortium (W3C).

Your best option in this case is to parse the incoming data, pick out the information you need (the URL), and load that URL instead.

You could do this using a regex. See this question about which regex to use to detect a link in a string.
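As a rough illustration of the regex approach, here is a minimal sketch. The function name extract_meta_refresh_url is made up for this example and is not part of PHP or of the original answer. It assumes the tag looks roughly like the one in the question (a delay, a semicolon, then url=...) and it will not cope with every way such a tag can be written.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns the target URL of a
// <meta http-equiv="refresh"> tag, or null if the page contains none.
function extract_meta_refresh_url(string $html): ?string
{
    // Collect every <meta ...> tag first so the attribute order does not matter.
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        // Skip meta tags whose http-equiv attribute is not "refresh".
        if (!preg_match('/http-equiv\s*=\s*["\']?refresh\b/i', $tag)) {
            continue;
        }
        // content="0; URL=http://..." : capture what follows "url=" up to the closing quote.
        if (preg_match('/content\s*=\s*(["\'])\s*\d+\s*;\s*url\s*=\s*(.*?)\1/i', $tag, $m)) {
            // Strip optional inner quotes and decode entities such as &amp;.
            return html_entity_decode(trim($m[2], " '\""));
        }
    }
    return null;
}
```

Called on the HTML from the question, this should give back the latest.php?ckattempt=1 URL, and null for a page without a refresh tag.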

Abstract steps:

  • Load the page using your usual file_get_contents() call
  • Parse the incoming page and check whether it contains a meta tag with the http-equiv attribute set to refresh
  • If you find this tag, pass the contents you received to a function that extracts the target URL
  • Use file_get_contents() on that target URL to get the data you are after
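The steps above could be sketched roughly as follows. The function name fetch_following_meta_refresh, the $fetch callable and the $maxHops limit are all assumptions made for this example. The pattern is deliberately simplified: it expects http-equiv to come before content, and it expects an absolute target URL, as in the question. The hop limit is there because the target page might itself answer with another refresh tag.

```php
<?php
// Hypothetical helper sketching the four steps. $fetch is any callable that
// takes a URL and returns the page body as a string, or false on failure.
function fetch_following_meta_refresh(string $url, callable $fetch, int $maxHops = 5)
{
    // Simplified pattern: assumes http-equiv appears before content in the tag.
    $pattern = '/<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*'
             . 'content\s*=\s*["\']\s*\d+\s*;\s*url\s*=\s*([^"\'>]+)/i';

    for ($hop = 0; $hop <= $maxHops; $hop++) {
        // Step 1: load the page.
        $html = $fetch($url);
        if ($html === false) {
            return false;
        }
        // Steps 2 and 3: look for a meta refresh tag and extract its target URL.
        if (!preg_match($pattern, $html, $m)) {
            return $html; // no refresh tag, so this is the page we wanted
        }
        // Step 4: load the target URL on the next iteration.
        $url = trim($m[1]);
    }
    return false; // gave up after following $maxHops refresh tags
}
```

With the setup from the question, the callable could be a closure such as fn($u) => file_get_contents($u, false, $context), so that the same headers and cookies are sent on every request.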
