Web scraping font access issue

I am working on web scraping for one of our client's sites. Everything works fine, except that the font does not load. I get the following error in the Chrome console:

Access to Font at 'https://www.example.com/fonts/fontawesome-webfont.woff?v=4.2.0' from origin 'http://www.mydomain' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://www.mydomain' is therefore not allowed access.

I have tried putting the following code in the .htaccess file of http://www.mydomain, but no luck:

.htaccess

<IfModule mod_headers.c>
  <FilesMatch "\.(ttf|ttc|otf|eot|woff|font.css|css)$">
    Header set Access-Control-Allow-Origin "*"
    Header set Access-Control-Allow-Headers "Cache-Control, Pragma, Origin, Authorization, Content-Type, X-Requested-With"
    Header set Access-Control-Allow-Methods "GET, PUT, POST"
  </FilesMatch>
</IfModule>

Note: I cannot make any changes to https://www.example.com, and caching is also disabled in my browser.

PHP code for web scraping:

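// $url holds the address of the page to scrape (set elsewhere).
$cookie  = 'cookies.txt'; // cookie jar reused across requests
$timeout = 90;            // connection timeout in seconds

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);  // skips certificate verification
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 400);           // overall timeout in seconds
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);     // write cookies here on close
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);    // read cookies from here
curl_setopt($ch, CURLOPT_USERAGENT,
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
curl_setopt($ch, CURLOPT_FILETIME, true);         // ask for the remote modification time
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
echo $curl_scraped_page;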

EDIT

The Apache headers module is also enabled.

To access the font on the server www.example.com from a website served by www.mydomain, the server www.example.com needs to allow the request from www.mydomain. That means the response www.example.com sends to the HTTP (GET) request must contain at least the following header:

Access-Control-Allow-Origin: http://www.mydomain
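
As a rough illustration of the rule above, the following PHP sketch checks a list of response header lines for a matching Access-Control-Allow-Origin value. The function name cors_allows_origin and the sample header lines are made up for this example; it only mirrors the basic origin check, not the full CORS algorithm a browser runs.

```php
<?php
// Hypothetical helper (not part of any library): returns true when the given
// response header lines would let $origin read the resource.
function cors_allows_origin(array $headerLines, string $origin): bool
{
    foreach ($headerLines as $line) {
        // Split on the first colon only, so the "http://" in the value survives.
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue; // status line or malformed header
        }
        // Header names are case-insensitive.
        if (strcasecmp(trim($parts[0]), 'Access-Control-Allow-Origin') !== 0) {
            continue;
        }
        $value = trim($parts[1]);
        // "*" allows any origin; otherwise the value must equal the origin exactly.
        return $value === '*' || $value === $origin;
    }
    return false; // header absent: the browser blocks the font
}

// Hand-written sample headers, not a real response from www.example.com.
$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: application/font-woff',
    'Access-Control-Allow-Origin: http://www.mydomain',
];
var_dump(cors_allows_origin($headers, 'http://www.mydomain'));
var_dump(cors_allows_origin($headers, 'http://other.example'));
```

The header lines could come from PHP's get_headers() called on the font URL. Note that get_headers() does not send an Origin header by default, so a server that varies its answer by origin may respond differently than it does to the browser.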

If you have no control over the configuration of www.example.com, you need to download the font resource as well, place it alongside the scraped content, and change the link to point to it. See the Q&A reference resource "How do you parse and process HTML/XML in PHP?" for an introduction to HTML processing with PHP. There are also ready-made PHP scraping libraries that can support you in this task.
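
The "change the link" step might look roughly like the sketch below. It assumes the font URL appears as an absolute URL in the content you serve (HTML or CSS), and that the font files have already been saved under a local path. The function name rewrite_font_urls and the /fonts path are illustrative choices, not something from the question.

```php
<?php
// Sketch: rewrite absolute font URLs on a remote host to a local directory.
// $remoteBase and $localBase are assumptions chosen for this example.
function rewrite_font_urls(string $content, string $remoteBase, string $localBase): string
{
    // Remote base, then a path ending in a font extension, then an optional
    // query string such as "?v=4.2.0" that is dropped from the local URL.
    $pattern = '#' . preg_quote($remoteBase, '#')
             . '/([^\s"\'()?\#]+\.(?:woff2?|ttf|otf|eot))\b(\?[^\s"\'()]*)?#i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($localBase) {
            // Keep only the file name; fonts with the same name in different
            // remote directories would collide under this simple scheme.
            return rtrim($localBase, '/') . '/' . basename($m[1]);
        },
        $content
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $content;
}

$css = "src: url('https://www.example.com/fonts/fontawesome-webfont.woff?v=4.2.0') format('woff');";
echo rewrite_font_urls($css, 'https://www.example.com', '/fonts'), "\n";
// prints: src: url('/fonts/fontawesome-webfont.woff') format('woff');
```

If the font is referenced from a stylesheet that is itself loaded from www.example.com, that stylesheet would have to be downloaded, rewritten, and served locally too. The font file still has to be fetched once, for instance with the same cURL setup as in the question.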

There are many reasons this may not be working for you.

  1. Web server configuration: your web server is not configured to honor per-directory .htaccess files. You will have to set the AllowOverride directive correctly (for Apache) in the right place (usually apache2.conf).
  2. You are using software (e.g. WordPress) that rewrites your homepage request to an http version.
  3. You are using only the https version of the font resource.

In the latter case, you can rewrite the script to load the resources based on the request protocol, e.g.:

//maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css

This will allow the browser to use either http or https based on the request, provided you have access to the source code of example.com. If you don't, it is far better to scrape the https version of example.com than to hack the CORS configuration.
