Web scraping font access issue
I am working on web scraping for one of our client's sites. Everything works fine, except for one issue: the font does not load. I am getting the following error in the Chrome console:
Access to Font at 'https://www.example.com/fonts/fontawesome-webfont.woff?v=4.2.0' from origin 'http://www.mydomain' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://www.mydomain' is therefore not allowed access.
I have tried putting the following code in the .htaccess file on http://www.mydomain, but no luck:
.htaccess
<IfModule mod_headers.c>
<FilesMatch "\.(ttf|ttc|otf|eot|woff|font.css|css)$">
Header set Access-Control-Allow-Origin "*"
Header set Access-Control-Allow-Headers "Cache-Control, Pragma, Origin, Authorization, Content-Type, X-Requested-With"
Header set Access-Control-Allow-Methods "GET, PUT, POST"
</FilesMatch>
</IfModule>
Note: I cannot make any changes to https://www.example.com, and caching is also disabled in my browser.
PHP code for web scraping:
$cookie = 'cookies.txt';
$timeout = 90;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 400);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout );
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_USERAGENT,
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
curl_setopt($ch, CURLOPT_FILETIME, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
echo $curl_scraped_page;
EDIT
The Apache headers module is also enabled.
To let the website on the server www.mydomain access the font on the server www.example.com, the server www.example.com needs to allow the request from www.mydomain. For that, the response from www.example.com to the HTTP request (GET) must contain (at least) the following header:
Access-Control-Allow-Origin: http://www.mydomain
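The rule above can be sketched as a small check. This is a simplified illustration, not the browser's full CORS algorithm (it ignores credentials and preflight requests), and the function name originAllowed is hypothetical:

```php
<?php
// Sketch only: originAllowed() is a hypothetical helper, not part of PHP.
// It mimics the basic rule: the response must carry an
// Access-Control-Allow-Origin header whose value is "*" or the exact origin.
function originAllowed(array $headerLines, string $origin): bool
{
    $name = 'Access-Control-Allow-Origin:';
    foreach ($headerLines as $line) {
        // Header names are case-insensitive.
        if (stripos($line, $name) === 0) {
            $value = trim(substr($line, strlen($name)));
            return $value === '*' || $value === $origin;
        }
    }
    // Header missing: this is the situation in the error message above.
    return false;
}

// No CORS header at all, as www.example.com currently responds.
var_dump(originAllowed(['HTTP/1.1 200 OK', 'Content-Type: font/woff'], 'http://www.mydomain'));
// The header the answer asks for.
var_dump(originAllowed(['Access-Control-Allow-Origin: http://www.mydomain'], 'http://www.mydomain'));
```

To check the real server, you could pass in the array returned by PHP's get_headers() for the font URL, which returns the response header lines in the same shape.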
If you have no control over configuring the server www.example.com in such a manner, you need to download the resource as well, place it with the scraped content, and change the link to point to it. See the Q&A reference resource "How do you parse and process HTML/XML in PHP?" for an introduction to HTML processing with PHP. There are also ready-made PHP libraries for scraping that can support you in your task.
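A minimal sketch of that download-and-relink approach follows. It assumes the font URLs appear as absolute URLs in content you serve yourself (for Font Awesome they normally sit in the stylesheet, so that CSS would need to be copied locally too). The helper names and the local /fonts/ path are hypothetical:

```php
<?php
// Sketch only: helper names and the local "/fonts/" path are assumptions.

// Save a remote file (e.g. a .woff font) next to the scraped content.
function saveRemoteFile(string $url, string $localPath): bool
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    curl_close($ch);
    // curl_exec() returns false on failure when RETURNTRANSFER is set.
    return $body !== false && file_put_contents($localPath, $body) !== false;
}

// Point font URLs at the local copy instead of the remote host.
function rewriteFontUrls(string $content, string $remoteBase, string $localBase): string
{
    return str_replace($remoteBase, $localBase, $content);
}

$css = "src: url('https://www.example.com/fonts/fontawesome-webfont.woff?v=4.2.0') format('woff');";
echo rewriteFontUrls($css, 'https://www.example.com/fonts/', '/fonts/');
// prints: src: url('/fonts/fontawesome-webfont.woff?v=4.2.0') format('woff');
```

Once the font is served from www.mydomain itself, the request is same-origin and no Access-Control-Allow-Origin header is needed.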
There are many reasons this may not be working for you.
One involves .htaccess: you will have to specify the AllowOverride directive correctly (for Apache) in the right place (usually apache2.conf). In the case of the latter, you can rewrite the script to load the resources based on the request protocol, e.g.:
//maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css
This will allow the browser to use either http or https based on the request, if you have access to the source code of example.com. If you don't, it's far better for you to scrape the https version of example.com than to hack the CORS configuration.
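As a rough illustration of the protocol-relative form shown above, a scraper could strip the scheme from absolute URLs before emitting them. The function name toProtocolRelative is hypothetical, and this only changes which scheme the browser picks; it does not add a CORS header on its own:

```php
<?php
// Sketch only: toProtocolRelative() is a hypothetical helper.
// Turns "http://host/path" or "https://host/path" into "//host/path".
function toProtocolRelative(string $url): string
{
    // Remove a leading "http:" or "https:" only when "//" follows it.
    return preg_replace('#^https?:(?=//)#i', '', $url) ?? $url;
}

echo toProtocolRelative('https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css');
// prints: //maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css
```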