
Facebook debugger won't scrape my site

I'm creating the site http://Meer.li and when I run it through the Facebook debugger - http://developers.facebook.com/tools/debug/og/object?q=meer.li - it can't find my meta tags.

When I look at the source of what Facebook scrapes, it shows a stripped-down version of my site in which the doctype has been changed and there are no meta tags - http://developers.facebook.com/tools/debug/og/echo?q=http%3A%2F%2Fmeer.li%2F .

What am I doing wrong here?

I'm running Rails 3.2 and Ruby 1.9.3, and the whole thing is running on Heroku with a Mongo database.

Edit

It seems that I do have the right Accept header in my app... if I do this in the different views:

<%= request.headers["Accept"] %>

I get:

text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Why can we scrape the whole site if we use curl -H with the right headers? Why doesn't Facebook scrape my site?

When I try your URL in the debugger, it says that the response status code is 206, which means "Partial Content".

I tried to curl the URL and the response I got is indeed partial: it does not include the html, head and body tags (or their closing tags), and it looks like a JSONP response of HTML wrapped in:

$("#designs_content").append

I'm not sure why that happens. Maybe your server checks the user agent string of each request and responds according to that?


Edit

I'm not sure if this has anything to do with Heroku; I've never worked with them. Also, I know nothing about Rails, so I can't help with that.

Wget has nothing to do with this; what matters is the response that your web server returns based on the headers of the HTTP request. When you make a request using a browser, it adds some headers to the request to help the server figure out a few things. You can view the sent headers in the network tab of Firebug or of the developer tools in Chrome (Safari, etc.; they all have one), or by using a network sniffer.

To make life easier for you, I checked which header causes this problem for you... try this:

curl "http://meer.li/"

And you'll see that the response is JSONP and not the entire HTML page. Now try this:

curl -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" "http://meer.li/"

And you'll get the full HTML version of your page.
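The same comparison can be scripted. The following is only a minimal sketch using Ruby's standard Net::HTTP library; the helper names `build_request`, `fetch_body` and `full_html_document?` are made up for this illustration, and the "full page" check is a rough heuristic, not a real HTML validator.

```ruby
require "net/http"
require "uri"

# The browser-style Accept header used in the curl command above.
BROWSER_ACCEPT = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"

# Hypothetical helper: build a GET request, optionally with an explicit Accept header.
def build_request(url, accept = nil)
  request = Net::HTTP::Get.new(URI(url))
  request["Accept"] = accept if accept
  request
end

# Hypothetical helper: send the request and return the response body as a string.
def fetch_body(url, accept = nil)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(build_request(url, accept))
  end
  response.body.to_s
end

# Rough heuristic (an assumption of this sketch): a complete page contains
# both an opening and a closing <html> tag.
def full_html_document?(body)
  text = body.to_s
  text.match?(/<html[\s>]/i) && text.match?(/<\/html>/i)
end
```

To compare the two cases, call `full_html_document?(fetch_body("http://meer.li/"))` and `full_html_document?(fetch_body("http://meer.li/", BROWSER_ACCEPT))`. If the site still behaves as described above, only the second call should report a complete document.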

Since Facebook, when scraping your page, does not send that "Accept" header, the response is not what you see when you view the source using the browser.

I have no idea how you can solve this, since it surely depends on your specific setup, but now at least you know what the problem is.
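For illustration only, here is one way a server could end up behaving like this. The sketch below is a simplified, hypothetical model of content negotiation, not Rails' actual algorithm and not necessarily what happens on meer.li: it assumes the server serves the first format it has declared whenever the Accept header names no specific type it knows. The names `MIME_TO_FORMAT` and `pick_format` are invented for the example.

```ruby
# Simplified, hypothetical model of server-side content negotiation.
# This is NOT Rails' real implementation; q-values are ignored on purpose.

MIME_TO_FORMAT = {
  "text/html" => :html,
  "application/xhtml+xml" => :html,
  "text/javascript" => :js,
  "application/javascript" => :js
}.freeze

# accept_header: raw Accept header value, or nil when the client sent none.
# declared: formats the server can render, in the order they were declared.
def pick_format(accept_header, declared)
  types = accept_header.to_s.split(",").map { |part| part.split(";").first.to_s.strip }
  types.each do |type|
    format = MIME_TO_FORMAT[type]
    return format if format && declared.include?(format)
  end
  # Nothing specific matched (header missing, or only "*/*"):
  # fall back to the first declared format.
  declared.first
end
```

Under this model, a server that declares JavaScript before HTML would answer a browser with HTML but answer a client sending no Accept header, or only `*/*`, with JavaScript. If something similar is going on, making HTML the default format would be the thing to try.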
