Get scraped data from Facebook Graph API using URL reliably

My app needs to get the data that Facebook scrapes from URLs. Up until now we were getting it using

POST /?id={object-instance-id or object-url}&scrape=true

This is detailed in the updating object section of https://developers.facebook.com/docs/sharing/opengraph/using-objects

For example:

POST /?id=http://google.com
{
  "url": "http://www.google.com/",
  "type": "website",
  "title": "Google",
  "image": [
    {
      "url": "http://www.google.com/images/branding/googleg/1x/googleg_standard_color_128dp.png"
    }
  ],
  "description": "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.",
  "updated_time": "2015-10-06T11:34:58+0000",
  "id": "381702034999"
}

Notice the image section.
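For reference, the call above could be built roughly as follows. This is only a sketch: the unversioned `https://graph.facebook.com/` root, the `access_token` form field, and the helper name `build_scrape_request` are assumptions for illustration, not something taken from the documentation page linked above.

```python
import urllib.parse
import urllib.request

# Assumption: the unversioned Graph API root is used for the request.
GRAPH_ROOT = "https://graph.facebook.com/"


def build_scrape_request(target_url, access_token):
    """Build (but do not send) the POST /?id={url}&scrape=true request."""
    params = urllib.parse.urlencode({
        "id": target_url,
        "scrape": "true",
        "access_token": access_token,  # placeholder, supply a real token
    })
    # Passing form-encoded bytes as `data` makes this a POST body.
    return urllib.request.Request(
        GRAPH_ROOT, data=params.encode("ascii"), method="POST"
    )
```

Sending it would be `urllib.request.urlopen(build_scrape_request(url, token))`. Note that `urlopen` raises `urllib.error.HTTPError` for non-2xx responses, so an error body has to be read from the exception rather than from the return value.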

Unfortunately, if the og tags are configured wrongly on the server:

POST /?id=http://some.page.with.bad.tags.com
{
  "error": {
    "message": "Invalid parameter",
    "type": "FacebookApiException",
    "code": 100,
    "error_subcode": 1611016,
    "is_transient": false,
    "error_user_title": "Object Invalid Value",
    "error_user_msg": "Object at URL 'http://some.page.with.bad.tags' of type '' is invalid because the given value '/some-bad-value' for property 'og:url' could not be parsed as type 'url'.",
    "fbtrace_id": "abcabcabc"
  }
}

This returns nothing interesting.
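On the client side, the two response shapes shown so far (a scraped object or an `error` envelope) can be told apart like this. It is a minimal sketch that relies only on the field names visible in the responses above; the function name `parse_graph_response` and the returned dictionary layout are made up for illustration.

```python
import json


def parse_graph_response(body):
    """Return (data, error) for a Graph API JSON body; one of them is None."""
    payload = json.loads(body)
    error = payload.get("error")
    if error is None:
        return payload, None
    # Keep only fields that appear in the error response shown above.
    return None, {
        "code": error.get("code"),
        "subcode": error.get("error_subcode"),
        "transient": error.get("is_transient", False),
        "message": error.get("error_user_msg") or error.get("message"),
    }
```

A caller can then decide whether to retry (when `transient` is true) or to fall back to another source of preview data (when it is false, as in the og:url case above).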

An attempt to GET the URL returns this:

{
  "og_object": {
    "id": "381702034999",
    "description": "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.",
    "title": "Google",
    "type": "website",
    "updated_time": "2015-10-06T11:40:04+0000",
    "url": "http://www.google.com/"
  },
  "share": {
    "comment_count": 2,
    "share_count": 13494003
  },
  "id": "http://www.google.com"
}

This is missing the image section. I cannot find any way in the documentation to retrieve a result with images without using POST /?id={url}, but that fails on any error in the og tags.

GET /{ObjectId}

returns only type and created_time.

Entering the same broken link in https://developers.facebook.com/tools/debug/ results in a page which contains the image, description, title and captions for the page, which is what I need. So it means Facebook stores them even though the page has wrong tags, but I need a way to fetch them. Unfortunately, I cannot provide a link to the broken URL due to an NDA, and I couldn't find another page with broken tags.

If the page contains invalid Open Graph markup, this seems expected. Also, do not confuse the Graph API with some sort of data source or scraping service you can use to generate previews for web content.

If Facebook, for whatever reason, can't parse the Open Graph tags of a URL, it will try to make a good guess based on the content of the page (large chunks of text, images it finds, title tags, etc.) to build the preview. So you might get some sort of data back from GET /{object-id}, which can be just a guess instead of actual og:.. data.

In case you really need a more or less failsafe solution, you could build your own scraper that looks for Open Graph tags.
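A minimal sketch of such a scraper, using only Python's standard library, might look like the following. The names `OpenGraphParser` and `extract_preview` are hypothetical, fetching the page is left out, and the `<title>` fallback is just one simple guess, not a reproduction of whatever heuristics Facebook applies.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* meta tags and the <title> text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.og = {}          # property name -> list of content values
        self.title = None     # text of the first <title> element, if any
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or attrs.get("name") or ""
            content = attrs.get("content")
            if prop.startswith("og:") and content is not None:
                # A page may declare several og:image tags, so keep lists.
                self.og.setdefault(prop, []).append(content)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self.title is None:
            self.title = data.strip()


def extract_preview(html):
    """Return preview fields, keeping og values as-is even if malformed."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()

    def first(key):
        values = parser.og.get(key)
        return values[0] if values else None

    return {
        "title": first("og:title") or parser.title,
        "description": first("og:description"),
        "url": first("og:url"),
        "images": parser.og.get("og:image", []),
    }
```

Because the values are collected without validation, a page with an unparseable og:url such as '/some-bad-value' still yields its images and description; whether and how to validate each field is left to the caller.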
