
Reliably get scraped data from the Facebook Graph API using a URL

My app needs the data that Facebook scrapes from URLs. Until now we have been getting it with:

POST /?id={object-instance-id or object-url}&scrape=true

This is described in the section on updating objects at https://developers.facebook.com/docs/sharing/opengraph/using-objects
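As a rough sketch of how that request could be assembled from Python, the snippet below builds the POST without sending it. The base URL `https://graph.facebook.com`, the helper name `build_scrape_request`, and the optional `access_token` argument are assumptions for illustration; whether a token is required depends on the API version and app setup.

```python
from urllib.parse import urlencode
from urllib.request import Request

GRAPH_ROOT = "https://graph.facebook.com"  # assumed base URL, no API version pinned


def build_scrape_request(target_url, access_token=None):
    """Build (but do not send) the POST /?id=...&scrape=true request."""
    params = {"id": target_url, "scrape": "true"}
    if access_token is not None:
        params["access_token"] = access_token
    # urlencode percent-escapes the target URL so it survives as one query value.
    return Request(f"{GRAPH_ROOT}/?{urlencode(params)}", data=b"", method="POST")


req = build_scrape_request("http://google.com")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch runs without network access.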

For example:

POST /?id=http://google.com
{
  "url": "http://www.google.com/",
  "type": "website",
  "title": "Google",
  "image": [
    {
      "url": "http://www.google.com/images/branding/googleg/1x/googleg_standard_color_128dp.png"
    }
  ],
  "description": "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.",
  "updated_time": "2015-10-06T11:34:58+0000",
  "id": "381702034999"
}

Note the image section.

Unfortunately, if the og tags are misconfigured on the server, the same request fails:

POST /?id=http://some.page.with.bad.tags.com
{
  "error": {
    "message": "Invalid parameter",
    "type": "FacebookApiException",
    "code": 100,
    "error_subcode": 1611016,
    "is_transient": false,
    "error_user_title": "Object Invalid Value",
    "error_user_msg": "Object at URL 'http://some.page.with.bad.tags' of type '' is invalid because the given value '/some-bad-value' for property 'og:url' could not be parsed as type 'url'.",
    "fbtrace_id": "abcabcabc"
  }
}

This returns nothing useful.
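For reference, a client can tell the two response shapes apart by checking for the top-level "error" key. The following is a minimal sketch; the function name `split_graph_response` is hypothetical, and the sample bodies are shortened versions of the responses shown above.

```python
import json


def split_graph_response(body):
    """Return (data, error_message); exactly one of the two is None."""
    payload = json.loads(body)
    error = payload.get("error")
    if error is None:
        return payload, None
    # Prefer the human-readable message; fall back to the terse one.
    return None, error.get("error_user_msg") or error.get("message", "unknown error")


ok_body = '{"title": "Google", "type": "website", "id": "381702034999"}'
bad_body = (
    '{"error": {"message": "Invalid parameter", "code": 100, '
    '"error_subcode": 1611016, "error_user_title": "Object Invalid Value"}}'
)

ok_data, ok_error = split_graph_response(ok_body)
bad_data, bad_error = split_graph_response(bad_body)
```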

A GET request for the URL returns this:

{
  "og_object": {
    "id": "381702034999",
    "description": "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.",
    "title": "Google",
    "type": "website",
    "updated_time": "2015-10-06T11:40:04+0000",
    "url": "http://www.google.com/"
  },
  "share": {
    "comment_count": 2,
    "share_count": 13494003
  },
  "id": "http://www.google.com"
}

The image section is missing. I cannot find any way in the documentation to retrieve a result with images other than POST /?id={url}, and that fails on any error in the og tags.

GET /{ObjectId}

returns only type and created_time

Entering the same broken link in https://developers.facebook.com/tools/debug/ produces a page that contains the image, description, title and captions for the page, which is what I need. So Facebook evidently stores them even though the page has wrong tags, and I need a way to fetch them. Unfortunately I cannot provide the broken URL because of an NDA, and I couldn't find another page with broken tags.

If the page contains invalid Open Graph markup, this seems expected. Also, do not confuse the Graph API with a data source or a scraping service that you can use to generate previews for web content.

If Facebook, for whatever reason, can't parse the Open Graph tags of a URL, it will try to make a good guess based on the content of the page (large chunks of text, images it finds, title tags, etc.) to build the preview. So you might get some data back from GET /{object-id}, but it may be a guess rather than actual og:... data.
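To illustrate the idea of falling back to a guess, here is a small sketch. It is not Facebook's actual algorithm, which is not documented in this thread; the function `build_preview` and its inputs are hypothetical. It prefers og: values and otherwise uses the page title and the first image found.

```python
def build_preview(og_tags, page_title="", image_urls=()):
    """Prefer og: values; otherwise fall back to guesses from the page itself."""
    title = og_tags.get("og:title") or page_title
    image = og_tags.get("og:image") or (image_urls[0] if image_urls else None)
    # Flag the result when any field had to be guessed rather than read from og: tags.
    guessed = not (og_tags.get("og:title") and og_tags.get("og:image"))
    return {"title": title, "image": image, "guessed": guessed}


from_tags = build_preview({"og:title": "Google", "og:image": "http://example.com/g.png"})
from_guess = build_preview({}, page_title="Some page", image_urls=["http://example.com/a.png"])
```

The "guessed" flag lets a caller treat inferred previews with less trust than ones backed by real tags.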

If you really need a more or less failsafe solution, you could build your own scraper that looks for Open Graph tags.
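A minimal sketch of such a scraper, using only Python's standard library, might look like the following. It assumes the page HTML has already been downloaded; the class and function names are illustrative. It collects og: meta tags without validating them, so a malformed value such as a relative og:url is still returned instead of causing an error.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags and the <title> text."""

    def __init__(self):
        super().__init__()
        self.og = {}  # property name -> list of content values (og:image may repeat)
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            prop = attr.get("property") or ""
            content = attr.get("content")
            if prop.startswith("og:") and content is not None:
                self.og.setdefault(prop, []).append(content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_open_graph(html):
    """Return (og_tags, page_title) for an HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.og, parser.title.strip()


sample = """<html><head><title>Example page</title>
<meta property="og:title" content="Example">
<meta property="og:url" content="/some-bad-value">
<meta property="og:image" content="http://example.com/a.png" />
</head><body></body></html>"""

og, title = extract_open_graph(sample)
```

Keeping values as lists preserves repeated properties such as multiple og:image tags; the page title is kept as a fallback when og:title is absent.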

