
How to extract images from a webpage as Facebook does?

If I post a link like this on my wall:

http://blog.bonsai.tv/news/il-nuovo-vezzo-della-lega-nord-favorire-i-lombardi-alluniversita/

then Facebook extracts the image from the post, not the first image in the webpage (so not the logo or other small images, for example).

How does Facebook do that?

Hm, that's impossible to say without more information about the algorithm they use.

However, looking at the page's source code, you can see that while the image of Bossi is not the first image in the page, it is the first one inside the divs "page_content" and "post_content". Maybe Facebook knows the HTML IDs that the blogging system (WordPress in this case) uses, and uses them to find the first image that is actually part of the page content.
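To make that guess concrete, here is a minimal sketch of the heuristic using only Python's standard library. It is not Facebook's actual algorithm (which is unknown); the ids "page_content" and "post_content" come from the example page, while the class and function names are made up for illustration.

```python
from html.parser import HTMLParser

# Ids observed on the example page; a real crawler would need its own list.
CONTENT_IDS = {"page_content", "post_content"}


class FirstContentImage(HTMLParser):
    """Remember the first <img> found inside a known content <div>."""

    def __init__(self, content_ids=CONTENT_IDS):
        super().__init__()
        self.content_ids = set(content_ids)
        self.depth = 0          # <div> nesting depth inside the content container
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1                      # nested div inside the content
            elif attrs.get("id") in self.content_ids:
                self.depth = 1                       # entering the content container
        elif tag == "img" and self.depth > 0 and self.image_url is None:
            self.image_url = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def first_content_image(html):
    """Return the src of the first image inside a content div, or None."""
    parser = FirstContentImage()
    parser.feed(html)
    return parser.image_url
```

Only div nesting is tracked, so images in a header or sidebar div outside the content container are ignored.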

That would actually be a good idea, and is essentially an implementation of the "semantic web"...

As others have said, we have no idea how Facebook decides what to choose in the absence of any relevant metadata (though Sleske's guesses seem reasonable; I'd also guess that they look at the first big image). But you can avoid the guesswork by going the correct route and simply giving Facebook (and similar services) additional metadata about your page using Open Graph Protocol tags. For example, if you want to specify a particular image to use for a Facebook like, you'd include this in your head tag:

<meta property="og:image" content="<your image URL>" />

OGP is also used by LinkedIn, Google+ and many others.
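On the consuming side, reading that tag is straightforward. The following is a rough sketch, assuming the page declares og:image with the property/content attributes shown above; the names OpenGraphImage and og_image are hypothetical, and this is not a description of what any particular service actually runs.

```python
from html.parser import HTMLParser


class OpenGraphImage(HTMLParser):
    """Collect the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are routed here as well.
        if tag != "meta" or self.image_url is not None:
            return
        attrs = dict(attrs)
        if attrs.get("property") == "og:image":
            self.image_url = attrs.get("content")


def og_image(html):
    """Return the og:image URL declared in the page, or None if absent."""
    parser = OpenGraphImage()
    parser.feed(html)
    return parser.image_url
```

A crawler would typically try this first and fall back to guessing only when it returns None.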

If you're on WordPress you can control these tags with an Open Graph plugin. Other systems can do it manually or via their own plugins.

I can imagine that the Facebook crawler can identify the actual content part of a page and select an image from it. Safari Reader does something similar. It probably helps that the software used is WordPress, which is the most popular blogging software. It's a quick win for Facebook to add specific support for this software.

My guess is that Facebook has built some algorithms for distinguishing the actual content from the other data in an HTML page. For the page you provided that's quite easy, since the HTML element that contains the page content has id="page_content", which is self-explanatory.
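For pages without such a telltale id, the "first big image" guess mentioned earlier in this thread is another option. Here is a rough sketch, assuming images declare width and height attributes in pixels; the 200-pixel threshold and all names are arbitrary choices for illustration, and a real crawler would more likely download the images and measure them.

```python
from html.parser import HTMLParser

MIN_SIDE = 200  # hypothetical threshold in pixels, chosen only for this example


def _to_int(value):
    """Parse a pixel attribute; return None for missing or non-numeric values."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


class FirstBigImage(HTMLParser):
    """Remember the first <img> whose declared width and height are both large."""

    def __init__(self, min_side=MIN_SIDE):
        super().__init__()
        self.min_side = min_side
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "img" or self.image_url is not None:
            return
        attrs = dict(attrs)
        width = _to_int(attrs.get("width"))
        height = _to_int(attrs.get("height"))
        if width is None or height is None:
            return  # size not declared; this sketch simply skips the image
        if width >= self.min_side and height >= self.min_side and attrs.get("src"):
            self.image_url = attrs["src"]


def first_big_image(html, min_side=MIN_SIDE):
    """Return the src of the first sufficiently large image, or None."""
    parser = FirstBigImage(min_side)
    parser.feed(html)
    return parser.image_url
```

Logos and icons tend to fall below the threshold, which is how this would skip "other little images" as described in the question.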
