How can I retrieve the main image of a blog post/news article?
I have made a news aggregator, Newzupp, which I want to modify. Right now I simply display the titles of the news stories and link them to their URLs.
I am planning to make it more graphical by using images plus titles instead of plain titles. I want to know how I can get the main image of each article (somewhat similar to Google News).
One way I can think of is to extract all the images on a page and display the one that points to the same article. But I do not think that will be efficient. Is there any other way of doing this?
I have found a solution to it.
I still think this is highly inefficient. I would like to know how services like Google News scrape sites and blogs and display relevant images.
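Maybe you could filter or sort the images by size or by position in the DOM hierarchy (i.e. closest to the top of the body, or immediately after the h1 tag).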
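A minimal sketch of the size/position idea, using only Python's standard library. The names `ImageCollector` and `pick_main_image` and the `min_area` threshold are assumptions made for illustration, and the sketch only sees sizes declared in `width`/`height` attributes; images sized through CSS would need to be downloaded and measured instead.

```python
from html.parser import HTMLParser


def _to_int(value):
    """Parse a width/height attribute; return 0 when missing or non-numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


class ImageCollector(HTMLParser):
    """Records each <img> with its declared area and whether an <h1> preceded it."""

    def __init__(self):
        super().__init__()
        self.seen_h1 = False
        self.images = []  # (src, area, after_h1), in document order

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.seen_h1 = True
        elif tag == "img":
            a = dict(attrs)
            if a.get("src"):
                area = _to_int(a.get("width")) * _to_int(a.get("height"))
                self.images.append((a["src"], area, self.seen_h1))


def pick_main_image(html, min_area=200 * 200):
    """Hypothetical heuristic: first large image after the headline, else the largest."""
    collector = ImageCollector()
    collector.feed(html)
    # Prefer the first sufficiently large image that appears after the <h1>.
    for src, area, after_h1 in collector.images:
        if after_h1 and area >= min_area:
            return src
    # Otherwise fall back to the image with the largest declared area, if any.
    if collector.images:
        return max(collector.images, key=lambda item: item[1])[0]
    return None
```

The threshold is a guess; small icons and tracking pixels usually fall well below it, but it would need tuning against real pages.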
What about a blacklist of advert hosts, whose images you would ignore?
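A blacklist check could look roughly like the following sketch. The host names in `AD_HOST_BLACKLIST` are placeholders invented for illustration, as are the function names; a real list would have to be curated and maintained.

```python
from urllib.parse import urlparse

# Hypothetical entries for illustration only; not real advert hosts.
AD_HOST_BLACKLIST = {"ads.example.com", "adserver.example.net"}


def is_blacklisted(image_url, blacklist=AD_HOST_BLACKLIST):
    """Return True when the image is served from a blacklisted host or its subdomain."""
    host = (urlparse(image_url).hostname or "").lower()
    return any(host == bad or host.endswith("." + bad) for bad in blacklist)


def drop_ad_images(image_urls, blacklist=AD_HOST_BLACKLIST):
    """Keep only the image URLs that do not come from a blacklisted host."""
    return [url for url in image_urls if not is_blacklisted(url, blacklist)]
```

Relative URLs have no host, so this sketch keeps them; they are served by the article's own site.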
Since, generally speaking, adverts are hosted elsewhere while story-related images are hosted on the same domain, perhaps you could filter the page for images that have the same base URL as the site itself.
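One way to sketch the same-domain filter is shown below. `same_site_images` is a hypothetical helper, and treating a leading `www.` as optional is an assumption of this sketch rather than a rule.

```python
from urllib.parse import urljoin, urlparse


def same_site_images(page_url, image_srcs):
    """Return absolute URLs of images hosted on the page's own domain or a subdomain."""
    base = (urlparse(page_url).hostname or "").lower()
    if base.startswith("www."):
        base = base[4:]  # treat www.example.com and example.com as the same site
    if not base:
        return []
    kept = []
    for src in image_srcs:
        absolute = urljoin(page_url, src)  # resolve relative paths against the page
        host = (urlparse(absolute).hostname or "").lower()
        if host == base or host.endswith("." + base):
            kept.append(absolute)
    return kept
```

This would wrongly discard story images served from a separate CDN domain, so it is best treated as one signal among several rather than a hard rule.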
Why not just convert all the scraped images (using hpricot/nokogiri) to square thumbnails (using rmagick or something like it, or just resizing them on the server side) and group those images in one DIV just below the topic body? You can then use a lightbox with a slideshow to show the actual images only when the user clicks on them. That way it looks more graphical and still does not spoil the look of your site. Finding the most relevant image is tricky.
You could also try to search for OpenGraph meta tags on the pages. Most news sites use the og:image property to specify the main image of an article.
Example:
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
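Reading that tag could be sketched as follows with Python's standard `html.parser` module. `OgImageParser` and `find_og_image` are names made up for this illustration, and the sketch assumes the page HTML has already been fetched into a string.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Collects the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here by default.
        if tag != "meta" or self.og_image is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:image" and attr_map.get("content"):
            self.og_image = attr_map["content"]


def find_og_image(html):
    """Return the og:image URL declared in the HTML, or None if there is none."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image
```

When a page has no og:image tag the function returns None, so a caller could fall back to one of the other heuristics in this thread.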