
How can I retrieve the main image of a blog post/news article?

I have made a news aggregator, Newzupp, which I want to modify. Right now I simply display the titles of the news stories and link them to their URLs.

I am planning to make it more graphical by using images plus titles instead of plain titles. I want to know how I can get the main image of each article (somewhat similar to Google News).

One way I can think of is to strip out all the images and display the one which links to the same article. But I do not think that will be efficient. Is there any other way of doing this?


I have found a solution to it.

  1. Fetch the contents of the URL [HTML/XML]
  2. Scrape the content using hpricot
  3. Find all elements with the tag "img"
  4. Work out, per site, which of them is the main display image [e.g. the 6th image in the case of Wired.com's RSS feed]
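Steps 2 and 3 can be sketched without hpricot. The following is a minimal illustration in Python using only the standard library, not the asker's actual code; the names `ImageCollector` and `collect_images` and the sample URLs are made up for the example, and the fetch in step 1 is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the src of every <img> tag, in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.images.append(urljoin(self.base_url, src))


def collect_images(html, base_url):
    """Return absolute URLs of all images found in the given HTML."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    return parser.images


# Hypothetical page content; in practice it would come from an HTTP fetch.
sample = (
    '<h1>Title</h1><img src="/a.jpg">'
    '<p>text</p><img src="http://x.example/b.png">'
)
print(collect_images(sample, "http://news.example/story/1"))
```

Step 4 then reduces to picking one entry from the returned list, which is the site-specific part.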

I still think this is highly inefficient. I would like to know how services like Google News scrape sites/blogs and display relevant images.

Maybe you could filter/sort by image size or by position in the DOM hierarchy (i.e. closest to the top of the body, or immediately after the h1 tag).
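One possible way to express that ranking, as a rough sketch: prefer images that appear after the first h1, then the largest declared area, then the earliest position. It assumes the page declares `width` and `height` attributes on its images, which many pages do not; otherwise the images would have to be downloaded and measured. `SizedImageCollector` and `best_image` are hypothetical names.

```python
from html.parser import HTMLParser


def _to_int(value):
    """Parse a pixel attribute, treating missing or odd values as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


class SizedImageCollector(HTMLParser):
    """Records each image with its declared area and whether an h1 preceded it."""

    def __init__(self):
        super().__init__()
        self.seen_h1 = False
        self.candidates = []  # tuples of (src, area, after_h1, position)

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.seen_h1 = True
        elif tag == "img":
            attr = dict(attrs)
            src = attr.get("src")
            if not src:
                return
            area = _to_int(attr.get("width")) * _to_int(attr.get("height"))
            position = len(self.candidates)
            self.candidates.append((src, area, self.seen_h1, position))


def best_image(html):
    """Return the src of the highest-ranked image, or None if there is none."""
    parser = SizedImageCollector()
    parser.feed(html)
    if not parser.candidates:
        return None
    # Sort key: after-h1 first, then larger area, then earlier in the page.
    ranked = sorted(parser.candidates, key=lambda c: (not c[2], -c[1], c[3]))
    return ranked[0][0]


page = (
    '<img src="logo.png" width="400" height="400">'
    '<h1>Story</h1>'
    '<img src="icon.gif" width="16" height="16">'
    '<img src="hero.jpg" width="600" height="300">'
)
print(best_image(page))
```

The ordering of the three criteria is a guess; a real aggregator would probably tune it per source.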

What about a blacklist of advert hosts whose images you would ignore?
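A blacklist check could look roughly like this. The host names in `AD_HOSTS` are placeholders, not a real list of ad servers, and `is_blacklisted` and `drop_adverts` are names invented for the sketch.

```python
from urllib.parse import urlparse

# Placeholder entries; a real list would have to be maintained by hand
# or taken from an existing ad-blocking list.
AD_HOSTS = {"ads.example.com", "adserver.example.net"}


def is_blacklisted(url, blacklist=AD_HOSTS):
    """True if the URL's host is a blacklisted host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == bad or host.endswith("." + bad) for bad in blacklist)


def drop_adverts(urls, blacklist=AD_HOSTS):
    """Keep only the image URLs that are not served from a blacklisted host."""
    return [u for u in urls if not is_blacklisted(u, blacklist)]


urls = [
    "http://news.example/a.jpg",
    "http://ads.example.com/banner.gif",
    "http://cdn.adserver.example.net/x.png",
]
print(drop_adverts(urls))
```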

Since, generally speaking, adverts are hosted elsewhere while story-related images are hosted within the same domain, perhaps you could filter the page for those images that have the same base URL as the site itself.
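As a sketch of that filter: resolve each image `src` against the page URL and keep only those whose host matches the page's host or a subdomain of it. Stripping a leading `www.` is a crude heuristic added here so that hosts such as `img.news.example` still count as the same site; images served from an unrelated CDN domain would be dropped. `same_site_images` is a hypothetical name.

```python
from urllib.parse import urljoin, urlparse


def same_site_images(page_url, image_srcs):
    """Return absolute image URLs hosted on the page's own domain."""
    page_host = (urlparse(page_url).hostname or "").lower()
    if page_host.startswith("www."):
        page_host = page_host[4:]  # crude normalisation, see note above

    kept = []
    for src in image_srcs:
        absolute = urljoin(page_url, src)  # handles relative paths
        host = (urlparse(absolute).hostname or "").lower()
        if host == page_host or host.endswith("." + page_host):
            kept.append(absolute)
    return kept


srcs = [
    "/img/a.jpg",
    "http://img.news.example/b.jpg",
    "http://ads.example.com/c.gif",
]
print(same_site_images("http://www.news.example/story", srcs))
```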

Why not just convert all the scraped images (using hpricot/nokogiri) to square thumbnails (using rmagick or something like it, or just resizing them on the server side) and group them in one DIV just below the topic body? You can then use a lightbox with a slideshow to show the actual images only when the user clicks on them. That way the page looks more graphical without spoiling the look of your site. Finding the most relevant image is tricky.

You could also try to search for OpenGraph meta tags on the pages. Most news sites use the og:image property to specify the main image of an article.

Example:

<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
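Reading that tag back out of a page might look like the following. This is a minimal sketch using Python's standard `html.parser`; `OpenGraphImageParser` and `find_og_image` are names chosen for the example, and it returns only the first `og:image` it finds.

```python
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.og_image is not None:
            return
        attr = dict(attrs)
        prop = (attr.get("property") or "").lower()
        if prop == "og:image" and attr.get("content"):
            self.og_image = attr["content"]


def find_og_image(html):
    """Return the og:image URL declared in the HTML, or None if absent."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.og_image


head = (
    "<html><head><title>Rock</title>"
    '<meta property="og:image" '
    'content="http://ia.media-imdb.com/images/rock.jpg" />'
    "</head><body></body></html>"
)
print(find_og_image(head))
```

When the function returns `None`, the aggregator could fall back to one of the image-scraping heuristics discussed earlier in the thread.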

