
Fetching the correct news image - Java

I am trying to make a small news crawler. I got everything working after many tries.

The problem is that nearly every HTML news page has more than 50 images.

Many of them are too small, so I filter them simply by checking size. Only images larger than 200x200 are taken.

But a single page can still contain many large images, and some news articles do not have any related image at all.

Let's take an example. Link to the news article: http://timesofindia.indiatimes.com/india/Over-9-3-lakh-TB-patients-in-India-undetected-Report/articleshow/24600851.cms

My code got this image:

Image no. 0 http://timesofindia.indiatimes.com/photo/10905539.cms
Image height - 300
Image width - 450

But this image is irrelevant to the article's topic. In simple words: "How do I get the correct image dynamically?"

I do not want to write code for each website. A blank image is better than a wrong image.

Consider the alt text. The alt text usually contains either the complete title or some words from the title.
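The alt-text idea could be sketched roughly as below, using only the standard library. The class name `AltTextMatcher`, the three-character minimum word length, and the 0.5 threshold are all assumptions made for illustration, and the sample title is reconstructed from the URL slug in the question; a real crawler would need to tune these against actual pages.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: scores an image by how many title words appear in its alt text.
class AltTextMatcher {

    // Split text into lowercase words, dropping very short words such as "in" or "of".
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        if (text == null) {
            return result;
        }
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (w.length() >= 3) {
                result.add(w);
            }
        }
        return result;
    }

    // Fraction of the title's words that also appear in the alt text (0.0 to 1.0).
    static double overlap(String title, String alt) {
        Set<String> titleWords = words(title);
        if (titleWords.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(titleWords);
        common.retainAll(words(alt));
        return (double) common.size() / titleWords.size();
    }

    // Assumed threshold: accept the image only if at least half of the title words match.
    static boolean isRelevant(String title, String alt) {
        return overlap(title, alt) >= 0.5;
    }

    public static void main(String[] args) {
        // Title reconstructed from the URL slug in the question (an assumption).
        String title = "Over 9.3 lakh TB patients in India undetected: Report";
        System.out.println(isRelevant(title, "TB patients in India undetected"));
        System.out.println(isRelevant(title, "Advertisement banner"));
    }
}
```

If no image on the page passes the check, the crawler can simply keep no image, which matches the "blank is better than wrong" preference in the question.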

Also, the example article does not have any relevant image associated with its title.

I also suggest jsoup:

jsoup: Java HTML Parser

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe white-list, to prevent XSS attacks
- output tidy HTML

I would recommend an approach where you judge an image's relevance by its position. If an image appears inside the article body, it is probably an image about the article itself (except for ads, which are very wide).
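The "ads are very wide" exception can be combined with the 200x200 size check from the question. The sketch below covers only that dimension test; checking whether an image sits inside the article body would additionally need an HTML parser. The class name `ImageSizeFilter` and the maximum aspect ratio of 3.0 are assumptions chosen for illustration, not values from the thread.

```java
// Hypothetical helper: rejects images that are too small or shaped like a banner ad.
class ImageSizeFilter {

    // From the question: only images larger than 200x200 are considered.
    static final int MIN_SIDE = 200;

    // Assumption: anything more than 3 times longer than it is tall (or the reverse)
    // is treated as a banner or skyscraper ad.
    static final double MAX_ASPECT_RATIO = 3.0;

    static boolean looksLikeArticleImage(int width, int height) {
        if (width <= MIN_SIDE || height <= MIN_SIDE) {
            return false; // too small to be the main article image
        }
        double ratio = (double) Math.max(width, height) / Math.min(width, height);
        return ratio <= MAX_ASPECT_RATIO;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeArticleImage(450, 300)); // the image from the question
        System.out.println(looksLikeArticleImage(970, 250)); // a wide banner shape
    }
}
```

Note that the 450x300 image from the question passes this test, so a dimension check alone would not have rejected it; it is only useful as one signal among several.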

You can find out the source of the image and decide whether it should interest you. For instance, ad images usually come from a different server that doesn't belong to the site (in your case, indiatimes.com).
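A minimal sketch of that source check, assuming the image URL has already been resolved to an absolute URL. The class name `ImageSourceFilter` and the rule "same domain or a subdomain of it" are assumptions made for illustration; some sites serve their own images from a separate CDN domain, which this check would wrongly reject.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: keeps only images served from the news site's own domain.
class ImageSourceFilter {

    // Returns true if the image host equals siteDomain or is a subdomain of it.
    static boolean isFromSite(String imageUrl, String siteDomain) {
        String host;
        try {
            host = URI.create(imageUrl).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (host == null) {
            return false; // relative URL or no host component
        }
        host = host.toLowerCase(Locale.ROOT);
        String domain = siteDomain.toLowerCase(Locale.ROOT);
        return host.equals(domain) || host.endsWith("." + domain);
    }

    public static void main(String[] args) {
        String site = "indiatimes.com";
        System.out.println(isFromSite("http://timesofindia.indiatimes.com/photo/10905539.cms", site));
        System.out.println(isFromSite("http://ads.example.net/banner.png", site));
    }
}
```

The unrelated image in the question was itself served from timesofindia.indiatimes.com, so this check filters out third-party ads but cannot by itself tell a relevant on-site image from an irrelevant one.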


 