jsoup: How to search for date text from a webpage

This is what I am trying to do, using jsoup:

    1. Pass a single URL to parse.
    2. Search for dates mentioned in the content of the web page.
    3. Extract at least one date from each page's content.
    4. Convert that date into a standard format.

For point #1, this is what I have now:

String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
Document document = Jsoup.connect(url).get();

What exactly is a "Document" here? Is it the already-parsed HTML of the page, or something else?

For point #2, this is what I have now:

Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = document.getElementsMatchingOwnText(p);

Here I am trying to match a date regex against the page to find dates and store them in a string for later use (point #3), but I am sure I am nowhere near it and need help.

I have already done point #4.

Can anyone help me understand this and point me in the right direction to achieve the four points above?

Thanks in advance!

Update: here is what I want:

public static void main(String[] args){
    try {
        // Send a browser User-Agent so the server treats the request as coming from a browser, not a bot
        final String USER_AGENT =
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";

        // My only one url which I want to parse
        String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";

        // Creating a jsoup.Connection to connect the url with USER AGENT
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);

        // retrieving the parsed document
        Document htmlDocument = connection.get();

        /* At this point I have a parsed document of the page. Is it in plain-text format?
         * If not, in what type or format is it stored in the variable 'htmlDocument'?
         */

        /* If 'htmlDocument' holds the text of the web page, why do I need elements to find dates?
         * Dates can be ordinary text anywhere in a page, so how would I find an element tag for them?
         * For example, to collect the text of the <p> paragraph tags I would use this:
         */
        // I am not sure whether this is correct
        //***************************************************/
        Elements paragraph = htmlDocument.getElementsByTag("p");
        for(Element src: paragraph){
            // text() returns the paragraph's text; attr("abs:p") would look up a non-existent attribute
            System.out.println("text: " + src.text());
        }
       //***************************************************//

        /* But I do not want to look up elements to gather the dates on the page.
         * I just want to search the whole text of the document for dates.
         * So I need a date regex that is passed as input to a search method,
         * and the search should run over the text of the parsed document in 'htmlDocument'.
         */

        // At the end we will use only one date from our search result and format it in a standard form

        /*
         * That is it.
         */


        /*
         * I was trying something like this
         */
        //final Elements elements = document.getElementsMatchingOwnText("\\d{4}-\\d{2}-\\d{2}");
        Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Elements elements = htmlDocument.getElementsMatchingOwnText(p);

        for(Element e: elements){
            System.out.println("element = [" + e + "]");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Here is one possible solution I found:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Created by ruben.alfarodiaz on 21/12/2016.
 */
@RunWith(JUnit4.class)
public class StackTest {

    @Test
    public void findDates() {
        final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        try {
            String url = "http://stackoverflow.com/questions/51224/regular-expression-to-match-valid-dates";
            Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
            Document htmlDocument = connection.get();
            // This pattern finds dates in dd/mm/yyyy format; each additional format needs its own pattern
            Pattern pattern = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

            // Find every element whose text, including that of its descendants, matches the pattern
            Elements elements = htmlDocument.getElementsMatchingText(pattern);
            // Keep only the leaf elements, i.e. those with no descendant that also matches
            List<Element> finalElements = elements.stream().filter(elem -> isLastElem(elem, pattern)).collect(Collectors.toList());
            finalElements.forEach(elem ->
                System.out.println("Node: " + elem.html())
            );

        }catch(Exception ex){
            ex.printStackTrace();
        }
    }

    // Returns true if the element is a leaf match: the element itself matches but none of its descendants do
    private boolean isLastElem(Element elem, Pattern pattern) {
        return elem.getElementsMatchingText(pattern).size() <= 1;
    }

}

Add as many patterns as you need, because I think it would be difficult to find a single pattern that matches every possible date format.
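
As a rough sketch of the multi-pattern idea, assuming the page text has already been extracted into a plain string (with jsoup that could be htmlDocument.text()), each regex can be paired with a formatter that converts its matches to ISO yyyy-MM-dd. The class name DateExtractor, the helper DateFormat, the method findIsoDates, and the two formats chosen are illustrative assumptions, not part of the original answer.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pairs each date regex with the formatter that parses its matches.
final class DateExtractor {

    private static final class DateFormat {
        final Pattern pattern;
        final DateTimeFormatter formatter;

        DateFormat(String regex, DateTimeFormatter formatter) {
            this.pattern = Pattern.compile(regex);
            this.formatter = formatter;
        }
    }

    private static final List<DateFormat> FORMATS = Arrays.asList(
            // dd/mm/yyyy: the regex from the answer, with word boundaries added
            new DateFormat("\\b(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)\\b",
                    DateTimeFormatter.ofPattern("d/M/uuuu").withResolverStyle(ResolverStyle.STRICT)),
            // yyyy-mm-dd
            new DateFormat("\\b(19|20)\\d\\d-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])\\b",
                    DateTimeFormatter.ISO_LOCAL_DATE));

    // Returns every valid date found in the text as an ISO-8601 string, grouped by format.
    static List<String> findIsoDates(String text) {
        List<String> result = new ArrayList<>();
        for (DateFormat format : FORMATS) {
            Matcher matcher = format.pattern.matcher(text);
            while (matcher.find()) {
                try {
                    LocalDate date = LocalDate.parse(matcher.group(), format.formatter);
                    result.add(date.toString()); // LocalDate.toString() yields yyyy-MM-dd
                } catch (DateTimeParseException ignored) {
                    // The regex matched, but it is not a real calendar day (e.g. 31/02/2017)
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findIsoDates("Posted on 20/11/2017, edited 2016-12-21."));
    }
}
```

Supporting another format would then mean adding one more entry to FORMATS. The regex only locates candidates; the strict formatter decides whether a candidate is a real date.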

Edit: The most important point is that the library gives you a hierarchy of elements, so you need to iterate over them to find the final leaf. For instance:

<html>
    <body>
        <div>
           20/11/2017    
        </div>
    </body>
</html>

If we search for the pattern dd/mm/yyyy, the library returns the whole chain of matching elements (html, body, and div), but we are only interested in div.
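
A possible alternative to filtering for leaves, sketched here under the assumption that the date sits directly in an element's own text, is jsoup's getElementsMatchingOwnText, which tests each element's own text and ignores the text of its children. The class name OwnTextDateFinder and the method tagsWithDate are hypothetical names used only for illustration. The block needs jsoup on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper illustrating own-text matching on the small HTML sample.
final class OwnTextDateFinder {

    // The dd/mm/yyyy regex used in the answer
    static final Pattern DATE =
            Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

    // Returns the tag names of elements whose own text (children excluded) contains a date.
    static List<String> tagsWithDate(String html) {
        Document document = Jsoup.parse(html);
        List<String> tags = new ArrayList<>();
        for (Element element : document.getElementsMatchingOwnText(DATE)) {
            tags.add(element.tagName());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<html><body><div>20/11/2017</div></body></html>";
        System.out.println(tagsWithDate(html));
    }
}
```

With this sample, html and body have no text of their own, so only div should be reported. If a date is split across child elements (for example part of it inside a span), own-text matching would miss it and the leaf filter from the answer remains the safer choice.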
