
jsoup: How to search for date text from a webpage

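This is what I want to do (I want to use jsoup):

    1. Pass in a single URL to parse
    2. Search the page content for any dates it mentions
    3. Extract at least one date from each page's content
    4. Convert that date to a standard format

So, for point 1, this is what I have so far: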

String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
Document document = Jsoup.connect(url).get();

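Now I would like to know what format `document` is in: has it already been parsed from HTML (or whatever type the web page is), or is it something else?

Then, for point 2, this is what I have so far: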

Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = document.getElementsMatchingOwnText(p);

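Here I am trying to match a date regex to search for dates in the page and store one in a string for later use (point 3), but I am sure I am nowhere near a solution and need help.

I have already finished point 4.

So could someone who can help me understand please point me in the right direction on how to achieve the four points above?

Thanks in advance!

Update: this is how I would like it to work: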

public static void main(String[] args){
    try {
        // using USER AGENT for giving information to the server that I am a browser not a bot
        final String USER_AGENT =
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";

        // My only one url which I want to parse
        String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";

        // Creating a jsoup.Connection to connect the url with USER AGENT
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);

        // retrieving the parsed document
        Document htmlDocument = connection.get();

        /* Now till this part, I have A parsed document of the url page which is in plain-text format right?
         * If not, in which type or in which format it is stored in the variable 'htmlDocument'
         * */

        /* Now, If 'htmlDocument' holds the text format of the web page
         * Why do i need elements to find dates, because dates can be normal text in a web page,
         * So, how I am going to find an element tag for that?
         * As an example, If i wanted to collect text from <p> paragraph tag, 
         * I would use this : 
         */
        // I am not sure is it correct or not
        //***************************************************/
        Elements paragraph = htmlDocument.getElementsByTag("p");
        for(Element src: paragraph){
            System.out.println("text: " + src.text());
        }
       //***************************************************//

        /* But I do not want any elements to find to gather dates on the page
         * I just want to search the whole text document for date
         * So, I need a regex formatted date string which will be passed as a input for a search method
         * this search mechanism should be on text formatted page as we have parsed document in 'htmlDocument'
         */

        // At the end we will use only one date from our search result and format it in a standard form

        /*
         * That is it.
         */


        /*
         * I was trying something like this
         */
        //final Elements elements = document.getElementsMatchingOwnText("\\d{4}-\\d{2}-\\d{2}");
        Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Elements elements = htmlDocument.getElementsMatchingOwnText(p);

        for(Element e: elements){
            System.out.println("element = [" + e + "]");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Here is one possible solution I found:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Created by ruben.alfarodiaz on 21/12/2016.
 */
@RunWith(JUnit4.class)
public class StackTest {

    @Test
    public void findDates() {
        final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        try {
            String url = "http://stackoverflow.com/questions/51224/regular-expression-to-match-valid-dates";
            Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
            Document htmlDocument = connection.get();
            // This pattern finds dates written as dd/mm/yyyy; covering other formats requires additional patterns
            Pattern pattern = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

            // Find every element whose text (including the text of its descendants) matches the pattern
            Elements elements = htmlDocument.getElementsMatchingText(pattern);
            // Keep only the leaf elements, i.e. those with no matching descendant
            List<Element> finalElements = elements.stream().filter(elem -> isLastElem(elem, pattern)).collect(Collectors.toList());
            finalElements.forEach(elem ->
                System.out.println("Node: " + elem.html())
            );

        }catch(Exception ex){
            ex.printStackTrace();
        }
    }

    // Decides whether the element is a leaf: the match list then holds only the element itself, so no descendant also contains a date
    private boolean isLastElem(Element elem, Pattern pattern) {
        return elem.getElementsMatchingText(pattern).size() <= 1;
    }

}

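More patterns should be added here as needed, since I think it would be complicated to find a single pattern that matches every possible format.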
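
The idea of keeping several patterns can be sketched roughly as follows, combined with the conversion to a standard format from point 4. This is only an illustration under assumptions: the class name `MultiFormatDates` and the method `firstDateAsIso` are hypothetical, the input is plain text that has already been pulled out of the page, slash dates are assumed to be day-first (as in the dd/mm/yyyy pattern above), and ISO `yyyy-MM-dd` is assumed to be the target format.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one regex per supported date format, each paired with
// the formatter that can parse what the regex matches.
class MultiFormatDates {
    private static final Map<Pattern, DateTimeFormatter> FORMATS = new LinkedHashMap<>();
    static {
        // yyyy-MM-dd
        FORMATS.put(Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b"),
                DateTimeFormatter.ISO_LOCAL_DATE);
        // d/M/yyyy (assumption: day comes before month)
        FORMATS.put(Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b"),
                DateTimeFormatter.ofPattern("d/M/yyyy"));
    }

    // Tries the patterns in insertion order and returns the first match that
    // parses as a real date, rendered as yyyy-MM-dd. Note that the order is by
    // pattern, not by position in the text.
    static Optional<String> firstDateAsIso(String text) {
        for (Map.Entry<Pattern, DateTimeFormatter> entry : FORMATS.entrySet()) {
            Matcher matcher = entry.getKey().matcher(text);
            while (matcher.find()) {
                try {
                    LocalDate date = LocalDate.parse(matcher.group(), entry.getValue());
                    return Optional.of(date.toString());
                } catch (DateTimeParseException ignored) {
                    // Looked like a date but could not be parsed; keep searching.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstDateAsIso("Posted on 20/11/2017 by someone"));
    }
}
```

Supporting another format would then mean adding one more regex/formatter pair to the map rather than growing a single large regex.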

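Edit: the most important point is that the library gives you a hierarchy of elements, so you need to work down through it to find the final leaves. For example: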

<html>
    <body>
        <div>
           20/11/2017    
        </div>
    </body>
</html>

If we search for the pattern dd/mm/yyyy, the library returns three elements (html, body and div), but we are only interested in the div.
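
If only the date strings are needed, and not the elements that contain them, the hierarchy issue might be sidestepped by running the regex over the page's plain text instead. This is a different technique from the element filtering above. A minimal sketch, assuming the text has already been obtained (with jsoup that would typically be `htmlDocument.text()`); the class name `DateTextSearch` and the method `findDates` are hypothetical, and the regex is the yyyy-MM-dd shape from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that searches plain text for yyyy-MM-dd style dates.
class DateTextSearch {
    // Word boundaries keep the pattern from matching inside longer digit runs.
    private static final Pattern ISO_DATE =
            Pattern.compile("\\b\\d{4}-[01]\\d-[0-3]\\d\\b");

    // Returns every match in the order it appears in the text.
    static List<String> findDates(String pageText) {
        List<String> dates = new ArrayList<>();
        Matcher matcher = ISO_DATE.matcher(pageText);
        while (matcher.find()) {
            dates.add(matcher.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        // Stand-in for the text of a fetched page.
        String pageText = "asked 2015-01-26 10:15, edited 2016-12-21";
        System.out.println(findDates(pageText));
    }
}
```

The first entry of the returned list could then be passed on to whatever conversion step is used for point 4.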

