
HTML parsing with grep and regex

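I am writing a shell script that takes a mountain (only those over 8000 m) as an argument and returns the names of the first people to climb it. I found a page that I can download with curl and parse for this information, but I don't really know my way around regex. Given HTML like the sample below and a mountain name, how can I get the climbers? Thanks in advance.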

Website: http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/

HTML sample

    <p class="wp-caption-text">Everest</p></div></div></div><p><strong>Other names: </strong>Sagamartha, Chomolangma or Qomolangma<br
/> <strong>Altitude:</strong> 8848 m<br
/> <strong>Location: </strong>Tibet / Nepal<br
/> <strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br
/> <strong>Expedition</strong><strong>: </strong>New Zeeland/India</p><blockquote><p>&nbsp;</p><p><strong>Difficulty</strong> : <em>Mostly a non-technical climb regardless on which of the two normal routes you choose. On the south you have to deal with a dangerous ice fall and The Hillary Step, a short section of rock, on the north side there are some short technical passages. On both routes (permanent) fixed ropes are placed at the tricky sections. The altitude is main obstacle. Nowadays also crowding is mentioned as a factor of difficulty</em>.</p>

I found another site that may be simpler: http://www.alpineascents.com/8000m-peaks.asp

HTML sample

<tr>
         <td><strong>Everest</strong></td>
         <td>8,850m <br /></td>
         <td>29,035ft</td>
         <td><div align="center">Nepal/Tibet </div></td>
         <td>1953; Sir E. Hillary, T. Norgay</td>
       </tr>

Using the first HTML sample:

grep '<strong>First ascent:</strong>' | sed 's/.*by \([^>]*\)<.*/\1/'

Output:

Sir Edmund Hillary and Tenzing Norgay
Achille Compagnoni and Lino Lacedelli
George Band and Joe Brown
Kurt Diemberger, Peter Diener, Nawang Dorje, Nima Dorje, Ernst Forrer and Albin Schelbert
Hermann Buhl
Maurice Herzog and Louis Lachenal
Andrew Kauffman and Peter Schoening
Hermann Buhl, Kurt Diemberger, Marcus Schmuck and Fritz Wintersteller

It finds every line containing the "First ascent" label and captures everything between "by" and the <br /> tag.
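To see the capture in isolation, the same pipeline can be run on the single "First ascent" line from the sample in the question. This is a minimal sketch; the function name extract_climbers is made up for illustration and is not part of the original answer.

```shell
#!/bin/sh
# extract_climbers: hypothetical helper name, used only for this sketch.
# Reads HTML on stdin. For each line carrying the "First ascent" label it
# prints the text after the last "by " up to the next "<".
extract_climbers() {
    grep '<strong>First ascent:</strong>' | sed 's/.*by \([^>]*\)<.*/\1/'
}

# The "First ascent" line copied from the HTML sample in the question.
printf '%s\n' '/> <strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br' \
    | extract_climbers
# prints: Sir Edmund Hillary and Tenzing Norgay
```

The leading .* in the sed expression is greedy, so if "by " occurs more than once on a line, only the text after the last occurrence is kept.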

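Edit:

The original answer did not filter by mountain name. Also, <strong>First ascent:</strong> was too specific for the page (sometimes there is a space after the colon). The following should work.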

grep -i "$1" -A3 | grep 'First ascent:' | sed 's/.*by \([^>]*\)<.*/\1/'

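Explanation: grep -i "$1" -A3 selects the line containing the mountain name. -i makes the search case-insensitive. -A3 also prints the 3 lines after each match, which include the line listing the climbers. The quotes around "$1" are there for mountains whose names contain spaces.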
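Put together as a function, the filter might look like the sketch below. The function name first_ascent and the sample variable are assumptions made for illustration; the sample text is the first four lines of the HTML sample from the question. The grep options are written before the pattern, which is equivalent to the command above, and -A is an extension found in GNU and BSD grep rather than a POSIX option.

```shell
#!/bin/sh
# first_ascent MOUNTAIN: hypothetical wrapper around the pipeline above.
# Reads the page HTML on stdin and prints the first-ascent climbers.
first_ascent() {
    grep -i -A3 -- "$1" | grep 'First ascent:' | sed 's/.*by \([^>]*\)<.*/\1/'
}

# First four lines of the HTML sample from the question, kept in a variable
# so the sketch runs without network access.
sample='<p class="wp-caption-text">Everest</p></div></div></div><p><strong>Other names: </strong>Sagamartha, Chomolangma or Qomolangma<br
/> <strong>Altitude:</strong> 8848 m<br
/> <strong>Location: </strong>Tibet / Nepal<br
/> <strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br'

printf '%s\n' "$sample" | first_ascent everest
# prints: Sir Edmund Hillary and Tenzing Norgay
```

In a real script the downloaded page would be piped in instead of the sample, for example curl -s followed by the page URL, piped into first_ascent "$1". If the mountain name also appears elsewhere on the page within three lines of another "First ascent" line, more than one result may be printed.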

You can use my Xidel, which does pattern matching on the HTML tree:

xidel http://www.alpineascents.com/8000m-peaks.asp -e "<tr><strong>Everest</strong><td/>{3}<td>{.}</td></tr>"

Only 109 characters...

(If Everest is passed to the script as an argument, replace it with $1.)
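Inside a script, that substitution could be sketched as follows. Both function names are hypothetical; the pattern text and the xidel invocation are copied from the command above with "Everest" replaced by the first argument. The pattern is kept in double quotes so that the shell expands $1.

```shell
#!/bin/sh
# xidel_pattern MOUNTAIN: hypothetical helper that prints the Xidel pattern
# from the answer with the mountain name filled in.
xidel_pattern() {
    printf '%s\n' "<tr><strong>$1</strong><td/>{3}<td>{.}</td></tr>"
}

# first_ascent_xidel MOUNTAIN: hypothetical wrapper that runs the command
# from the answer. It needs xidel installed and network access, so it is
# only defined here and not called.
first_ascent_xidel() {
    xidel http://www.alpineascents.com/8000m-peaks.asp -e "$(xidel_pattern "$1")"
}
```

A script would then end with a call such as first_ascent_xidel "$1".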

Or, for the other website:

xidel http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/ -e "<p class=\"wp-caption-text\">Everest</p><strong>First ascent:</strong>{text()}"

First, going off the first page in the question. Here is a Java scraper for the file that curl downloads:

import java.util.Scanner;
import java.io.*;

public class PageInfo {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner scan = new Scanner(new File(args[0]));  // the file you downloaded
        PrintWriter output = new PrintWriter("climbers.txt");
        while (scan.hasNextLine()) {
            String s = scan.nextLine();
            if (s.contains("wp-caption-text\">")) {
                // Mountain name: the text between the caption tag and </p>
                s = s.split("wp-caption-text\">")[1];
                if (s.length() > 1) output.println(s.split("</p>")[0]);
            } else if (s.contains("First ascent:")) {
                // Climbers: the text between "by " and the trailing <br
                s = s.split("by ")[1];
                output.println(s.split("<br")[0]);
            }
        }
        scan.close();
        output.close();
    }
}
