Extracting certain text from an HTML file

Original post: 2013-05-18 14:01:35 9 2 java/ html/ parsing

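I want to extract text from an HTML file: the text placed between paragraph (p) and link (a href) tags. I want to do it without Java regex and without an HTML parser. I thought of this: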

while ((word = reader.readLine()) !=null) { //iterate to the end of the file
    if(word.contains("<p>")) { //catching p tag
        while(!word.contains("</p>")) { //iterate to the end of that tag
            try { //start writing
                out.write(word);
            } catch (IOException e) {
            }
        }
    }
}

But it doesn't work, even though the code looks valid to me. How can the reader catch the "p" and "a href" tags?

2 answers

The problem arises when you have something like <p>blah</p> on a single line. A simple solution is to change every < to \n<, something like:

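boolean insidePar = false;
String line;
while ((line = reader.readLine()) != null) {
    // Break the line before every '<' so each tag starts its own chunk.
    // String.replace is a literal replacement, so no regex is involved.
    for (String word : line.replace("<", "\n<").split("\n")) {
        if (word.contains("<p>")) {
            insidePar = true;
        } else if (word.contains("</p>")) {
            insidePar = false;
        }
        if (insidePar) {
            // write the word
        }
    }
}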

I would also recommend using a parser library, as @HovercraftFullOfEels said.

EDIT: I have updated the code so it is a bit closer to a working version, but there will probably be more problems along the way.
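
For reference, here is a minimal sketch of the same idea carried a little further, using only String.indexOf (no regex, no parser). It is not the answerer's code: the class name TagTextExtractor and the helpers extract and stripTags are made up for illustration, and it assumes lowercase tag names, no nested tags of the same kind, and no '>' inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

public class TagTextExtractor {

    /** Removes anything between '<' and '>' so that only the text remains. */
    static String stripTags(String s) {
        StringBuilder sb = new StringBuilder();
        boolean inTag = false;
        for (char ch : s.toCharArray()) {
            if (ch == '<') {
                inTag = true;
            } else if (ch == '>') {
                inTag = false;
            } else if (!inTag) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    /** Returns the text between each <tag ...> and the next </tag>. */
    static List<String> extract(String html, String tag) {
        List<String> results = new ArrayList<>();
        String open = "<" + tag;
        String close = "</" + tag + ">";
        int pos = 0;
        while (true) {
            int start = html.indexOf(open, pos);
            if (start < 0) break;
            int afterName = start + open.length();
            if (afterName >= html.length()) break;
            // Match the whole tag name only: "<p" must not match "<pre".
            char c = html.charAt(afterName);
            if (c != '>' && !Character.isWhitespace(c)) {
                pos = afterName;
                continue;
            }
            // Skip over any attributes to the end of the opening tag.
            int contentStart = html.indexOf('>', afterName);
            if (contentStart < 0) break;
            contentStart++;
            int end = html.indexOf(close, contentStart);
            if (end < 0) break;
            results.add(stripTags(html.substring(contentStart, end)).trim());
            pos = end + close.length();
        }
        return results;
    }

    public static void main(String[] args) {
        String html = "<p>first\nline</p><p>See <a href=\"u\">link</a> here</p>";
        System.out.println(extract(html, "p"));
        System.out.println(extract(html, "a"));
    }
}
```

Working on the whole file content as one String, rather than line by line, sidesteps both problems in the question: a tag pair on a single line and a paragraph that spans several lines are handled the same way. Real-world HTML will still break this sketch quickly, which is why a parser library is the safer choice.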

I think it would be easier to use a library. Use this one: http://jsoup.org/ . You can also parse a String with it.
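
A minimal sketch of what that could look like, assuming the jsoup jar is on the classpath. The class name JsoupTextExtractor and the method paragraphAndLinkText are made up for illustration; Jsoup.parse, select and text are jsoup's own calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupTextExtractor {

    /** Collects the text of every <p> and every <a href=...> element, in document order. */
    static List<String> paragraphAndLinkText(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // CSS selector: all paragraphs, plus all anchors that have an href attribute.
        for (Element el : doc.select("p, a[href]")) {
            texts.add(el.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<p>See <a href=\"u\">link</a> here</p>";
        System.out.println(paragraphAndLinkText(html));
    }
}
```

To read from a file instead of a String, jsoup also offers Jsoup.parse(File, String charsetName), which throws IOException.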

