
Html Text Extraction in j2me

I have a string taken from an HTML web page, as shown below:

String htmlString =

<span style="mso-bidi-font-family:Gautami;mso-bidi-theme-font:minor-bidi">President Pranab pay great 
tributes to Motilal Nehru on occasion of 
</span>
150th birth anniversary. Pranab said institutions evolved by 
leaders like him should be strengthened instead of being destroyed. 
<span style="mso-spacerun:yes">&nbsp;
</span>
He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of 
Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly,   
the first set of coins and postal stamps released at the function to commemorate the event.
</p> 

I need to extract the text from the String above. After extraction, my output should look like this:

Output

President Pranab pay great tributes to Motilal Nehru on occasion of 150th birth anniversary. Pranab said institutions evolved by leaders like him should be strengthened instead of being destroyed.  He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly, now Parliament. Calling himself a student of history, he said Motilal's Swaraj Party acted as a disciplined assault force in the Legislative Assembly and he was credited with evolving the system of a Public Accounts Committee which is now one of the most effective watchdogs over executive in matters of money and finance. Mukherjee also received the first set of coins and postal stamps released at the function to commemorate the event.

To do this, I used the following logic:

int spanIndex = content.indexOf("<span");
spanIndex = content.indexOf(">", spanIndex);
int endspanndex = content.indexOf("</span>", spanIndex);
content = content.substring(spanIndex  + 1, endspanndex);

My result is:

President Pranab pay great tributes to Motilal Nehru on occasion of

I tried several HTML parsers, but none of them work in J2ME.

Can anyone help me get the full description text? Thanks.

If you are using BlackBerry OS 5.0 or later, you can use BrowserField to parse the HTML into a DOM document.

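You can keep working with it the same way you would with any other string. Alternatively, a simple finite-state automaton can solve this problem. I saw such a solution in the moJab project (you can download the source code here). In the mojab.xml package there is a minimalist XML parser designed for J2ME, and I believe it would parse your example as well. Take a look at the source: it is just three simple branches, and it seems to be usable without modification.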
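To make the finite-state idea concrete, here is a minimal sketch of a three-state scanner (text, tag, entity). It is not the moJab code: the class name `HtmlTextFsm`, the method `extractText`, and the small entity table are all assumptions made for illustration. It relies only on `String` and `StringBuffer`, so it should fit the CLDC class library.

```java
// Hypothetical helper for illustration; not taken from the moJab source.
final class HtmlTextFsm {
    private static final int TEXT = 0;   // outside any tag
    private static final int TAG = 1;    // between '<' and '>'
    private static final int ENTITY = 2; // between '&' and ';'

    static String extractText(String html) {
        StringBuffer out = new StringBuffer();
        StringBuffer entity = new StringBuffer();
        int state = TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '<') {
                        state = TAG;
                    } else if (c == '&') {
                        state = ENTITY;
                        entity.setLength(0);
                    } else {
                        out.append(c);
                    }
                    break;
                case TAG:
                    if (c == '>') {
                        state = TEXT; // everything inside the tag is dropped
                    }
                    break;
                default: // ENTITY
                    if (c == ';') {
                        out.append(decode(entity.toString()));
                        state = TEXT;
                    } else if (c == ' ' || c == '\n' || c == '\r' || c == '\t'
                            || c == '<' || entity.length() > 8) {
                        // Not a real entity: emit it literally and re-read c as text.
                        out.append('&').append(entity.toString());
                        state = TEXT;
                        i--;
                    } else {
                        entity.append(c);
                    }
                    break;
            }
        }
        if (state == ENTITY) {
            out.append('&').append(entity.toString()); // unterminated entity at end of input
        }
        return out.toString();
    }

    // Only a handful of common entities are handled; unknown ones are kept verbatim.
    private static String decode(String name) {
        if (name.equals("nbsp")) return " ";
        if (name.equals("amp")) return "&";
        if (name.equals("lt")) return "<";
        if (name.equals("gt")) return ">";
        if (name.equals("quot")) return "\"";
        return "&" + name + ";";
    }
}
```

Calling `HtmlTextFsm.extractText(htmlString)` on the string from the question would drop the `<span>` and `</p>` tags and turn `&nbsp;` into a space. One known limitation of this sketch: a `>` inside a quoted attribute value would end the tag early.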

Since J2ME does not support HTML parsers, we can extract the text as follows:

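private String removeHtmlTags(String content) {
    // StringBuffer is used because it is available in the J2ME (CLDC) class library.
    StringBuffer text = new StringBuffer();
    int pos = 0;
    while (true) {
        int beginTag = content.indexOf('<', pos);
        if (beginTag == -1) {
            break; // no more tags
        }
        // Look for '>' only after the '<', so a stray '>' in the text
        // is not mistaken for the end of the tag.
        int endTag = content.indexOf('>', beginTag);
        if (endTag == -1) {
            break; // unterminated tag: keep the rest unchanged instead of looping forever
        }
        text.append(content.substring(pos, beginTag));
        pos = endTag + 1;
    }
    text.append(content.substring(pos));
    return text.toString();
}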
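Note that `removeHtmlTags` only removes the tags: the line breaks and repeated spaces from the original markup stay in the result, while the desired output in the question is a single paragraph. A possible follow-up step is to collapse whitespace runs. The sketch below is an assumption about how that could look; the class `WhitespaceNormalizer` and its `collapse` method are hypothetical names, and the code avoids regular expressions because CLDC does not provide them.

```java
// Hypothetical helper for illustration; uses only String and StringBuffer.
final class WhitespaceNormalizer {
    // Replaces every run of spaces, tabs and line breaks with one space
    // and drops leading and trailing whitespace.
    static String collapse(String text) {
        StringBuffer out = new StringBuffer();
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == ' ' || c == '\n' || c == '\r' || c == '\t') {
                // Remember the gap, but only once some text has been written.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

It could be applied as `WhitespaceNormalizer.collapse(removeHtmlTags(content))`. Entities such as `&nbsp;` are not touched by this step and would still need to be replaced separately.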

jsoup is a very popular library for extracting text from HTML documents. Here is one such example.
