
HTML Text Extraction in J2ME

I have a String from an HTML web page like this:

String htmlString =

<span style="mso-bidi-font-family:Gautami;mso-bidi-theme-font:minor-bidi">President Pranab pay great 
tributes to Motilal Nehru on occasion of 
</span>
150th birth anniversary. Pranab said institutions evolved by 
leaders like him should be strengthened instead of being destroyed. 
<span style="mso-spacerun:yes">&nbsp;
</span>
He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of 
Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly,   
the first set of coins and postal stamps released at the function to commemorate the event.
</p> 

I need to extract the text from the above String. After extraction, my output should look like this:

Output:

President Pranab pay great tributes to Motilal Nehru on occasion of 150th birth anniversary. Pranab said institutions evolved by leaders like him should be strengthened instead of being destroyed.  He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly, now Parliament. Calling himself a student of history, he said Motilal's Swaraj Party acted as a disciplined assault force in the Legislative Assembly and he was credited with evolving the system of a Public Accounts Committee which is now one of the most effective watchdogs over executive in matters of money and finance. Mukherjee also received the first set of coins and postal stamps released at the function to commemorate the event.

For this I have used the logic below:

int spanIndex = content.indexOf("<span");
spanIndex = content.indexOf(">", spanIndex);
int endspanndex = content.indexOf("</span>", spanIndex);
content = content.substring(spanIndex  + 1, endspanndex);

and my resulting output is:

President Pranab pay great tributes to Motilal Nehru on occasion of

I have tried different HTML parsers, but they do not work on J2ME.

Can anyone help me get the full description text? Thanks.

If you are using BlackBerry OS 5.0 or later, you can use BrowserField to parse the HTML into a DOM document.

You may continue the same way as you propose with the rest of the string. Alternatively, a simple finite-state automaton would solve this. I have seen such a solution in the moJab project (you can download the sources here). In the mojab.xml package, there is a minimalistic XML parser designed for J2ME. I think it would parse your example as well. Take a look at the sources; it is just three simple classes. It seems to be usable without modifications.
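To make the finite-state idea concrete, here is a minimal sketch of a tag stripper written as a three-state automaton. It is not the moJab parser; the class name TagStripperFsm and the method stripTags are hypothetical, and the code assumes that only String and StringBuffer are available, as on CLDC.

```java
// Hypothetical helper, not taken from moJab: strips tags with a small state machine.
final class TagStripperFsm {
    private static final int TEXT = 0;    // outside any tag: characters are copied
    private static final int TAG = 1;     // between '<' and '>': characters are dropped
    private static final int QUOTED = 2;  // inside a quoted attribute value within a tag

    static String stripTags(String html) {
        StringBuffer out = new StringBuffer(html.length());
        int state = TEXT;
        char quote = 0; // the quote character that opened the current attribute value
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '<') {
                        state = TAG;
                    } else {
                        out.append(c);
                    }
                    break;
                case TAG:
                    if (c == '>') {
                        state = TEXT;
                    } else if (c == '"' || c == '\'') {
                        quote = c;
                        state = QUOTED;
                    }
                    break;
                default: // QUOTED: a '>' here belongs to the attribute value
                    if (c == quote) {
                        state = TAG;
                    }
                    break;
            }
        }
        return out.toString();
    }
}
```

Calling TagStripperFsm.stripTags(htmlString) visits each character once and copies only what lies outside tags. Entities such as &nbsp; and the line breaks of the source are left in place, and a '<' that is never closed swallows the rest of the input, so this is a starting point rather than a full parser.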

Since J2ME does not support the usual HTML parsers, the text can be extracted like this:

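private String removeHtmlTags(String content) {
        // StringBuffer is used because CLDC has no StringBuilder.
        StringBuffer text = new StringBuffer(content.length());
        int pos = 0;

        while (pos < content.length()) {
            int beginTag = content.indexOf("<", pos);
            if (beginTag == -1) {
                break; // no more tags
            }
            // Look for '>' only after the '<', so a stray '>' in the text is ignored.
            int endTag = content.indexOf(">", beginTag);
            if (endTag == -1) {
                break; // unterminated tag: keep the rest as text instead of looping forever
            }
            text.append(content.substring(pos, beginTag)); // text before the tag
            pos = endTag + 1;                              // skip the tag itself
        }
        text.append(content.substring(pos)); // text after the last tag
        return text.toString();
    }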
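Removing the tags still leaves the &nbsp; entity and the original line breaks in the text, whereas the output the question asks for is a single paragraph. The sketch below shows one possible follow-up step. The class HtmlTextCleanup and its methods are hypothetical names, only a handful of common entities are handled, and &nbsp; is mapped to an ordinary space as a simplification.

```java
// Hypothetical post-processing helpers; they use only String and StringBuffer.
final class HtmlTextCleanup {
    // Replaces every occurrence of target, since CLDC's String has no replace(String, String).
    static String replaceAll(String s, String target, String replacement) {
        if (target.length() == 0) {
            return s; // an empty target would never advance the search
        }
        StringBuffer out = new StringBuffer(s.length());
        int from = 0;
        int hit;
        while ((hit = s.indexOf(target, from)) != -1) {
            out.append(s.substring(from, hit)).append(replacement);
            from = hit + target.length();
        }
        out.append(s.substring(from));
        return out.toString();
    }

    // Decodes a few common entities; this is not a complete entity table.
    static String decodeBasicEntities(String s) {
        s = replaceAll(s, "&nbsp;", " ");
        s = replaceAll(s, "&lt;", "<");
        s = replaceAll(s, "&gt;", ">");
        s = replaceAll(s, "&quot;", "\"");
        s = replaceAll(s, "&amp;", "&"); // last, so "&amp;lt;" ends up as the literal text "&lt;"
        return s;
    }

    // Collapses runs of spaces, tabs and line breaks into one space and trims both ends.
    static String collapseWhitespace(String s) {
        StringBuffer out = new StringBuffer(s.length());
        boolean pendingSpace = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ' ' || c == '\n' || c == '\r' || c == '\t') {
                pendingSpace = out.length() > 0; // ignore leading whitespace
            } else {
                if (pendingSpace) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

A possible call chain is HtmlTextCleanup.collapseWhitespace(HtmlTextCleanup.decodeBasicEntities(removeHtmlTags(htmlString))). Decoding after tag removal keeps a decoded "<" from being mistaken for the start of a tag.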

JSoup is a very popular library for extracting text from HTML documents. Here is one such example of the same.
