
How to Extract Content From HTML

I have HTML as a string and I want to extract just the "post_title" value from it. This is the HTML string:

<div class="hidden" id="inline_49">
<div class="post_title">Single parenting</div>
<div class="post_name">single-parenting</div>
<div class="post_author">90307285</div>
<div class="comment_status">open</div>
<div class="ping_status">open</div>
<div class="_status">publish</div>
<div class="jj">20</div>
<div class="mm">07</div>
<div class="aa">2015</div>
<div class="hh">00</div>
<div class="mn">52</div>
<div class="ss">33</div>

The post title here is "Single parenting", which is what I want to extract. This is what I am using:

Elements link = doc.select("div[class=post_title]");
String title = link.text();

But this gives a blank string. I also tried:

Elements link = doc.select("div[id=inline_49]").select("div[class=post_title]");
String title = link.text();

This also gives a blank string. Which selector exactly do I need to use to extract the title?

Try this, but make sure the HTML text in your String is well formed:

String html = "<div class=\"hidden\" id=\"inline_49\">" +
            "<div class=\"post_title\">Single parenting</div>" +
            "<div class=\"post_name\">single-parenting</div>" +
            "<div class=\"post_author\">90307285</div>";

Document document = Jsoup.parse(html);
Elements divElements = document.select("div");
for(Element div : divElements) {
    if(div.attr("class").equals("post_title")) {
       System.out.println(div.ownText());
    }
}

You must include a cookie in your request: the code below logs in first and then sends the login cookies along with the page request. Check this Java code:

try {
    String url = "https://ssblecturate.wordpress.com/wp-login.php";

    // Log in first so that the response carries the session cookies.
    Connection.Response response = Jsoup.connect(url)
            .data("log", "your_login_here") // your WordPress login
            .data("pwd", "your_password_here") // your WordPress password
            .data("rememberme", "forever")
            .data("wp-submit", "Log In")
            .method(Connection.Method.POST)
            .followRedirects(true)
            .execute();

    // Reuse the login cookies when requesting the admin page.
    Document document = Jsoup.connect("https://ssblecturate.wordpress.com/wp-admin/edit.php")
            .cookies(response.cookies())
            .get();

    // first() returns null when nothing matches, so check before using it.
    Element titleElement = document.select("div[class=post_title]").first();
    if (titleElement != null) {
        System.out.println(titleElement.text());
    }
} catch (IOException e) {
    e.printStackTrace();
}

Updated! Hope it works for you:

try {
    File input = new File("D:\\JAVA\\J2EE\\Bin\\Bin\\Project\\xml\\src\\demo\\index.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
    // Get the div tag whose class name is 'post_title'
    Element element = doc.select("div.post_title").first();
    // first() returns null when nothing matches, so check before using it.
    if (element != null) {
        System.out.println(element.html());
    }
} catch (Exception e) {
    e.printStackTrace();
}

If you have it in a String, you can try a regular expression.

This regex means "everything between the opening and closing tags of the div with class post_title" (not exactly, but close enough for your sample).

String exp = "<div class=\"post_title\">([^<]*)</div>";

You should be able to get the content with:

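Matcher matcher = Pattern.compile(exp).matcher(yourString);
String post_title = matcher.find() ? matcher.group(1) : null; // find() must succeed before group(1) can be read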

NOTE: I assume your post_title does not contain "<". If it did, that should already produce an XML structure error.
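
For completeness, here is a minimal, self-contained sketch of the regex approach using only java.util.regex. The class name PostTitleExtractor and the method extractPostTitle are hypothetical names chosen for illustration and are not part of the original answer. The sketch assumes the div is written exactly as in the sample (a single class attribute in double quotes) and that the title contains no "<".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class PostTitleExtractor {

    // Same expression as in the answer: capture everything up to the next '<'.
    private static final Pattern POST_TITLE =
            Pattern.compile("<div class=\"post_title\">([^<]*)</div>");

    // Returns the first post title found, or an empty Optional if there is none.
    static Optional<String> extractPostTitle(String html) {
        Matcher matcher = POST_TITLE.matcher(html);
        // find() must succeed before group(1) can be read; calling group()
        // on a matcher that has not matched throws IllegalStateException.
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened version of the HTML string from the question.
        String html = "<div class=\"hidden\" id=\"inline_49\">"
                + "<div class=\"post_title\">Single parenting</div>"
                + "<div class=\"post_name\">single-parenting</div>";
        System.out.println(extractPostTitle(html).orElse("(no title found)"));
    }
}
```

Returning an Optional makes the "no match" case explicit instead of throwing or returning null. If the real markup varies (extra attributes, single quotes, nested tags), a pattern this strict will likely miss it, and an HTML parser such as jsoup is the safer choice.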
