How can I get the image src, title, and description from this HTML using cheerio?

I am trying to extract some content from a website using Node.js with cheerio. I want to extract the following content:

  1. "This is my sample title text" text.
  2. "Here will be my description content" text.
  3. Image src.

Here is the HTML:

     <body>
     <div class="detail_loop">
         <img class="imfast" data-original="http://www.example.com/wp-content/uploads/2017/03/imageurl-250x150.jpg" title=""
              align="left" width="250" height="150"
              src="http://www.example.com/wp-content/uploads/2017/03/imageurl-250x150.jpg" style="display: block;">
         <h2>
             <a href="http://www.example.com/2017/04/576487/" rel="bookmark">This is my title text</a>
         </h2>
         Here will be my description content.
         <div class="clear"></div>
         <div class="send_loop" style="display: none;">
             <a href="http://www.example.com/2017/04/576487//#respond" target="_blank">
                 <div class="send_com">
                     <div class="send_bubb">
                         <div class="count">
                             0
                         </div>
                     </div>
                 </div>
             </a>
             <a href="https://www.facebook.com/sendr.php?u=http://www.example.com/2017/04/576487/" target="_blank">
                 <div class="send_fb">
                     <div class="send_bubb">
                         <div class="count">
                             send
                         </div>
                     </div>
                 </div>
             </a>
             <a href="https://twitter.com/send?url=http://www.example.com/2017/04/576487/&amp;text=this is sample title;hashtags=example"
                target="_blank">
                 <div class="send_tt">
                     <div class="send_bubb">
                         <div class="count">
                             Tweet
                         </div>
                     </div>
                 </div>
             </a>
             <div class="clear"></div>
         </div>
         <div class="clear"></div>
         <div class="detail_loop_dvd"></div>
         <div class="clear"></div>
     </div>
     </body>

Is something like this what you were aiming for? You could of course simply pass the markup directly as a string, à la: cheerio.load('<html><body>…</html>')

Example Code

Note: .text() returns the combined text of all descendants (including other <div> elements, etc.), hence the filter, which returns true only for text nodes. [ #20832910 ]

const cheerio = require('cheerio');
const fs = require('fs');

/**
 * Given data saved in file 'index.html' in current path
 */
fs.readFile('index.html', {encoding: 'utf-8'}, (err, data) => {
    if (err) { console.error(err); return; } // Report read errors on stderr and stop
    const $ = cheerio.load(data);

    /**
     * Print what you desire
     */
    console.log($('h2 a').text()); // Title text

    console.log($('div.detail_loop').contents().filter(function() {
        return this.type === 'text';
    }).text().trim()); // Description content (direct text nodes only, surrounding whitespace trimmed)

    console.log($('img').attr('src')); // Image source
});
