
Node.js fs cheerio read and write multiple files

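I adapted the following code from an existing example to use with Node.js and Cheerio. It reads an HTML file and splits the large source file into small pieces. The code works well for a single file.

Now I need to read several large HTML files, split them one after another, and write the generated files to a single folder. How can I read each file in a folder, split it, and write out the results?

Here is the code: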

var cheerio = require('cheerio'),
    fs = require('fs');

fs.readFile('./sourceHtml2/testone.html', 'utf8', dataLoaded);

function dataLoaded(err, data) {
  if (err) throw err;

  // declare $ locally instead of creating an implicit global
  var $ = cheerio.load(data);

  // write each direct child div of #toplevel to its own file
  $('#toplevel > div').each(function (i, elem) {

    var id = $(elem).attr('id'),
        filename = id + '.html',
        content = $.html(elem);

    fs.writeFile('./output2/' + filename, content, function (err) {
        if (err) throw err;
        console.log('Written html to ' + filename);
    });
  });
}

Here is my sample source file:

<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Lorem Ipsum</title>
  </head>
  <body>
    <div id="toplevel">
      <div id="1-1">
        <h1>HTML Ipsum Presents One</h1>
        <p>
        <strong>Pellentesque habitant morbi tristique</strong>senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper. 

        <h2>Header Level 2</h2>
        <ol>
          <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
          <li>Aliquam tincidunt mauris eu risus.</li>
        </ol>
        <h3>Header Level 3</h3>
        <ul>
          <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
          <li>Aliquam tincidunt mauris eu risus.</li>
        </ul>
      </div>
      <div id="1-2">
        <h1>HTML Ipsum Presents Two</h1>
        <p>
        <strong>Pellentesque habitant morbi tristique</strong>senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper. 

        <h2>Header Level 2</h2>
        <ol>
          <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
          <li>Aliquam tincidunt mauris eu risus.</li>
        </ol>
        <blockquote>
          <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus magna. Cras in mi at felis aliquet congue. Ut a est eget ligula molestie gravida. Curabitur massa. Donec eleifend, libero at sagittis mollis, tellus est malesuada tellus,
          at luctus turpis elit sit amet quam. Vivamus pretium ornare est.</p>
        </blockquote>
        <h3>Header Level 3</h3>
        <ul>
          <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
          <li>Aliquam tincidunt mauris eu risus.</li>
        </ul>
      </div>
      <div id="1-3">
        <h1>HTML Ipsum Presents Three</h1>
        <p>
        <strong>Pellentesque habitant morbi tristique</strong>senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper. 

        <h2>Header Level 2</h2>
        <ol>
          <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
          <li>Aliquam tincidunt mauris eu risus.</li>
        </ol>
        <blockquote>
          <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus magna. Cras in mi at felis aliquet congue. Ut a est eget ligula molestie gravida. Curabitur massa. Donec eleifend, libero at sagittis mollis, tellus est malesuada tellus,
          at luctus turpis elit sit amet quam. Vivamus pretium ornare est.</p>
        </blockquote>
        <h3>Header Level 3</h3>
        <ul>
          <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
          <li>Aliquam tincidunt mauris eu risus.</li>
        </ul>
      </div>
    </div>
  </body>
</html>

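Thanks for your help.

You need to process the files in the input directory as an array, and you also want to prevent file name collisions in the output folder.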
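As a small illustration of those two points, here is a minimal sketch that uses only Node's built-in `path` module. The helper names `isHtmlFile` and `outputName` and the sample file list are hypothetical and not part of the original code; they only show how a directory listing could be filtered and how the source file name could be folded into each output name.

```javascript
const path = require('path');

// Hypothetical helper: keep only names ending in .htm or .html (case-insensitive).
function isHtmlFile(name) {
  const ext = path.extname(name).toLowerCase();
  return ext === '.htm' || ext === '.html';
}

// Hypothetical helper: prefix the div id with the source file's base name,
// so that two source files containing the same id do not overwrite each other.
function outputName(sourceFile, id) {
  const base = path.basename(sourceFile, path.extname(sourceFile));
  return base + '_' + id + '.html';
}

// Made-up directory listing, standing in for the array fs.readdir would return.
const files = ['testone.html', 'testtwo.htm', 'notes.txt', 'README'];
const htmlFiles = files.filter(isHtmlFile);

console.log(htmlFiles);
console.log(outputName('testone.html', '1-1'));
```

Because `path.extname` returns an empty string for a name without an extension, a file such as `README` is simply skipped rather than causing an error.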

The code below addresses both issues. It reads the HTML files (.htm and .html) from the 'input' subfolder and writes the generated files to the 'output' subfolder.

var cheerio = require('cheerio'),
    fs = require('fs'),
    path = require('path');

// process files found in the 'input' folder
fs.readdir('./input', 'utf8', findHtmlFiles);

function findHtmlFiles(err, files) {
    if (err) throw err;

    files.forEach(function (fullFilename) {
        // path.extname returns '' for names without an extension,
        // so such files are skipped instead of causing an error
        var rawExt = path.extname(fullFilename);
        var ext = rawExt.toLowerCase();
        // only process '.htm' and '.html' files (case-insensitive)
        if (ext === '.htm' || ext === '.html') {
            fs.readFile('./input/' + fullFilename, 'utf8', function (err, data) {
                if (err) throw err;
                // add the file name to prevent collisions
                // in the output folder
                var fileData = {
                    file: path.basename(fullFilename, rawExt),
                    data: data
                };
                dataLoaded(null, fileData);
            });
        }
    });
}

function dataLoaded(err, fd) {
    if (err) throw err;

    // declare $ locally instead of creating an implicit global
    var $ = cheerio.load(fd.data);

    // write each direct child div of #toplevel to its own file
    $('#toplevel > div').each(function (i, elem) {

        var id = $(elem).attr('id'),
            filename = fd.file + '_' + id + '.html',
            content = $.html(elem);

        fs.writeFile('./output/' + filename, content, function (err) {
            if (err) throw err;
            console.log('Written html to ' + filename);
        });
    });
}

Sample console output:

Written html to testone_1-1.html
Written html to testone_1-2.html
Written html to testone_1-3.html
Written html to testtwo_1-1.html
Written html to testtwo_1-2.html
Written html to testtwo_1-3.html
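
If the files really must be handled strictly one after another, as the question asks, a synchronous variant is one option. The following is only a sketch under stated assumptions: `splitFolder` and its `split` parameter are hypothetical names, the `recursive` option of `fs.mkdirSync` assumes Node.js 10.12 or later, and the splitting step is passed in as a function so the file handling can be tried without cheerio installed.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical function: split every HTML file in inputDir, one file at a time.
// `split` takes an HTML string and returns an array of { id, content } objects.
function splitFolder(inputDir, outputDir, split) {
  // Create the output folder if it does not exist yet.
  fs.mkdirSync(outputDir, { recursive: true });

  const written = [];
  // Sort the listing so the processing order is predictable.
  for (const name of fs.readdirSync(inputDir).sort()) {
    const ext = path.extname(name);
    if (!/^\.html?$/i.test(ext)) continue; // skip anything that is not .htm/.html

    const html = fs.readFileSync(path.join(inputDir, name), 'utf8');
    const base = path.basename(name, ext);

    for (const part of split(html)) {
      // Prefix with the source file name to avoid collisions in the output folder.
      const filename = base + '_' + part.id + '.html';
      fs.writeFileSync(path.join(outputDir, filename), part.content);
      written.push(filename);
    }
  }
  return written;
}
```

With cheerio, the `split` argument could load the HTML, select `#toplevel > div`, and return one `{ id: $(elem).attr('id'), content: $.html(elem) }` object per element. The synchronous calls block the event loop, which is usually acceptable for a one-off batch script but not inside a server.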
