
Convert plain text lists into HTML lists using JavaScript

I frequently receive text lists in PDF format that are hierarchical (usually three levels deep). I would like to get these into HTML lists so that they can be styled with CSS and be made more presentable. Due to the volume of data, I am trying to automate the process with JavaScript.

Example source data:

・First Level 1 list item
– First Level 2 list item, which is a subset of the first Level 1 list item.
– Second Level 2 list item, which is a subset of the first Level 1 list item.
♦ First Level 3 list item, which is a subset of the second Level 2 list item.
・Second Level 1 list item.

Example goal:

<ul>
    <li>First Level 1 list item
        <ul>
            <li>First Level 2 list item, which is a subset of the first Level 1 list item.</li>
            <li>Second Level 2 list item, which is a subset of the first Level 1 list item.
                <ul>
                    <li>First Level 3 list item, which is a subset of the second Level 2 list item.</li>
                </ul>
            </li>
        </ul>
    </li>
    <li>Second Level 1 list item.</li>
</ul>

Progress so far:

I've determined that I can match Level 1 list items with this regex: /^・.+$/gm

And match Level 2 list items with this regex: /^\–.+$/gm

And Level 3 with this one: /^♦.+$/gm

Or simply delimit all of the list levels at once by combining those: string.match(/(^・.+$)|(^\–.+$)|(^♦.+$)/gm);
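
One way to put the combined pattern to work is to capture the bullet character and look up its level, rather than only matching whole lines. The following is a minimal sketch, assuming the three bullets shown in the sample data; `BULLET_LEVELS` and `classifyLines` are hypothetical names used only for illustration. As a side note, the backslash before the en dash is not required, and under the `u` flag such an escape is a syntax error, so the sketch leaves it out.

```javascript
// Map each bullet character to its nesting level (assumed from the sample data).
const BULLET_LEVELS = { "・": 1, "–": 2, "♦": 3 };

// Turn raw text into an array of { level, text }, skipping lines without a known bullet.
function classifyLines(source) {
  const items = [];
  for (const line of source.split(/\r?\n/)) {
    // Group 1 is the bullet, group 2 is the item text after optional spaces.
    const match = line.match(/^\s*([・–♦])\s*(.+)$/);
    if (match) {
      items.push({ level: BULLET_LEVELS[match[1]], text: match[2].trim() });
    }
  }
  return items;
}

const sample = [
  "・First Level 1 list item",
  "– First Level 2 list item",
  "♦ First Level 3 list item",
  "・Second Level 1 list item.",
].join("\n");

console.log(classifyLines(sample));
```

Keeping the level as a number, instead of three separate match arrays, preserves the original order of the lines, which is what the later grouping step depends on.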

Now knowing how to match the different types of items, I am trying to figure out how to sort them. Conceptually, if I have them all in one array (let's use simple color coding for the next example), then it should be possible to create a function that can identify the patterns and make a multidimensional array in the correct hierarchy, and then another function to output that content padded with HTML tags in their proper places.

[Figure: visualization of the conversion of a one-dimensional array to a multidimensional array based on type]

Let's say that in the simplified example above we have just a string of three letters corresponding to the colors: r, g, b.

So that might look like: rrgbrgbrgbbrggbr

I've been experimenting and trying to get this kind of structure into a multidimensional array.
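
As a sketch of that grouping, the letters can be folded into a tree with a stack that remembers the most recently opened node at each depth. This assumes r, g and b stand for Levels 1, 2 and 3 respectively, and that an item never skips a level on the way down; `DEPTH` and `nest` are illustrative names, not part of any library.

```javascript
// Assumed mapping from letter to nesting depth.
const DEPTH = { r: 1, g: 2, b: 3 };

// Build a tree of { label, children } nodes from a flat string of letters.
function nest(letters) {
  const root = { label: "root", children: [] };
  const stack = [root]; // stack[d] is the last node opened at depth d
  for (const letter of letters) {
    const depth = DEPTH[letter];
    if (!depth || depth > stack.length) {
      throw new Error("unknown letter or skipped level at: " + letter);
    }
    const node = { label: letter, children: [] };
    stack.length = depth;                 // close every node at this depth or deeper
    stack[depth - 1].children.push(node); // attach to the nearest open parent
    stack.push(node);                     // this node is now the open one at its depth
  }
  return root.children;
}

const tree = nest("rrgbrgbrgbbrggbr");
console.log(JSON.stringify(tree, null, 2));
```

Truncating the stack with `stack.length = depth` is what lets a g that follows a b return to the enclosing r instead of nesting deeper.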

I believe one dimension below Level 3 will be needed to hold the actual text strings. And one dimension above Level 1 will be needed to encompass each whole list. So a structure like this:

list
[
    level1
    [
        level2
        [
            level3
            [   
                string
                ["list item text"]
            ]
        ]
    ]
]
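
An alternative to reserving separate array dimensions for the text might be to let every item be an object that carries its own string plus an array of children. The literal below is a hand-written sketch of the sample data in that shape; the property names `text` and `children` and the helper `maxDepth` are assumptions made for illustration.

```javascript
// The sample list, written out by hand as nested { text, children } objects.
const list = [
  {
    text: "First Level 1 list item",
    children: [
      {
        text: "First Level 2 list item, which is a subset of the first Level 1 list item.",
        children: [],
      },
      {
        text: "Second Level 2 list item, which is a subset of the first Level 1 list item.",
        children: [
          {
            text: "First Level 3 list item, which is a subset of the second Level 2 list item.",
            children: [],
          },
        ],
      },
    ],
  },
  { text: "Second Level 1 list item.", children: [] },
];

// Depth of the deepest item; an empty array has depth 0.
function maxDepth(items) {
  return items.reduce(
    (deepest, item) => Math.max(deepest, 1 + maxDepth(item.children)),
    0
  );
}

console.log(maxDepth(list)); // 3
```

With this shape the nesting depth is not fixed in advance, so a fourth level would need no structural change.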

Here's where I'm having some trouble figuring out how to sort all of this. Any help appreciated.

A regular expression is not needed: the first character of each line is enough to identify its level.

var log = console.log;
var data = `・First Level 1 list item
– First Level 2 list item, which is a subset of the first Level 1 list item.
– Second Level 2 list item, which is a subset of the first Level 1 list item.
♦ First Level 3 list item, which is a subset of the second Level 2 list item.
・Second Level 1 list item.`;

//split the text into an array of strings, one item per line
data = data.split("\n");

var firstChar, prevFirstChar = "";
//our output struct
var struct = [];
var cursor = struct;
//we need only one token to return to the first level
var lvl1Key = "・";
var prevnode = {};
data.forEach(line => {
  //get token
  firstChar = line.charAt(0);
  let node = {
    str: line.slice(1).trim(),
    child: []
  };
  if (firstChar == lvl1Key) {
    //return to root
    cursor = struct;
  } else if (firstChar != prevFirstChar) {
    //descend into the previous node when the token changes and it is not the root token
    //(this assumes a changed token always means one level deeper)
    cursor = prevnode.child;
  }
  cursor.push(node);
  prevnode = node;
  prevFirstChar = firstChar;
});

log(struct);

//OK, we have the struct; convert it to HTML
//offset for formatting
const offsetSize = 2;

//recursive function; node is an array of { str: "string", child: [nodes] }
var toHtml = function(node, offset = "") {
  var ret = offset + "<ul>\n";
  offset += " ".repeat(offsetSize);
  node.forEach(rec => {
    ret += offset + "<li>" + rec.str;
    //if the array is not empty, add the HTML for the children inside this <li>
    if (rec.child.length) {
      ret += "\n" + toHtml(rec.child, offset + " ".repeat(offsetSize)) + offset;
    }
    ret += "</li>\n";
  });
  offset = offset.slice(offsetSize);
  ret += offset + "</ul>\n";
  return ret;
}

log(toHtml(struct));
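
The cursor in the answer either resets to the root or descends into the previous node, so a Level 2 item that follows a Level 3 item would end up nested one level too deep. A possible variant, sketched here under the assumption that the three bullets map to fixed levels, keeps a stack of open items so the parser can also climb back up. The names `LEVELS`, `parseList`, `escapeHtml` and `renderList` are hypothetical; the renderer places each nested list inside its parent `<li>` and escapes the item text.

```javascript
// Assumed mapping from bullet character to nesting level.
const LEVELS = { "・": 1, "–": 2, "♦": 3 };

// Parse bulleted text into a tree of { text, children } nodes.
function parseList(source) {
  const root = { text: "", children: [] };
  const stack = [root]; // stack[d] is the last item opened at depth d
  for (const raw of source.split(/\r?\n/)) {
    const line = raw.trim();
    const level = LEVELS[line.charAt(0)];
    if (!level) continue; // skip blank lines and unknown bullets
    // Clamp so an item that skips a level attaches to the deepest open item.
    const depth = Math.min(level, stack.length);
    stack.length = depth; // close items at this depth or deeper
    const node = { text: line.slice(1).trim(), children: [] };
    stack[depth - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

// Escape the characters that would otherwise be read as markup.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Render the tree as nested <ul> elements, two spaces per indentation step.
function renderList(items, indent = "") {
  const pad = indent + "  ";
  let html = indent + "<ul>\n";
  for (const item of items) {
    html += pad + "<li>" + escapeHtml(item.text);
    if (item.children.length) {
      html += "\n" + renderList(item.children, pad + "  ") + pad;
    }
    html += "</li>\n";
  }
  return html + indent + "</ul>\n";
}

const parsed = parseList("・A\n– B\n♦ C\n– D\n・E & F");
console.log(renderList(parsed));
```

In this input, D follows a Level 3 item and is attached back under A, next to B, which is the case the single cursor does not cover.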
