
Break tags removed on x-ray scrape

I am new to JS. I am scraping a URL with X-ray. The tags are removed when scraped, as expected, but I want each <br> tag to be replaced with something like ;

For example, if I scrape something like 'span#scraped-portion':

<span id="scraped-portion"><span class="bold">NodeJS</span><br>
    <span class="bold">Version:</span> 8<br><span class="bold">Date released:</span> 2017 Jan<br><span class="bold">Description:</span>Some other text
</span>

I get a result similar to the following:

NodeJS \n Version: 8Date released: 2017 JanDescription: Some other text

The text around the <br> tags gets run together, which makes it hard to tell what is what. So I want each <br> tag to be replaced with something like ; .

Is this possible, or should I use another library?

UPDATE

I found a pure X-Ray based solution that does not require replacing the <br> tags in the HTML before passing it to X-Ray (see the original solution below).

This approach uses X-Ray's filter functions together with nested X-Ray calls.

First, a custom filter function (called replaceLineBreak ) registered with X-Ray replaces the <br> tags in the original HTML. Second, the filtered result is wrapped in <span id="scraped-portion"> again to rebuild the original structure, and that string is passed as the first argument of a second X-Ray call.

Hope you'll like it!

var Xray = require('x-ray');

var x = Xray({
    filters: {
        // Custom filter: replace every <br> with ';'
        replaceLineBreak: function (value) { return value.replace(/<br>/g, ';'); }
    }
});
var html =
`
    <span id="scraped-portion"><span class="bold">NodeJS</span><br>
        <span class="bold">Version:</span> 8<br><span class="bold">Date released:</span> 2017 Jan<br><span class="bold">Description:</span>Some other text
    </span>
`;

// res is the HTTP response object of the enclosing request handler.
// First pass: take the inner HTML of the span and replace '<br>' with ';'.
x(html, '#scraped-portion@html | replaceLineBreak')(function (err, obj) {
    // Second pass: restore the original HTML structure (the outer span with
    // id 'scraped-portion') and extract its text.
    x(`<span id="scraped-portion">${obj}</span>`, '#scraped-portion')(function (err2, obj2) {
        res.header("Content-Type", "text/html; charset=utf-8");
        res.write(obj2);
        res.end();
    });
});

This results in the following string:

NodeJS;   Version: 8;Date released: 2017 Jan;Description:Some other text
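The filter above only matches the exact string <br>. Pages in the wild may also contain <br/>, <br /> or <BR>, so a more tolerant pattern can help. The following is a minimal sketch in plain JavaScript; replaceLineBreaks is a hypothetical helper name, not part of X-Ray, and the sample string is made up for illustration.

```javascript
// Hypothetical helper (not an X-Ray API): replace every <br> variant
// with a separator string.
function replaceLineBreaks(html, separator = ';') {
    // Matches <br>, <br/>, <br />, <BR> and <br class="x">.
    // The \b keeps other tags that merely start with "br" from matching.
    return html.replace(/<br\b[^>]*>/gi, separator);
}

// Made-up sample input that mixes several <br> spellings.
const sample = '<span>NodeJS</span><br/><span>Version:</span> 8<BR />Done';
console.log(replaceLineBreaks(sample));
```

Assuming X-Ray passes the selected value as the first argument to a filter, as the replaceLineBreak filter above relies on, this function could be registered in the filters object in its place.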

ORIGINAL SOLUTION

Why not replace all occurrences of the <br> tag before the HTML is processed by X-Ray?

var Xray = require('x-ray');

function tst(req, res) {
    var x = Xray();
    // Replace every <br> with ';' before X-Ray sees the markup.
    var html =
    `
        <span id="scraped-portion"><span class="bold">NodeJS</span><br>
            <span class="bold">Version:</span> 8<br><span class="bold">Date released:</span> 2017 Jan<br><span class="bold">Description:</span>Some other text
        </span>
    `.replace(/<br>/g, ';');

    // Scrape the text of the span and send it back as formatted JSON.
    x(html, ['span#scraped-portion'])(function (err, obj) {
        res.header("Content-Type", "text/html; charset=utf-8");
        res.write(JSON.stringify(obj, null, 4));
        res.end();
    });
}

Your code would then produce something like this:

NodeJS;\n Version: 8;Date released: 2017 Jan;Description:Some other text\n

which seems to meet your requirements.
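The scraped string still carries the newlines and indentation of the source markup. If separate, trimmed fields are easier to work with, the result can be split on the separator afterwards. This is a sketch only: splitFields is a hypothetical helper, and the scraped value is hand-written to resemble the output above rather than captured from a real run.

```javascript
// Hypothetical post-processing helper: split the scraped text on the
// separator, collapse whitespace runs, and drop empty pieces.
function splitFields(text, separator = ';') {
    return text
        .split(separator)
        .map(part => part.replace(/\s+/g, ' ').trim())
        .filter(part => part.length > 0);
}

// Hand-written stand-in for the string X-Ray returns in the example.
const scraped = 'NodeJS;\n        Version: 8;Date released: 2017 Jan;Description:Some other text\n    ';
console.log(splitFields(scraped));
// [ 'NodeJS', 'Version: 8', 'Date released: 2017 Jan', 'Description:Some other text' ]
```

This only works cleanly if the separator does not occur in the scraped text itself, so a rarer separator than ; may be a safer choice for real pages.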

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 