
How to parse this JSON with Node.js?

I'm trying to read the JSON string that is inside the <pre> element here:

http://nlp.stanford.edu:8080/corenlp/process?input=hello%20world&outputFormat=json

If I copy-paste the string with the mouse, I can JSON.parse() it. But if I read it programmatically, I get an error.

Here is my code:

var request = require('request'); // to make POST requests
var Entities = require('html-entities').AllHtmlEntities; // to decode the json string (i.e. get rid of nbsp and quot's)
var fs = require('fs')
// Set the headers
var headers = {
    'User-Agent': 'Super Agent/0.0.1',
    'Content-Type': 'application/x-www-form-urlencoded'
}

// Configure the request
var options = {
    url: 'http://nlp.stanford.edu:8080/corenlp/process',
    method: 'POST',
    headers: headers,
    form: {
        'input': 'hello world',
        'outputFormat': 'json'
    }
}

// Start the request
request(options, function(error, response, body) {
    if (!error && response.statusCode == 200) {
        // Print out the response body
        console.log("body: " + body)
        let cheerio = require('cheerio')
        let $ = cheerio.load(body)
        var inside = $('pre').text();
        inside = Entities.decode(inside.toString());
        //console.log("inside "+ inside);
        var obj = JSON.parse(inside);
        console.log(obj);
    }
})

But I get the following error:

undefined:2
  "sentences": [
^

SyntaxError: Unexpected token   in JSON at position 2
    at JSON.parse (<anonymous>)

And here is an excerpt from the output of the link, i.e. what I want to parse into obj:

{
&nbsp;&nbsp;&quot;sentences&quot;: [
&nbsp;&nbsp;&nbsp;&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&quot;index&quot;: &quot;0&quot;,
...
&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;]
}

How can I JSON.parse() such a string?

Thanks,

Final Answer

Both the output and the error you posted point to a problem parsing the whitespace character right after the opening JSON brace. I suggest removing all whitespace that is not within quotes.

As follows:

var obj = JSON.parse(str.replace(/(\s+?(?={))|(^\s+)|(\r|\n)|((?=[\[:,])\s+)/gm,''));

Original Answer

I suggest you remove all whitespace.

So, var obj = JSON.parse(inside.replace(/\s/g,'')); should work.

Here is a JSFiddle example
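
To see what stripping every whitespace character does, here is a minimal sketch. The input string is made up for illustration and is not the real CoreNLP output. It relies on the fact that \s in a JavaScript regex also matches U+00A0.

```javascript
// Illustrative input: U+00A0 indentation and a value that contains a space.
const text = '{\n\u00A0\u00A0"word": "hello world"\n}';

// In JavaScript, \s also matches U+00A0, so this strips the indentation...
const stripped = text.replace(/\s/g, '');
const parsed = JSON.parse(stripped);

// ...but it also removes the space inside the string value.
console.log(parsed.word); // helloworld
```

The parse succeeds, but any value that contained spaces has been altered.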

EDIT

Better: var obj = JSON.parse(str.replace(/(\s+?(?={))|(^\s+)|(\r|\n)|((?=[\[:,])\s+)/gm,'')); leaves spaces inside quotes as they are, which matters because the "parse" field has spaces in its value.

The problem is all of those &nbsp;s. They represent the non-breaking space character, U+00A0. Unfortunately, JSON.parse (correctly) chokes on those characters because the JSON spec, RFC 4627, only treats regular spaces (U+0020), tabs, and line breaks as whitespace.
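
A small sketch of that behavior, using two made-up inputs that differ only in the indentation character. The helper name parses is hypothetical.

```javascript
// Illustrative inputs only: the same one-key object, indented with
// a regular space (U+0020) versus a non-breaking space (U+00A0).
const withSpace = '{\n\u0020\u0020"index": "0"\n}';
const withNbsp = '{\n\u00A0\u00A0"index": "0"\n}';

// Hypothetical helper: returns true if JSON.parse accepts the text.
function parses(text) {
    try {
        JSON.parse(text);
        return true;
    } catch (err) {
        return false;
    }
}

console.log(parses(withSpace)); // true
console.log(parses(withNbsp));  // false
```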

You could do the hacky thing, which is to replace every U+00A0 with U+0020, but that would also affect non-breaking spaces inside of strings, which is not ideal.
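
As a rough sketch, the blanket replacement and a string-aware variant might look like this. Both helper names are hypothetical, and the sample input is made up. The second function assumes the text is otherwise valid JSON, so that a double quote always opens or closes a string literal.

```javascript
// Hypothetical helper: replace every U+00A0, including any inside strings.
function replaceAllNbsp(text) {
    return text.replace(/\u00A0/g, ' ');
}

// Hypothetical helper: replace U+00A0 only outside JSON string literals.
// The first alternative matches a whole string literal (escapes included)
// and is returned unchanged; the second matches a bare U+00A0.
function replaceNbspOutsideStrings(text) {
    return text.replace(/"(?:[^"\\]|\\.)*"|\u00A0/g, function (match) {
        return match === '\u00A0' ? ' ' : match;
    });
}

// Illustrative input: U+00A0 indentation plus a U+00A0 inside a value.
const sample = '{\n\u00A0\u00A0"word": "hello\u00A0world"\n}';

const blunt = JSON.parse(replaceAllNbsp(sample));
const careful = JSON.parse(replaceNbspOutsideStrings(sample));

console.log(blunt.word === 'hello world');        // true: the value was changed
console.log(careful.word === 'hello\u00A0world'); // true: the value was preserved
```

With the code from the question, either function would be applied to inside before calling JSON.parse.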

The best way to handle input data like this would be to use a JSON parsing library that is more tolerant of other kinds of whitespace characters.


Why aren't you running your own copy of CoreNLP? I imagine they don't want you scraping their server.
