I'm trying to read the JSON string that is inside the <pre> element here:
http://nlp.stanford.edu:8080/corenlp/process?input=hello%20world&outputFormat=json
If I copy-paste the string with the mouse, I can JSON.parse() it. But if I read it programmatically, I get an error.
Here is my code:
var request = require('request'); // to make POST requests
var Entities = require('html-entities').AllHtmlEntities; // to decode the json string (i.e. get rid of nbsp and quot's)
var fs = require('fs')

// Set the headers
var headers = {
    'User-Agent': 'Super Agent/0.0.1',
    'Content-Type': 'application/x-www-form-urlencoded'
}

// Configure the request
var options = {
    url: 'http://nlp.stanford.edu:8080/corenlp/process',
    method: 'POST',
    headers: headers,
    form: {
        'input': 'hello world',
        'outputFormat': 'json'
    }
}

// Start the request
request(options, function(error, response, body) {
    if (!error && response.statusCode == 200) {
        // Print out the response body
        console.log("body: " + body)
        let cheerio = require('cheerio')
        let $ = cheerio.load(body)
        var inside = $('pre').text();
        inside = Entities.decode(inside.toString());
        //console.log("inside "+ inside);
        var obj = JSON.parse(inside);
        console.log(obj);
    }
})
But I get the following error:
undefined:2
"sentences": [
^
SyntaxError: Unexpected token in JSON at position 2
at JSON.parse (<anonymous>)
And here is an excerpt from the output of the link, i.e. what I want to parse into obj:
{
  "sentences": [
    {
      "index": "0",
      ...
    }
  ]
}
How can I JSON.parse() such a string?
Thanks,
Final Answer
Both the output and the error you posted point to a failure parsing the whitespace character right after the opening JSON brace. I suggest removing all whitespace that is not inside quotes, as follows:
var obj = JSON.parse(str.replace(/(\s+?(?={))|(^\s+)|(\r|\n)|((?=[\[:,])\s+)/gm,''));
Original Answer
I suggest you remove all whitespace.
So, var obj = JSON.parse(inside.replace(/\s/g,'')); should work.
Here is a JSFiddle example
EDIT
Better: var obj = JSON.parse(str.replace(/(\s+?(?={))|(^\s+)|(\r|\n)|((?=[\[:,])\s+)/gm,''));
This leaves spaces inside quotes as they are, which matters because the "parse" field has spaces in its value.
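To illustrate why stripping every whitespace character is risky, here is a small sketch. The sample string is made up for the demonstration and is not the real CoreNLP output. In JavaScript, \s also matches U+00A0, so the parse succeeds, but spaces inside string values are removed as well.

```javascript
// Hypothetical sample: indentation uses U+00A0, and the value contains a space.
const raw = '{\n\u00A0\u00A0"text": "hello world"\n}';

// Blanket removal of all whitespace, including whitespace inside strings.
const stripped = raw.replace(/\s/g, '');
const obj = JSON.parse(stripped);

console.log(stripped); // the string no longer contains any whitespace
console.log(obj.text); // the space between the two words is gone
```

So the blanket version is only safe when none of the values contain whitespace you care about.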
The problem is all of those &nbsp;s. Those represent a non-breaking space character, U+00A0. Unfortunately, JSON.parse (correctly) chokes on those characters because the JSON spec, RFC 4627, only treats regular spaces (U+0020), tabs, and line breaks as whitespace.
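A quick way to confirm this behaviour, as a minimal sketch: the two sample strings and the tryParse helper below are made up for illustration. The only difference between the inputs is which space character follows the line break.

```javascript
// Same document twice: one indented with U+0020, one with U+00A0.
const withSpace = '{\n  "sentences": []\n}';
const withNbsp = '{\n\u00A0\u00A0"sentences": []\n}';

// Hypothetical helper: reports whether JSON.parse accepts the text.
function tryParse(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, message: err.message };
  }
}

console.log(tryParse(withSpace)); // parses fine
console.log(tryParse(withNbsp));  // rejected with a SyntaxError
```

In the second string the first U+00A0 sits at index 2, which matches the "position 2" in the error from the question.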
You could do the hacky thing, which is to replace every U+00A0 with U+0020, but that would also affect non-breaking spaces inside of strings, which is not ideal.
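If you do go the replacement route, one possible refinement is to skip over string literals while replacing. The sketch below assumes the text is otherwise valid JSON; normalizeNbspOutsideStrings is a name invented here, not part of any library.

```javascript
// Replace U+00A0 with U+0020 only outside of JSON string literals.
// The regex matches either a whole string literal (kept unchanged)
// or a single non-breaking space (replaced).
function normalizeNbspOutsideStrings(text) {
  return text.replace(/"(?:[^"\\]|\\.)*"|\u00A0/g, function (match) {
    return match === '\u00A0' ? ' ' : match;
  });
}

// Hypothetical input: U+00A0 as indentation and also inside a value.
const input = '{\u00A0"a":\u00A0"x\u00A0y"}';
const cleaned = normalizeNbspOutsideStrings(input);
const parsed = JSON.parse(cleaned);

console.log(cleaned);
console.log(parsed.a === 'x\u00A0y'); // the U+00A0 inside the string survives
```

Because string literals are matched first and returned as-is, any non-breaking space that is genuinely part of a value is left alone.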
The best way to handle input data like this would be to use a JSON parsing library that is more tolerant of other kinds of whitespace characters.
Why aren't you running your own copy of CoreNLP? I imagine they don't want you scraping their server.