I'm using the Beautiful Soup library to parse the contents of a web page and print the results into a .txt file. This mostly works, but I can't get rid of certain unicode character codes that appear in the text output. For example:
"Failed to investigate issue with customer\u2019s terminal."
I have been using the "io" library to encode the output as utf-8. I have tried changing the encoding to ascii, but this doesn't work either.
def open_file(file):
    with open((file), encoding='utf-8') as input_data:
        global soup
        soup = BeautifulSoup(input_data)
        return soup
# stuff happens here to parse the html and prepare a list of dictionaries containing the content I want to print.
# this prepares the output
def dict_writer(dict_list, filename):
    with io.open('%s.txt' % filename, 'w', encoding="utf-8") as f:
        for dict in dict_list:
            content = json.dumps(dict.get("content"))
            loc_no = json.dumps(dict.get("location_number"))
            page_no = json.dumps(dict.get("page_number"))
            f.write("\n")
            f.write(content + " " + "(" + page_no + ", " + loc_no + ")" +"\n")
            f.write("\n")
I read an article to try to get a general understanding of how character encoding works. It seems that if I encoded the content as utf-8 in the open_file function, encoding the output in the same standard in the dict_writer function should work.
The reason you're getting non-ASCII characters encoded with \u escapes is that you're using json.dumps.
As you can see from the docs, the ensure_ascii parameter defaults to True, and, if true, "the output is guaranteed to have all incoming non-ASCII characters escaped".
So, you could just add ensure_ascii=False to all of your dumps calls.
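A quick way to see the difference, as a minimal sketch using a fragment of the string from the question (the variable names are only illustrative):

```python
import json

# U+2019 is the right single quotation mark from the question's example.
text = "customer\u2019s terminal"

escaped = json.dumps(text)                        # default: ensure_ascii=True
unescaped = json.dumps(text, ensure_ascii=False)  # non-ASCII characters kept as-is

print(escaped)    # the apostrophe appears as a backslash-u escape
print(unescaped)  # the apostrophe appears as the real character
```

Note that both results are still wrapped in double quotes, because json.dumps always produces a JSON string literal.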
But really, why are you using json.dumps in the first place? The format you're outputting isn't a JSON file. In fact, it seems to be something designed for human rather than computer consumption. So why do you want extra quotes, escape characters, etc. to make the parts of it JSON-parseable even though the whole isn't? It would be much simpler, and probably produce nicer output, if you just didn't do that:
content = dict.get("content")
loc_no = str(dict.get("location_number"))
page_no = str(dict.get("page_number"))
f.write("\n")
f.write(content + " " + "(" + page_no + ", " + loc_no + ")" +"\n")
… or, even better:
content = dict.get("content")
loc_no = dict.get("location_number")
page_no = dict.get("page_number")
f.write("\n")
f.write("{} ({}, {})\n".format(content, page_no, loc_no))
While we're at it, calling your dict dict is confusing (and means you can't access the dict constructor in the rest of your function without getting one of those errors that will keep you up all night debugging and then feeling like an idiot).
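To illustrate the kind of error this causes, here is a small sketch; the function name summarize and its contents are hypothetical and not part of the original code:

```python
def summarize(dict_list):
    for dict in dict_list:  # rebinds the name "dict" inside this function
        pass
    # "dict" now refers to the last item of the list, not the built-in type.
    return dict(total=len(dict_list))

try:
    summarize([{"content": "a"}])
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Renaming the loop variable (for example to ref) avoids the problem.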
Also, why are you using get("content") here? If you don't have to worry about cases where there is no content, just use ["content"], or, even more simply, just pass the dict to format_map:
for ref in refs:
    f.write("\n{content} ({page_number}, {location_number})\n".format_map(ref))
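As a small sketch of how format_map behaves (the sample values are made up for illustration), the field names in the template are looked up as keys in the dict, and a missing key raises KeyError:

```python
template = "\n{content} ({page_number}, {location_number})\n"

# Hypothetical sample record; the numbers are invented.
ref = {
    "content": "Failed to investigate issue with customer\u2019s terminal.",
    "page_number": 12,
    "location_number": 345,
}
line = template.format_map(ref)
print(line)

# A record without page_number cannot be formatted this way.
incomplete = {"content": "No page information"}
try:
    template.format_map(incomplete)
except KeyError as exc:
    missing_key = exc.args[0]
    print("missing key:", missing_key)
```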
If you do need to worry about such cases, surely you want some appropriate human-meaningful string, not None. For example:
for ref in refs:
    content = ref.get("content", "-- content missing --")
    page_no = ref.get("page_number", "N/A")
    loc_no = ref.get("location_number", "N/A")
    f.write("\n{} ({}, {})\n".format(content, page_no, loc_no))
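Putting the pieces together, a self-contained sketch of the writer might look like the following. The function name write_refs, the temporary file path, and the sample records are assumptions made for illustration; only the formatting logic comes from the snippets above.

```python
import io
import os
import tempfile

def write_refs(refs, path):
    """Write one human-readable line per record, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as f:
        for ref in refs:
            content = ref.get("content", "-- content missing --")
            page_no = ref.get("page_number", "N/A")
            loc_no = ref.get("location_number", "N/A")
            f.write("\n{} ({}, {})\n".format(content, page_no, loc_no))

# Hypothetical sample data; the second record lacks content and location.
refs = [
    {"content": "Failed to investigate issue with customer\u2019s terminal.",
     "page_number": 3, "location_number": 41},
    {"page_number": 4},
]

path = os.path.join(tempfile.mkdtemp(), "refs.txt")
write_refs(refs, path)

# Read the file back with the same encoding to check what was written.
with io.open(path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Because no json.dumps call is involved, the apostrophe is written as a real character rather than as an escape sequence.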