How to Replace Unicode Character Codes in TXT Output

I'm using the Beautiful Soup library to parse the contents of a web page and print the results into a .txt file. This mostly works but I can't get rid of certain unicode character codes that appear in the text output. For example:

"Failed to investigate issue with customer\u2019s terminal."

I have been using the io library to write the output as UTF-8. I have tried changing the encoding to ASCII, but that doesn't work either.

def open_file(file):
    with open((file), encoding='utf-8') as input_data:
        global soup
        soup = BeautifulSoup(input_data)
        return soup

# stuff happens here to parse the html and prepare a list of dictionaries containing the content I want to print.

# this prepares the output

def dict_writer(dict_list, filename):
    with io.open('%s.txt' % filename, 'w', encoding="utf-8") as f:
        for dict in dict_list:
            content = json.dumps(dict.get("content"))
            loc_no = json.dumps(dict.get("location_number"))
            page_no = json.dumps(dict.get("page_number"))
            f.write("\n")
            f.write(content + " " + "(" + page_no + ", " + loc_no + ")" +"\n")
            f.write("\n")

I read the article below to try to get a general understanding of how character encoding works. It seems like if I encoded the content in the open_file function, encoding the output in the same standard in the dict_writer function should work.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

The reason you're getting non-ASCII characters encoded as \u escapes is that you're using json.dumps. As you can see from the docs, the ensure_ascii parameter defaults to True, and, if true, "the output is guaranteed to have all incoming non-ASCII characters escaped".

So, you could just add ensure_ascii=False to all of your dumps calls.
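As a minimal sketch of the difference, using a sample string modeled on the one in the question (the variable names here are illustrative only):

```python
import json

# Sample text containing U+2019 RIGHT SINGLE QUOTATION MARK.
text = "customer\u2019s terminal"

# Default behaviour: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(text)

# With ensure_ascii=False the character is kept as-is.
# The result is still a JSON string, so it is still wrapped in double quotes.
literal = json.dumps(text, ensure_ascii=False)

print(escaped)
print(literal)
```

Note that even with ensure_ascii=False the output keeps the surrounding JSON quotes, which is part of why dropping json.dumps altogether may suit this output better.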

But really, why are you using json.dumps in the first place? The format you're outputting isn't a JSON file. In fact, it seems to be something designed for human rather than computer consumption. So why do you want extra quotes, escape characters, etc. to make the parts of it JSON-parseable even though the whole isn't? It would be much simpler, and probably produce nicer output, if you just didn't do that:

content = dict.get("content")
loc_no = str(dict.get("location_number"))
page_no = str(dict.get("page_number"))
f.write("\n")
f.write(content + " " + "(" + page_no + ", " + loc_no + ")" +"\n")

… or, even better:

content = dict.get("content")
loc_no = dict.get("location_number")
page_no = dict.get("page_number")
f.write("\n")
f.write("{} ({}, {})\n".format(content, page_no, loc_no))

While we're at it, calling your dict dict is confusing (and means you can't access the dict constructor in the rest of your function without getting one of those errors that will keep you up all night debugging and then feeling like an idiot).
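To illustrate the shadowing problem, here is a small sketch (build_copies is a hypothetical function, not part of the original code) showing what happens when the loop variable is named dict and the body then tries to use the built-in:

```python
def build_copies(dict_list):
    results = []
    for dict in dict_list:  # the loop variable now shadows the built-in dict
        try:
            # Intended as a call to the dict constructor, but "dict" is
            # now the current list element, which is not callable.
            results.append(dict(dict.items()))
        except TypeError as exc:
            results.append(str(exc))
    return results


outcome = build_copies([{"content": "x"}])
print(outcome)
```

Renaming the loop variable (to ref, for instance) avoids the problem entirely.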

Also, why are you using get("content") here?

If you don't have to worry about cases where there is no content, just use ["content"], or, even more simply, just pass the dict to format_map:

for ref in refs:
    f.write("\n{content} ({page_number}, {location_number})\n".format_map(ref))
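A short sketch of how format_map behaves, with made-up sample data: it looks each field name up in the mapping, and a missing key raises KeyError rather than printing None.

```python
template = "{content} ({page_number}, {location_number})"

# All three keys present: the fields are filled in by name.
ref = {"content": "Terminal rebooted.", "page_number": 12, "location_number": 340}
line = template.format_map(ref)

# A key is missing: format_map raises KeyError for the first missing field.
incomplete = {"page_number": 12}
try:
    template.format_map(incomplete)
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

print(line)
print(missing_key)
```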

If you do need to worry about such cases, surely you want some appropriate human-meaningful string, not None. For example:

for ref in refs:
    content = ref.get("content", "-- content missing --")
    page_no = ref.get("page_number", "N/A")
    loc_no = ref.get("location_number", "N/A")
    f.write("\n{} ({}, {})\n".format(content, page_no, loc_no))
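Putting the pieces together, a self-contained sketch of the writer might look like the following. The function name write_refs, the sample data, and the temporary output path are assumptions made for illustration; only the fallback strings and the output format come from the snippets above.

```python
import io
import os
import tempfile


def write_refs(refs, filename):
    """Write one human-readable line per reference, without json.dumps."""
    with io.open("%s.txt" % filename, "w", encoding="utf-8") as f:
        for ref in refs:
            # Fall back to readable placeholders when a key is missing.
            content = ref.get("content", "-- content missing --")
            page_no = ref.get("page_number", "N/A")
            loc_no = ref.get("location_number", "N/A")
            f.write("\n{} ({}, {})\n".format(content, page_no, loc_no))


# Hypothetical sample data; the second entry is deliberately incomplete.
sample_refs = [
    {
        "content": "Failed to investigate issue with customer\u2019s terminal.",
        "page_number": 3,
        "location_number": 41,
    },
    {"page_number": 7},
]

base = os.path.join(tempfile.mkdtemp(), "notes")
write_refs(sample_refs, base)

# Read the file back to check what was actually written.
with io.open(base + ".txt", encoding="utf-8") as f:
    written = f.read()
print(written)
```

Because the text is written directly to a UTF-8 file, the apostrophe should appear as the actual character rather than as an escape sequence.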
