
Not accepting certain characters when writing to text file python

At the end of my function, I write the results to a text file, which is created because it doesn't exist, as such:

new_file = charity + ".txt"
with open(new_file, "w") as handle:
    handle.write("Matches found for " + charity.upper() + " in order of compatibility:\n")
    for item in match_lst:
        handle.write("Grant: " + item[2] + ". Funding offered: " + int_to_str(item[1]))
        handle.write("Number of matches: " + str(item[0] - 1) + "\n")

My problem is that when it writes to the new file, it seems it doesn't acknowledge the newline character, the '£' character and the apostrophe character. To show what I'm talking about, here's an extract of the output file:

Matches found for BLA in order of compatibility:
Grant: The Taylor Family Foundation. Funding offered: �500,000.00Number of matches: 1
Grant: The Peter Cruddas Foundation. Funding offered: �200,000.00Number of matches: 1
Grant: The London Marathon Charitable Trust Limited - Major Capital Project 
Grants. Funding offered: �150,000.00Number of matches: 1
Grant: The Hadley Trust. Funding offered: �100,000.00Number of matches: 1
Grant: The Company Of Actuaries� Charitable Trust Fund. Funding offered: �65,000.00Number of matches: 1
Grant: The William Wates Memorial Trust. Funding offered: �50,000.00Number of matches: 1
Grant: The Nomura Charitable Trust. Funding offered: �50,000.00Number of matches: 1
Grant: The Grocers� Charity. Funding offered: �40,000.00Number of matches: 1

For reference, here is the information (i.e. match_lst) that I'm trying to write, in its original data structure:

[(2, 500000.0, 'The Taylor Family Foundation', ['Young People', 'Arts Or Heritage', 'Social Reserarch'], ['Registered Charity']), 
(2, 200000.0, 'The Peter Cruddas Foundation', ['Young People'], ['Registered Charity', 'Other']),
(2, 150000.0, 'The London Marathon Charitable Trust Limited - Major Capital Project Grants', ['Infrastructure Support', 'Sport And Recreational Activities'], ['Registered Charity', 'Limited Company', 'Other']), 
(2, 100000.0, 'The Hadley Trust', ['Social Relief And Care', 'Crime And Victimisation', 'Young People', 'Social Reserarch'], ['Registered Charity', 'Limited Company']), 
(2, 65000.0, 'The Company Of Actuaries’ Charitable Trust Fund', ['Young People', 'Disabilities', 'Social Relief And Care', 'Medical Research'], ['Registered Charity']), 
(2, 50000.0, 'The William Wates Memorial Trust', ['Young People', 'Arts Or Heritage', 'Sport And Recreational Activities'], ['Registered Charity', 'Other']), 
(2, 50000.0, 'The Nomura Charitable Trust', ['Young People', 'Education And Learning', 'Unemployment'], ['Registered Charity']), 
(2, 40000.0, 'The Grocers’ Charity', ['Poverty', 'Young People', 'Disabilities', 'Healthcare Sector', 'Arts Or Heritage'], ['Registered Charity']) ]

As you can see, all the characters are printed correctly here.

For further context, here is my simple int_to_str function:

def int_to_str(num_int):
    if num_int == 0:
        return "Discretionary"
    else:
        return '£' + '{:,.2f}'.format(num_int)

So my question is how can I fix this to print all the characters that are missing/encoded?

It is hard to say exactly what happens without more details, but this is indeed a character encoding problem. Let us look at the characters that fail to display correctly:

  • Newline character: it depends on the OS. It is \n alone on Unix-like systems and \r\n (2 characters) on Windows.
  • '£', or POUND SIGN: it is the Unicode character U+00A3. In Windows code page 1252 and in Latin1 (ISO-8859-1) it is the single byte b'\xa3', while in UTF-8 it is encoded as b'\xc2\xa3'. Even more interestingly, if you try to display a lone b'\xa3' as UTF-8, you get the REPLACEMENT CHARACTER U+FFFD, which reads as '�'.
  • Apostrophe character: the true APOSTROPHE ("'") is the ASCII character U+0027. No problem there. But some Unicode-enabled editors silently replace it with the RIGHT SINGLE QUOTATION MARK (U+2019, "’"), which is what the names in your data contain. That character does not exist in Latin1; in Windows code page 1252 it is the single byte b'\x92', and in UTF-8 it takes three bytes, b'\xe2\x80\x99'.
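
The byte values for these characters can be checked in a Python 3 interpreter. This is only a small sketch, and the variable names are illustrative rather than taken from the question:

```python
pound = "\u00a3"   # POUND SIGN
quote = "\u2019"   # RIGHT SINGLE QUOTATION MARK

# The pound sign is one byte in the legacy 8-bit encodings, two in UTF-8.
cp1252_pound = pound.encode("cp1252")
latin1_pound = pound.encode("latin-1")
utf8_pound = pound.encode("utf-8")

# A lone 0xA3 byte is not valid UTF-8; with errors="replace" the decoder
# substitutes U+FFFD, the REPLACEMENT CHARACTER.
decoded_wrong = b"\xa3".decode("utf-8", errors="replace")

# The curly quote exists in cp1252 and UTF-8 ...
cp1252_quote = quote.encode("cp1252")
utf8_quote = quote.encode("utf-8")

# ... but Latin-1 has no code point for it, so encoding fails.
try:
    quote.encode("latin-1")
    latin1_has_quote = True
except UnicodeEncodeError:
    latin1_has_quote = False

print(cp1252_pound, utf8_pound, cp1252_quote, utf8_quote, latin1_has_quote)
```

If the file is written in one of these encodings and viewed in another, the mismatch shows up as a replacement character or as stray extra characters.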

All of this just means that the details matter. Without knowing exactly how you read the data from the source file, or how that file was built, it is not possible to explain what actually happens. A text file is an abstraction: real text files are sequences of bytes with a given encoding and a given end-of-line convention.
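
To make the "sequence of bytes" point concrete, here is a minimal sketch that writes one short line and then reads the file back in binary mode. The scratch file path is hypothetical, and the newline="\r\n" argument is used here only to force Windows-style line endings on any OS:

```python
import os
import tempfile

# Hypothetical scratch file, created in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text mode: the string is encoded as UTF-8 and "\n" is translated to "\r\n".
with open(path, "w", encoding="utf-8", newline="\r\n") as handle:
    handle.write("\u00a3" + "1\n")

# Binary mode: no decoding and no newline translation, just the raw bytes.
with open(path, "rb") as handle:
    raw = handle.read()

print(raw)  # b'\xc2\xa31\r\n'
```

The three characters written become five bytes on disk: two for the pound sign, one for the digit, and two for the line ending.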

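The newline characters are in fact being written: each record starts on its own line, so the \n is there, it is just not visible as a character. The only missing line break is between the funding amount and "Number of matches", because the first handle.write call in the loop does not end with "\n". To fix your encoding problem, specify the encoding when you open the file: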

with open(new_file, 'w', encoding="utf-8") as handle:
    ...
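
Applied to the code from the question, a minimal sketch might look like the following. The sample data is a shortened, illustrative copy of match_lst, the temporary directory is an assumption made so that the example runs anywhere, and the "\n" added to the first write call in the loop is a suggested change to put the match count on its own line:

```python
import os
import tempfile


def int_to_str(num_int):
    # Same helper as in the question.
    if num_int == 0:
        return "Discretionary"
    return "\u00a3" + "{:,.2f}".format(num_int)


# Illustrative sample in the same shape as match_lst (extra fields omitted).
charity = "bla"
match_lst = [
    (2, 500000.0, "The Taylor Family Foundation"),
    (2, 65000.0, "The Company Of Actuaries\u2019 Charitable Trust Fund"),
]

# Assumption: write into a temporary directory instead of the working directory.
new_file = os.path.join(tempfile.mkdtemp(), charity + ".txt")

with open(new_file, "w", encoding="utf-8") as handle:
    handle.write("Matches found for " + charity.upper() + " in order of compatibility:\n")
    for item in match_lst:
        handle.write("Grant: " + item[2] + ". Funding offered: " + int_to_str(item[1]) + "\n")
        handle.write("Number of matches: " + str(item[0] - 1) + "\n")

# Read the file back with the same encoding to check the round trip.
with open(new_file, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
```

Whatever program is used to view the file also has to read it as UTF-8; a viewer that assumes a different encoding will still show the wrong characters even though the file itself is correct.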

I will post this as an answer for future visitors to the question.

Thanks


 