如何修復 ''UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to<undefined> ''？

Question

目前，我正在嘗試讓 Python 3 程序通過 Spyder IDE/GUI 對填充了信息的文本文件進行一些操作。 但是，在嘗試讀取文件時，出現以下錯誤：

  File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
    parser(f)

  File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
    data = infile.read()

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

程序代碼如下：

import os

os.getcwd()

import glob
import re
import sqlite3
import csv

def parser(file):

    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'

    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n   ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n   ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n   ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n   ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes except changing 'IN' to 'INC' because it is an invalid name
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)

Answer 1

正如您從https://en.wikipedia.org/wiki/Windows-1252看到的，代碼 0x9D 未在 CP1252 中定義。

“錯誤”例如在您的open函數中：您沒有指定編碼，因此 python（僅在 Windows 中）將使用一些系統編碼。 通常，如果您讀取的文件可能不是在同一台機器上創建的，那么指定編碼確實更好。

我建議在您open也放一個編碼來編寫 csv。 最好是明確的。

我不知道原始文件格式，但添加到 open , encoding='utf-8'通常是一件好事（它是 Linux 和 MacOs 的默認設置）。

Answer 2

在open語句中添加編碼例如：

f=open("filename.txt","r",encoding='utf-8')

Answer 3

以上對我不起作用，試試這個： , errors='ignore'創造奇跡！

Answer 4

如果您不需要解碼，您也可以嘗試file = open(filename, 'rb') 'rb' 轉換為讀取二進制文件。 假設您只想上傳到網站

Answer 5

errors='ignore' 解決了我的頭痛：

如何在目錄和子目錄中找到“昏迷”一詞=

import os
rootdir=('K:\\0\\000.THU.EEG.nedc_tuh_eeg\\000edf.01_tcp_ar\\01_tcp_ar\\')
for folder, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.txt'):
            fullpath = os.path.join(folder, file)
            with open(fullpath, 'r', errors='ignore') as f:
                for line in f:
                    if "coma" in line:
                        print(fullpath)
                        break

Answer 6

目前，我正在嘗試通過Spyder IDE / GUI獲得一個Python 3程序，以對充滿信息的文本文件進行一些操作。 但是，當嘗試讀取文件時，出現以下錯誤：

  File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
    parser(f)

  File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
    data = infile.read()

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

該程序的代碼如下：

import os

os.getcwd()

import glob
import re
import sqlite3
import csv

def parser(file):

    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'

    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n   ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n   ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n   ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n   ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes except changing 'IN' to 'INC' because it is an invalid name
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)

如何修復 ''UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to<undefined> ''？

問題描述

5 個解決方案

解決方案1
84 2018-03-29 18:10:59

解決方案2
50 2020-02-12 22:37:25

解決方案3
15 2018-12-08 12:55:56

解決方案4
9 2020-06-30 23:55:41

解決方案5
1 2019-11-19 14:43:15

解決方案6
-1 2019-05-13 07:03:46

如何修復 &#39;&#39;UnicodeDecodeError: &#39;charmap&#39; codec can&#39;t decode byte 0x9d in position 29815: character maps to<undefined> &#39;&#39;？

問題描述

5 個解決方案

解決方案1 84 2018-03-29 18:10:59

解決方案2 50 2020-02-12 22:37:25

解決方案3 15 2018-12-08 12:55:56

解決方案4 9 2020-06-30 23:55:41

解決方案5 1 2019-11-19 14:43:15

解決方案6 -1 2019-05-13 07:03:46

如何修復 ''UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to<undefined> ''？

解決方案1
84 2018-03-29 18:10:59

解決方案2
50 2020-02-12 22:37:25

解決方案3
15 2018-12-08 12:55:56

解決方案4
9 2020-06-30 23:55:41

解決方案5
1 2019-11-19 14:43:15

解決方案6
-1 2019-05-13 07:03:46