簡體   English   中英

chardetect將UTF-8編碼的文件選擇為ASCII

[英]UTF-8 encoded file is picked by chardetect as ASCII

我正在編寫一個包含文件夾中所有文件的文件,我希望文本文件采用UTF-8編碼,我的代碼如下

import os
import codecs
import re
def file_concatenation(path):
    with codecs.open('C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt', 'w',encoding='utf8') as outfile:
        for root, dirs, files in os.walk(path):            
                    for dir_name in dirs:    
                        for fname in os.listdir(root+"/"+dir_name):
                            with open(root+"/"+dir_name+"/"+fname) as infile:
                                for line in infile:                                    
                                    new_line = re.sub('[^a-zA-Z]', ' ',line)                                      
                                    outfile.write(re.sub("\s\s+", " ", new_line.lstrip()))
file_concatenation('C:/Users/JAYASHREE/Documents/NLP/bbc-fulltext/bbc')

當我使用chardetect查找編碼時,它顯示為ASCII,置信度為1.0

C:\Users\JAYASHREE>chardetect "C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt"
C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt: ascii with confidence 1.0

請解決問題。 謝謝

使用encoding='utf-8-sig"'強制BOM在文件開頭。應由chardetect拾取。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM