Python - replace() and utf-8 encoding / decoding using .txt files with codecs

I am having trouble with UTF-8 encoding and decoding. The following function should replace multiple words in the given data. The background is that I am parsing multiple PDF documents and trying to find certain keywords for further use. Because parsing a PDF to a string introduces misspellings, this function works around that. I have shrunk the function significantly; normally there are more replacements, more types and so on, but the main problem already shows up in this small part.

While replaceSimilar1() works perfectly fine, replaceSimilar2() does not replace the words the way I want it to. The .txt files hold exactly the same entries as the arrays and are saved in UTF-8. I know that it has to do with encoding or decoding some of the parts, but nothing I have tried so far has worked. No exception is raised; the given words simply are not replaced.

Here is my code (including a main for testing):

# -*- coding: utf-8 -*-
import codecs


RESOURCE_PATH="resource"


def replaceSimilar1(data, type):

    ZMP_array=["zählpunkt:", "lpunkt:", "zmp:", "hipunkt:", "h punkt:", "htpunkt:", ]
    adress_array=["zirkusweg", "zirktisweg", "rkusweg", "zirnusweg", "jürgen-töpfer", "-töpfer", "jürgen-", "pfer-stras", "jürgentöpfer", "ürgenlöpfer"]

    if type=="adress":
        array=adress_array

    elif type=="zmp":
        array=ZMP_array

    else:
        array=["",""]

    for word in array:
        data=data.lower().replace(word, type)

    return data


def replaceSimilar2(data, type):
    c_file=codecs.open(RESOURCE_PATH+"\\"+type+".txt", "r+", "utf-8")
    for line in c_file.readlines():
        data=data.lower().replace(line.encode("utf-8"), type)
    c_file.close()
    return data


if __name__=="__main__":

    testString="this lpunkt: should be replaced by zmp as well as -töpfer should be replaced by adress..."
    print("testString: "+testString)

    #PART 1:
    replaced1=replaceSimilar1(testString, "zmp")
    replaced1=replaceSimilar1(replaced1, "adress")
    print("replaced 1: "+replaced1)

    # PART 2:
    replaced2=replaceSimilar2(testString, "zmp")
    replaced2=replaceSimilar2(replaced2, "adress")
    print("replaced 2: "+replaced2)

The problem is not the encoding, but the fact that when you read the file, each line ends with a newline character (\n). The search string is therefore "lpunkt:\n" rather than "lpunkt:", and that never occurs in the data. Use line.strip() instead, changing the function to

def replaceSimilar2(data, type):
    c_file=codecs.open(RESOURCE_PATH+"\\"+type+".txt", "r+", "utf-8")
    for line in c_file:
        # strip() drops the trailing newline so the word can actually match
        data=data.lower().replace(line.strip().encode("utf-8"), type)
    c_file.close()
    return data
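
The code above appears to target Python 2, where data is a byte string and encoding each line is what makes replace() accept it. For Python 3 the same idea could look roughly like the sketch below. The name replace_similar, the resource_dir parameter, the "utf-8-sig" codec (which also drops a byte order mark, if the editor wrote one) and the skipping of empty lines are assumptions added for illustration; none of them come from the original post.

```python
import io
import os
import tempfile


def replace_similar(data, label, resource_dir):
    """Replace every variant listed in <resource_dir>/<label>.txt with label."""
    path = os.path.join(resource_dir, label + ".txt")
    data = data.lower()
    # "utf-8-sig" decodes UTF-8 and silently removes a leading BOM if present.
    with io.open(path, "r", encoding="utf-8-sig") as handle:
        for line in handle:
            word = line.strip()  # remove the trailing newline and stray spaces
            if word:  # an empty search string would match between all characters
                data = data.replace(word, label)
    return data


if __name__ == "__main__":
    # Build a throwaway resource file so the example runs on its own.
    resource_dir = tempfile.mkdtemp()
    zmp_path = os.path.join(resource_dir, "zmp.txt")
    with io.open(zmp_path, "w", encoding="utf-8") as handle:
        handle.write(u"zählpunkt:\nlpunkt:\nzmp:\n")

    print(replace_similar(u"this lpunkt: should be replaced", "zmp", resource_dir))
    # prints: this zmp should be replaced
```

Because the file is decoded to text while reading, no encode() call is needed here: both the search word and the data are text strings. The with block also closes the file even if an error occurs partway through.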
