Python - replace() and utf-8 encoding / decoding using .txt files with codecs

I am having some trouble with UTF-8 encoding and decoding. The following function should replace multiple words in the given data. The background is that I am parsing multiple PDF documents and trying to find certain keywords for further use. Because parsing a PDF to a string leads to misspellings, this function works around them. I shrank the function significantly; normally there are more replacements, more types and so on, but the main problem already exists in this small part.

While replaceSimilar1() works perfectly fine, replaceSimilar2() does not replace the words the way I want it to. The txt documents hold exactly the same entries as the arrays and are saved in UTF-8. I know that it has to do with encoding or decoding some of the parts, but nothing I have tried so far has worked. No exception is raised; it just doesn't replace the given words.

Here is my code (including a main block for testing):

# -*- coding: utf-8 -*-
import codecs


RESOURCE_PATH="resource"


def replaceSimilar1(data, type):

    ZMP_array=["zählpunkt:", "lpunkt:", "zmp:", "hipunkt:", "h punkt:", "htpunkt:", ]
    adress_array=["zirkusweg", "zirktisweg", "rkusweg", "zirnusweg", "jürgen-töpfer", "-töpfer", "jürgen-", "pfer-stras", "jürgentöpfer", "ürgenlöpfer"]

    if type=="adress":
        array=adress_array

    elif type=="zmp":
        array=ZMP_array

    else:
        array=["",""]

    for word in array:
        data=data.lower().replace(word, type)

    return data


def replaceSimilar2(data, type):
    c_file=codecs.open(RESOURCE_PATH+"\\"+type+".txt", "r+", "utf-8")
    for line in c_file.readlines():
        data=data.lower().replace(line.encode("utf-8"), type)
    c_file.close()
    return data


if __name__=="__main__":

    testString="this lpunkt: should be replaced by zmp as well as -töpfer should be replaced by adress..."
    print("testString: "+testString)

    #PART 1:
    replaced1=replaceSimilar1(testString, "zmp")
    replaced1=replaceSimilar1(replaced1, "adress")
    print("replaced 1: "+replaced1)

    # PART 2:
    replaced2=replaceSimilar2(testString, "zmp")
    replaced2=replaceSimilar2(replaced2, "adress")
    print("replaced 2: "+replaced2)

The problem is not the encoding but the newline: every line read from the file ends with a newline character (\n), so the search string never matches the data. Use line.strip() to remove it, changing the function to:

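def replaceSimilar2(data, type):
    # One search term per line; codecs.open decodes each line from UTF-8.
    c_file=codecs.open(RESOURCE_PATH+"\\"+type+".txt", "r+", "utf-8")
    for line in c_file:
        # strip() removes the trailing newline and surrounding whitespace;
        # encode() turns the term back into a byte string like data (as in the question).
        word=line.strip().encode("utf-8")
        if word:  # skip blank lines: replacing "" would insert type between all characters
            data=data.lower().replace(word, type)
    c_file.close()
    return data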
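
For readers on Python 3, here is a minimal sketch of the same idea. It assumes text strings throughout, so the .encode() call is dropped. The name replace_similar and the resource_dir parameter are illustrative and not part of the original code. The main block first shows the root cause in isolation, then runs the function against a temporary word list.

```python
import os
import tempfile


def replace_similar(data, label, resource_dir):
    """Replace every term listed in <resource_dir>/<label>.txt with label."""
    path = os.path.join(resource_dir, label + ".txt")
    data = data.lower()
    # In Python 3, open() with an encoding yields str, so no encode() is needed.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip()  # drop the trailing newline
            if word:  # a blank line would otherwise replace "" everywhere
                data = data.replace(word, label)
    return data


if __name__ == "__main__":
    # Root cause in isolation: the trailing newline prevents a match.
    print("a lpunkt: b".replace("lpunkt:\n", "zmp"))          # a lpunkt: b
    print("a lpunkt: b".replace("lpunkt:\n".strip(), "zmp"))  # a zmp b

    # Hypothetical word list written to a temporary directory for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "zmp.txt"), "w", encoding="utf-8") as f:
            f.write("zählpunkt:\nlpunkt:\n\n")
        print(replace_similar("This lpunkt: text", "zmp", tmp))
```

If the txt files were saved with a byte order mark, opening them with encoding="utf-8-sig" instead keeps the mark out of the first search term.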
