
pickle a list as UTF-8

I want to import all files from one directory into MySQL, but I have to make the same changes to each original.htb file first. The problems with the original file are:

  1. I don't want to import the column headers or the 2nd line, because it's blank

  2. I need to change \t\t\t\n to just \n so MySQL knows where fields and lines end

  3. I need to remove -----\n because it has only 1 column, which doesn't match my table (4 columns)

Here's what the original.htb file looks like:

    Beschreibung\t Kurzbeschreibung\t Einheit\t Wert\t\t\t\n

    \n

    Hub\t Hub\t mm\t 150.000000000000\t\t\t\n

    Bohrung\t Bohru\t mm\t 135.000000000000\t\t\t\n

    -----\n

So far I have managed to create a list of all files. My next step would be to write that list to a single file, which I can then edit. The problem is that I get a formatting issue when I save the list to a file. I want the final file to be UTF-8 encoded. This is what I want my file to look like:

Hub Hub mm  150.000000000000            
Bohrung Bohru   mm  135.000000000000            

but what I get at the moment is:

”ŒHub   Hub mm  150.000000000000            
”Œ%Bohrung  Bohru   mm  135.000000000000        

Here's my code:

import os
import pickle

folderpath = r"C:/Users/l-reh/Desktop/HTB" 
filepaths  = [os.path.join("C:/Users/l-reh/Desktop/HTB/", name) for name in os.listdir(folderpath)]
all_files = []

for path in filepaths:
    with open(path, 'r') as f:
        file = f.readlines()
        all_files.append(file)

with open("C:/Users/l-reh/Desktop/Bachelorarbeit/DB Testdatensatz/HTB.htb", 'wb') as f:
    pickle.dump(all_files, f)

pickle produces a binary format that includes per-field "header" bytes (describing type, length and, for some pickle protocols, framing data), and those bytes are going to look like garbage if you view the output as text. You can't say "I want it to be pickle, but without these bytes", because those bytes are part of the pickle serialization format. If you don't want them, you need to choose a different serialization format (presumably a custom serializer that matches this HTB format). This has nothing to do with UTF-8 encoding or the lack of it (your input is ASCII); the problem is that you are demanding a result that is impossible within the limits of your design.
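As a rough sketch of what such a custom serializer could look like, the code below writes the cleaned lines as plain UTF-8 text instead of pickling them. It assumes the layout shown in the question (header on line 1, blank line 2, trailing tabs on data rows, a "-----" separator line). The names clean_htb_lines, merge_htb_files and the in_encoding parameter are made up for illustration, and the input encoding is a guess that may need changing (for example to "cp1252") if the files contain non-ASCII characters.

```python
import os


def clean_htb_lines(lines):
    """Yield the data rows of one .htb file as tab-separated text lines."""
    # Assumption: line 1 is the column header and line 2 is blank,
    # as in the sample from the question, so both are skipped.
    for line in lines[2:]:
        row = line.rstrip()  # drops the trailing "\t\t\t\n"
        if not row or set(row) == {"-"}:
            continue  # skip blank lines and the "-----" separator
        yield row + "\n"


def merge_htb_files(folder, out_path, in_encoding="utf-8"):
    """Write the cleaned rows of every file in `folder` to one UTF-8 text file."""
    # Text mode with an explicit encoding replaces pickle.dump();
    # newline="\n" keeps the line terminator identical on every platform.
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue
            with open(path, "r", encoding=in_encoding) as f:
                out.writelines(clean_htb_lines(f.readlines()))


# Hypothetical usage with the paths from the question:
# merge_htb_files("C:/Users/l-reh/Desktop/HTB",
#                 "C:/Users/l-reh/Desktop/Bachelorarbeit/DB Testdatensatz/HTB.htb")
```

The output file should live outside the input folder, as it does in the question; otherwise a second run would read the merged file back in as input.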
