Converting a key-value text file into a CSV file

I have a text file that needs to be converted into a CSV file using pandas. A sample of it is shown below:

time 00:15 min
    cod,10,1=0,2=2,3=2,4=1,5=6,6=4,7=2,8=7,9=1,10=9,11=7
    cod,18,1=27,2=18,3=19,4=20,5=47,6=2,7=2,8=0,9=33,10=61,11=13,12=2,13=3,14=0,15=0

The rows should be cod,10 and cod,18, and the columns should be 1, 2, 3, ..., 15. Any ideas? Regards, Ali

I use pandas for the conversion, but plain Python for some of the parsing; I hope that is all right.

One issue is that rows have different numbers of columns, so I fill the columns missing from a row with NaN. For instance, row 1 is shorter than row 2, so the columns missing from row 1 are given the value "NaN". Note that the padding assumes the keys in each row are numbered consecutively from 1.

Here is my idea:

import pandas as pd

lines = []
with open('/path/to/test.txt', 'r') as infile:
    for line in infile:
        if "," not in line:
            continue
        else:
            lines.append(line.strip().split(","))

row_names = []
column_data = {}

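# Pass a generator rather than unpacked arguments: max(*[n]) raises a
# TypeError when the file contains only one data line.
max_length = max(len(line) for line in lines)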

for line in lines:
    while(len(line) < max_length):
        line.append(f'{len(line)-1}=NaN')

for line in lines:
    row_names.append(" ".join(line[:2]))
    for info in line[2:]:
        (k,v) = info.split("=")
        if k in column_data:
            column_data[k].append(v)
        else:
            column_data[k] = [v]

df = pd.DataFrame(column_data)
df.index = row_names
print(df)

df.to_csv('/path/to/test.csv')

Output (the printed DataFrame):

         1   2   3   4   5  6  7  8   9  10  11   12   13   14   15
cod 10   0   2   2   1   6  4  2  7   1   9   7  NaN  NaN  NaN  NaN
cod 18  27  18  19  20  47  2  2  0  33  61  13    2    3    0    0

CSV File Output:

,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
cod 10,0,2,2,1,6,4,2,7,1,9,7,NaN,NaN,NaN,NaN
cod 18,27,18,19,20,47,2,2,0,33,61,13,2,3,0,0
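
As a variation on the answer above, here is a minimal sketch that lets pandas do the padding: each line becomes a dict, and building the DataFrame from a list of dicts leaves the missing keys as NaN. The helper name parse_rows and the inlined SAMPLE string are assumptions made for illustration; a real script would read from the file instead. Passing na_rep="NaN" to to_csv keeps the CSV identical to the output shown above (by default pandas writes missing values as empty fields).

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
SAMPLE = """time 00:15 min
    cod,10,1=0,2=2,3=2,4=1,5=6,6=4,7=2,8=7,9=1,10=9,11=7
    cod,18,1=27,2=18,3=19,4=20,5=47,6=2,7=2,8=0,9=33,10=61,11=13,12=2,13=3,14=0,15=0
"""


def parse_rows(infile):
    """Hypothetical helper: return (row_names, records) from an iterable of lines."""
    row_names, records = [], []
    for line in infile:
        if "," not in line:  # skips the "time ..." line and blank lines
            continue
        cod, num, *pairs = line.strip().split(",")
        row_names.append(f"{cod} {num}")
        # Each "key=value" pair becomes one dict entry for this row.
        records.append(dict(pair.split("=", 1) for pair in pairs))
    return row_names, records


row_names, records = parse_rows(io.StringIO(SAMPLE))

# Keys absent from a record become NaN in the resulting DataFrame.
df = pd.DataFrame(records, index=row_names)

# na_rep controls how missing values are written to the CSV text.
csv_text = df.to_csv(na_rep="NaN")
print(csv_text)
```

Unlike the padding loop, this version does not rely on the keys being numbered consecutively, since each value is stored under the key that appears in the file.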

You can use Python's standard csv module and its DictWriter class to handle the variability in the column names.

I like to split up multi-step tasks like these to make sure each step is progressing correctly. Here's the full code; a task-by-task description follows:

import csv

# Split text by lines, then each line by comma
cod_num_pairs = []
with open('input.txt') as f:
    next(f)  # discard first "time" line

    for line in f:
        if not line.strip():
            continue  # skip blank lines, which would break the unpacking below

        # *pairs will hold all the pairs of 'Col_name=Val'
        cod, num, *pairs = [x.strip() for x in line.split(',')]
        cod_num_pairs.append([cod, num, pairs])

# Build headers and rows
headers = {'id': None}
all_rows = []
for cod, num, pairs in cod_num_pairs:
    id_ = cod + ' ' + num
    row = {'id': id_}

    for pair in pairs:
        col_name, val = [x.strip() for x in pair.split('=')]

        row[col_name] = val
        headers[col_name] = None

    all_rows.append(row)

# Write to CSV
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    writer.writerows(all_rows)

When I run that, I get:

id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
cod 10,0,2,2,1,6,4,2,7,1,9,7,,,,
cod 18,27,18,19,20,47,2,2,0,33,61,13,2,3,0,0

Rendered as a table:

| id     | 1  | 2  | 3  | 4  | 5  | 6 | 7 | 8 | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
|--------|----|----|----|----|----|---|---|---|----|----|----|----|----|----|----|
| cod 10 | 0  | 2  | 2  | 1  | 6  | 4 | 2 | 7 | 1  | 9  | 7  |    |    |    |    |
| cod 18 | 27 | 18 | 19 | 20 | 47 | 2 | 2 | 0 | 33 | 61 | 13 | 2  | 3  | 0  | 0  |

My first step is to deal with the lines from the text file: discarding the first line and splitting each data line on commas. When that's done, cod_num_pairs looks like this:

import pprint

print('Cod, Num, and Pairs of `Col_name=Val`')
pprint.pprint(cod_num_pairs, width=130)

Output:

Cod, Num, and Pairs of `Col_name=Val`
[
['cod', '10', ['1=0', '2=2', '3=2', '4=1', '5=6', '6=4', '7=2', '8=7', '9=1', '10=9', '11=7']],
['cod', '18', ['1=27', '2=18', '3=19', '4=20', '5=47', '6=2', '7=2', '8=0', '9=33', '10=61', '11=13', '12=2', '13=3', '14=0', '15=0']]
]
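
To make the unpacking step concrete, here is a small sketch of how star-unpacking splits one line into its three parts. The helper split_record is hypothetical (it is not part of the code above); it also shows one way to guard against lines that do not hold a record, because unpacking `cod, num, *pairs` from a list with fewer than two items raises a ValueError.

```python
def split_record(line):
    """Hypothetical helper: return (cod, num, pairs), or None if the line has no record."""
    fields = [x.strip() for x in line.split(",")]
    if len(fields) < 2:
        # Blank lines and the "time ..." line contain no comma, so there is
        # nothing to unpack into cod and num.
        return None
    # The first two fields are the row label; *pairs collects the rest as a list.
    cod, num, *pairs = fields
    return cod, num, pairs


print(split_record("    cod,10,1=0,2=2,3=2\n"))
print(split_record("time 00:15 min\n"))
print(split_record("\n"))
```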

Then I move on to splitting up the Col_name=Val pairs and creating a list of dictionaries. Each dictionary is a row that will be turned into a CSV row in the final step. I also create a dict for the headers and add every Col_name to it; a dict keeps its keys unique and in insertion order, so the columns come out in the order they were first seen. When that's done:

# sort_dicts requires Python 3.8+
print('Headers:')
pprint.pprint(headers, width=200, sort_dicts=False)
print('Rows:')
pprint.pprint(all_rows, width=200, sort_dicts=False)

Output:

Headers:
{'id': None, '1': None, '2': None, '3': None, '4': None, '5': None, '6': None, '7': None, '8': None, '9': None, '10': None, '11': None, '12': None, '13': None, '14': None, '15': None}
Rows:
[{'id': 'cod 10', '1': '0', '2': '2', '3': '2', '4': '1', '5': '6', '6': '4', '7': '2', '8': '7', '9': '1', '10': '9', '11': '7'},
 {'id': 'cod 18', '1': '27', '2': '18', '3': '19', '4': '20', '5': '47', '6': '2', '7': '2', '8': '0', '9': '33', '10': '61', '11': '13', '12': '2', '13': '3', '14': '0', '15': '0'}]
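
The headers dict is effectively being used as an ordered set. A minimal sketch of that idea, using made-up shortened rows rather than the real data: collecting keys into a dict preserves first-seen order (guaranteed since Python 3.7), whereas a plain set would not. The numeric sort at the end is an optional extra, assuming every column name other than 'id' is an integer string, for files where the keys do not appear in ascending order.

```python
# Shortened, made-up rows for illustration only.
rows = [
    {"id": "cod 10", "2": "2", "1": "0"},
    {"id": "cod 18", "1": "27", "2": "18", "10": "61", "3": "19"},
]

# A dict used as an ordered set: keys are unique and keep first-seen order.
headers = {}
for row in rows:
    headers.update(dict.fromkeys(row))

first_seen = list(headers)

# Optional: put the value columns in numeric order, keeping 'id' first.
# key=int matters, because a plain string sort would place '10' before '2'.
value_columns = sorted((h for h in headers if h != "id"), key=int)
numeric_order = ["id"] + value_columns

print(first_seen)
print(numeric_order)
```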

Finally, with a complete set of headers and a list of dict rows, I pass both to csv.DictWriter. Keys that are missing from a row are written as empty fields.
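
If a placeholder such as NaN is preferred over empty fields, DictWriter accepts a restval argument that is written for any fieldname missing from a row. A minimal sketch, using shortened made-up rows and an in-memory buffer instead of a file so it can be run as-is:

```python
import csv
import io

# Shortened, made-up rows for illustration only.
rows = [
    {"id": "cod 10", "1": "0", "2": "2"},
    {"id": "cod 18", "1": "27", "2": "18", "3": "19"},
]
fieldnames = ["id", "1", "2", "3"]

# newline="" leaves the csv module's own line endings untouched.
buf = io.StringIO(newline="")

# restval is the value written for keys that a row does not have.
writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="NaN")
writer.writeheader()
writer.writerows(rows)

output = buf.getvalue()
print(output)
```

With the default restval (an empty string), the first data row would end in a bare trailing comma, as in the output shown earlier.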
