I have a text file that needs to be converted into a CSV file using pandas. A sample of it is shown below:
time 00:15 min
cod,10,1=0,2=2,3=2,4=1,5=6,6=4,7=2,8=7,9=1,10=9,11=7
cod,18,1=27,2=18,3=19,4=20,5=47,6=2,7=2,8=0,9=33,10=61,11=13,12=2,13=3,14=0,15=0
The rows should be cod,10 and cod,18, and the columns should be 1, 2, 3, ..., 15. Any ideas? Regards, Ali
I use pandas for the conversion, but plain Python to handle some aspects of the data. I hope that is all right.
One issue we need to deal with is that the rows have different numbers of columns, so I put NaN in the columns that are missing from a row. For instance, row 1 is shorter than row 2, so the columns missing from row 1 are given the value "NaN".
Here is my idea:
import pandas as pd

lines = []
with open('/path/to/test.txt', 'r') as infile:
    for line in infile:
        # Skip lines without commas, such as the "time 00:15 min" line
        if "," not in line:
            continue
        lines.append(line.strip().split(","))

row_names = []
column_data = {}

# Pad shorter rows with "<column>=NaN" so every row has the same columns
max_length = max(len(line) for line in lines)
for line in lines:
    while len(line) < max_length:
        line.append(f'{len(line)-1}=NaN')

for line in lines:
    # The first two fields ("cod", "10") form the row name
    row_names.append(" ".join(line[:2]))
    for info in line[2:]:
        k, v = info.split("=")
        if k in column_data:
            column_data[k].append(v)
        else:
            column_data[k] = [v]

df = pd.DataFrame(column_data)
df.index = row_names
print(df)
df.to_csv('/path/to/test.csv')
Output (the printed DataFrame):
         1   2   3   4   5  6  7  8   9  10  11   12   13   14   15
cod 10   0   2   2   1   6  4  2  7   1   9   7  NaN  NaN  NaN  NaN
cod 18  27  18  19  20  47  2  2  0  33  61  13    2    3    0    0
CSV File Output:
,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
cod 10,0,2,2,1,6,4,2,7,1,9,7,NaN,NaN,NaN,NaN
cod 18,27,18,19,20,47,2,2,0,33,61,13,2,3,0,0
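The manual padding step can also be left to pandas: when a DataFrame is built from a list of dictionaries, keys missing from a row are filled with NaN automatically. The following is a minimal sketch of that variation, not the code above. The `parse_rows` helper and the in-memory `sample` string are assumptions for illustration; a real script would read from the text file instead.

```python
import io
import pandas as pd

# Sample input copied from the question; a real script would open the file instead.
sample = """time 00:15 min
cod,10,1=0,2=2,3=2,4=1,5=6,6=4,7=2,8=7,9=1,10=9,11=7
cod,18,1=27,2=18,3=19,4=20,5=47,6=2,7=2,8=0,9=33,10=61,11=13,12=2,13=3,14=0,15=0
"""

def parse_rows(lines):
    """Hypothetical helper: return (row_names, records), one dict per data line."""
    row_names, records = [], []
    for line in lines:
        if "," not in line:  # skip the "time ..." line
            continue
        name, num, *pairs = line.strip().split(",")
        row_names.append(f"{name} {num}")
        # Each "col=val" pair becomes one dictionary entry
        records.append(dict(pair.split("=") for pair in pairs))
    return row_names, records

row_names, records = parse_rows(io.StringIO(sample))
df = pd.DataFrame(records, index=row_names)  # missing keys become NaN
csv_text = df.to_csv(na_rep="NaN")           # write missing values as "NaN"
print(csv_text)
```

Passing a file path to `to_csv` instead of calling it without one would write the result to disk. Without `na_rep`, pandas writes missing values as empty fields.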
You can use Python's standard csv module and its DictWriter class to handle the variability in the column names.
I like to split multi-step tasks like this one into stages so I can check that each step is working correctly. Here's the full code; a task-by-task description follows:
import csv

# Split text by lines, then each line by comma
cod_num_pairs = []
with open('input.txt') as f:
    next(f)  # discard first "time" line
    for line in f:
        # *pairs will hold all the pairs of 'Col_name=Val'
        cod, num, *pairs = [x.strip() for x in line.split(',')]
        cod_num_pairs.append([cod, num, pairs])

# Build headers and rows
headers = {'id': None}
all_rows = []
for cod, num, pairs in cod_num_pairs:
    id_ = cod + ' ' + num
    row = {'id': id_}
    for pair in pairs:
        col_name, val = [x.strip() for x in pair.split('=')]
        row[col_name] = val
        headers[col_name] = None
    all_rows.append(row)

# Write to CSV
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    writer.writerows(all_rows)
When I run that, I get:
id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
cod 10,0,2,2,1,6,4,2,7,1,9,7,,,,
cod 18,27,18,19,20,47,2,2,0,33,61,13,2,3,0,0
| id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|--------|----|----|----|----|----|---|---|---|----|----|----|----|----|----|----|
| cod 10 | 0 | 2 | 2 | 1 | 6 | 4 | 2 | 7 | 1 | 9 | 7 | | | | |
| cod 18 | 27 | 18 | 19 | 20 | 47 | 2 | 2 | 0 | 33 | 61 | 13 | 2 | 3 | 0 | 0 |
My first step is to deal with the lines from the text file: getting rid of the first line, and splitting the data lines up by comma (,). When that's done, cod_num_pairs looks like this:
print('Cod, Num, and Pairs of `Col_name=Val`')
pprint.pprint(cod_num_pairs, width=130)
Cod, Num, and Pairs of `Col_name=Val`
[
['cod', '10', ['1=0', '2=2', '3=2', '4=1', '5=6', '6=4', '7=2', '8=7', '9=1', '10=9', '11=7']],
['cod', '18', ['1=27', '2=18', '3=19', '4=20', '5=47', '6=2', '7=2', '8=0', '9=33', '10=61', '11=13', '12=2', '13=3', '14=0', '15=0']]
]
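The `cod, num, *pairs = ...` assignment relies on Python's extended iterable unpacking: the first two fields are bound to names and the starred name collects whatever is left into a list. As a small sketch, using a shortened sample line that is made up for illustration:

```python
line = "cod,10,1=0,2=2,3=2\n"  # shortened sample line, for illustration only

# The first two fields go to named variables; the starred name
# collects all remaining fields in a list.
cod, num, *pairs = [x.strip() for x in line.split(",")]

print(cod)    # cod
print(num)    # 10
print(pairs)  # ['1=0', '2=2', '3=2']
```

If a line had only the two leading fields, `pairs` would simply be an empty list. A line with fewer than two fields would raise a ValueError.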
Then I move on to splitting up the Col_name=Val pairs, and creating a list of dictionaries. Each dictionary is a row that will be turned into a CSV row in the final step. I also create a dict for the headers and make sure every Col_name is added to it. When that's done:
print('Headers:')
pprint.pprint(headers, width=200, sort_dicts=False)
print('Rows:')
pprint.pprint(all_rows, width=200, sort_dicts=False)
Headers:
{'id': None, '1': None, '2': None, '3': None, '4': None, '5': None, '6': None, '7': None, '8': None, '9': None, '10': None, '11': None, '12': None, '13': None, '14': None, '15': None}
Rows:
[{'id': 'cod 10', '1': '0', '2': '2', '3': '2', '4': '1', '5': '6', '6': '4', '7': '2', '8': '7', '9': '1', '10': '9', '11': '7'},
{'id': 'cod 18', '1': '27', '2': '18', '3': '19', '4': '20', '5': '47', '6': '2', '7': '2', '8': '0', '9': '33', '10': '61', '11': '13', '12': '2', '13': '3', '14': '0', '15': '0'}]
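The headers dict effectively works as an ordered set: a dict keeps keys in insertion order (guaranteed since Python 3.7), and assigning to a key that already exists does not move it. A plain set would also remove duplicates but would not preserve the column order. A minimal sketch with made-up rows:

```python
# Hypothetical rows; the second one introduces an extra column "3".
rows = [
    {"id": "cod 10", "1": "0", "2": "2"},
    {"id": "cod 18", "1": "27", "2": "18", "3": "19"},
]

headers = {}
for row in rows:
    for key in row:
        headers[key] = None  # re-assigning an existing key keeps its position

fieldnames = list(headers)
print(fieldnames)  # ['id', '1', '2', '3']
```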
Finally, with a complete set of headers and a list of dict rows, I just pass those to csv.DictWriter.
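By default, DictWriter writes an empty field for any header that is missing from a row. If an explicit marker such as NaN is preferred, the `restval` parameter of `csv.DictWriter` sets the fill value. The sketch below uses made-up rows and an in-memory buffer in place of the output file:

```python
import csv
import io

# Hypothetical rows; the first one has no value for column "3".
rows = [
    {"id": "cod 10", "1": "0", "2": "2"},
    {"id": "cod 18", "1": "27", "2": "18", "3": "19"},
]
fieldnames = ["id", "1", "2", "3"]

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="NaN")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Here the first data row is written as `cod 10,0,2,NaN` rather than ending in an empty field.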