I am trying to parse a .log
-file from MTurk in to a .csv
-file with rows and columns using Python. My data looks like:
P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;P:,15605,#E5DC22,800,9;R:,16108,7,f,NaN,Correct;P:,17115,GREEN,100,9;R:,17548,7,y,NaN,Correct;P:,18552,#E5DC22,100,9;R:,18972,7,f,NaN,Correct;P:,19979,GREEN,800,9;R:,20379,7,y,NaN,Correct;P:,21387,#E5DC22,800,9;R:,21733,7,f,NaN,Correct;P:,22740,RED,100,9;R:,23139,7,y,NaN,False;P:,24147,BLUE,100,9;R:,24547,7,f,NaN,False;P:,25555,RED,800,9;R:,26043,7,b,NaN,Correct;P:,27051,BLUE,800,9;
Currently, I have this, which puts everything in to columns:
import pandas as pd
from pandas import read_table
log_file = '3BF51CHDTWYBE3LE8DZRA0R5AFGH0H.log'
df = read_table(log_file, sep=';|,', header=None, engine='python')
Like this:
P|14142|GREEN|800|9|R|14597|7|y|NaN|Correct|P|15605|#E5DC22|800|9|R|16108
However, I cannot seem to be able to break this in to multiple rows, so that it would look more like this:
P|14142|GREEN|800|9|R|14597|7|y|NaN|Correct|
|P|15605|#E5DC22|800|9|R|16108
ie where all the "P"s would be in one column, where all the colors would be in another, the "r"s, etc..
You can use
In [16]: df = pd.read_csv('log.txt', lineterminator=';', sep=':', header=None)
to read the file (say, 'log.txt'
) assuming that the lines are terminated by ';'
, and the separators within lines are ':'
.
Unfortunately, your second column will now contain commas, which you'd like to logically separate. You can split the commas along the lines, and concatenate the result to the first column:
In [17]: pd.concat([df[[0]], df[1].str.split(',').apply(pd.Series).iloc[:, 1: 6]], axis=1)
Out[17]:
0 1 2 3 4 5
0 P 14142 GREEN 800 9 NaN
1 R 14597 7 y NaN Correct
2 P 15605 #E5DC22 800 9 NaN
3 R 16108 7 f NaN Correct
4 P 17115 GREEN 100 9 NaN
5 R 17548 7 y NaN Correct
6 P 18552 #E5DC22 100 9 NaN
7 R 18972 7 f NaN Correct
8 P 19979 GREEN 800 9 NaN
9 R 20379 7 y NaN Correct
10 P 21387 #E5DC22 800 9 NaN
11 R 21733 7 f NaN Correct
12 P 22740 RED 100 9 NaN
13 R 23139 7 y NaN False
14 P 24147 BLUE 100 9 NaN
15 R 24547 7 f NaN False
16 P 25555 RED 800 9 NaN
17 R 26043 7 b NaN Correct
18 P 27051 BLUE 800 9 NaN
19 \n\n NaN NaN NaN NaN NaN
Another faster solution:
import pandas as pd
import numpy as np
import io
temp=u"""P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;P:,15605,#E5DC22,800,9;R:,16108,7,f,NaN,Correct;P:,17115,GREEN,100,9;R:,17548,7,y,NaN,Correct;P:,18552,#E5DC22,100,9;R:,18972,7,f,NaN,Correct;P:,19979,GREEN,800,9;R:,20379,7,y,NaN,Correct;P:,21387,#E5DC22,800,9;R:,21733,7,f,NaN,Correct;P:,22740,RED,100,9;R:,23139,7,y,NaN,False;P:,24147,BLUE,100,9;R:,24547,7,f,NaN,False;P:,25555,RED,800,9;R:,26043,7,b,NaN,Correct;P:,27051,BLUE,800,9;"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=':', header=None, lineterminator=';')
print (df)
0 1
0 P ,14142,GREEN,800,9
1 R ,14597,7,y,NaN,Correct
2 P ,15605,#E5DC22,800,9
3 R ,16108,7,f,NaN,Correct
4 P ,17115,GREEN,100,9
5 R ,17548,7,y,NaN,Correct
6 P ,18552,#E5DC22,100,9
7 R ,18972,7,f,NaN,Correct
8 P ,19979,GREEN,800,9
9 R ,20379,7,y,NaN,Correct
10 P ,21387,#E5DC22,800,9
11 R ,21733,7,f,NaN,Correct
12 P ,22740,RED,100,9
13 R ,23139,7,y,NaN,False
14 P ,24147,BLUE,100,9
15 R ,24547,7,f,NaN,False
16 P ,25555,RED,800,9
17 R ,26043,7,b,NaN,Correct
18 P ,27051,BLUE,800,9
First set_index
index from first column, then remove triling ,
by strip
and create DataFrame
by str.split
. Last need add 0
to column names and reset_index
:
df1 = df.set_index(0)[1].str.strip(',').str.split(',', expand=True)
df1.columns = df1.columns + 1
df1.reset_index(inplace=True)
print (df1)
0 1 2 3 4 5
0 P 14142 GREEN 800 9 None
1 R 14597 7 y NaN Correct
2 P 15605 #E5DC22 800 9 None
3 R 16108 7 f NaN Correct
4 P 17115 GREEN 100 9 None
5 R 17548 7 y NaN Correct
6 P 18552 #E5DC22 100 9 None
7 R 18972 7 f NaN Correct
8 P 19979 GREEN 800 9 None
9 R 20379 7 y NaN Correct
10 P 21387 #E5DC22 800 9 None
11 R 21733 7 f NaN Correct
12 P 22740 RED 100 9 None
13 R 23139 7 y NaN False
14 P 24147 BLUE 100 9 None
15 R 24547 7 f NaN False
16 P 25555 RED 800 9 None
17 R 26043 7 b NaN Correct
18 P 27051 BLUE 800 9 None
Timings :
def jez(df):
df1 = df.set_index(0)[1].str.strip(',').str.split(',', expand=True)
df1.columns = df1.columns + 1
df1.reset_index(inplace=True)
return (df1)
print (jez(df))
In [310]: %timeit (pd.concat([df[[0]], df[1].str.split(',').apply(pd.Series).iloc[:, 1: 6]], axis=1))
100 loops, best of 3: 4.85 ms per loop
In [311]: %timeit (jez(df))
1000 loops, best of 3: 1.61 ms per loop
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.