How to read an unstructured CSV in pandas
I have a messy CSV file (only the extension is csv). When I open this file in MS Excel with ; as the delimiter, it looks like the dummy sample below.
I investigated this file and found the following:
Question:
How can I read this table in pandas so that all existing columns (headers) are kept and the blank columns are named with consecutive numbers, taking the variable row length into account?
In fact, for analysis I want to take 8 cell values at a time from the header-less columns, again and again, until the row is exhausted.
NB: I have tried usecols, names, skiprows, sep, etc. in read_csv, but with no success.
I have added sample input and expected output (the formatting is poor, but pandas.read_clipboard() should work).
INPUT
car_id car_type entry_gate entry_time(ms) exit_gate exit_time(ms) traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] )
24 Bus 25 4300 26 48520 118.47 2.678999 509552.78 5039855.59 10.074 0.429 0.2012 0 509552.97 5039855.57 10.0821 0.3853 0.2183 20
25 Car 25 20 26 45900 113.91 2.482746 509583.7 5039848.78 4.5344 -0.1649 0.2398 0 509583.77 5039848.71
26 Car - - - - 109.68 8.859805 509572.75 5039862.75 4.0734 -0.7164 -0.1066 0 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17 5039855.55 10.0886 0.2636 0.2356 40
27 Car - - - - 119.84 3.075936 509582.73 5039862.78 1.191 0.5247 0.0005 0 509582.71 5039862.78 1.2015 0.5322
28 Car - - - - 129.64 4.347466 509591.07 5039862.9 1.6473 0.1987 -0.0033 0 509591.04 5039862.89 1.6513 0.2015 -0.0036 20
Expected OUTPUT (dataframe)
car_id car_type entry_gate entry_time(ms) exit_gate exit_time(ms) traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] 1 2 3 4 5 6 7 8 9 10 11 12
24 Bus 25 4300 26 48520 118.47 2.678999 509552.78 5039855.59 10.074 0.429 0.2012 0 509552.97 5039855.57 10.0821 0.3853 0.2183 20
25 Car 25 20 26 45900 113.91 2.482746 509583.7 5039848.78 4.5344 -0.1649 0.2398 0 509583.77 5039848.71
26 Car - - - - 109.68 8.859805 509572.75 5039862.75 4.0734 -0.7164 -0.1066 0 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17 5039855.55 10.0886 0.2636 0.2356 40
27 Car - - - - 119.84 3.075936 509582.73 5039862.78 1.191 0.5247 0.0005 0 509582.71 5039862.78 1.2015 0.5322
28 Car - - - - 129.64 4.347466 509591.07 5039862.9 1.6473 0.1987 -0.0033 0 509591.04 5039862.89 1.6513 0.2015 -0.0036 20
Preprocessing
The function get_names() opens the file and finds the maximum length of the split rows. Then I read the first row and add the missing column names up to that maximum length.
The last value of the first row is ), so I remove it with firstline[:-1], and then I add the missing columns as a range: rng = range(1, m - lenfirstline + 2). The +2 is needed because the range starts at 1 and its end is exclusive (+1), and because removing ) frees one more column (+1).
Then you can use read_csv, skip the first line, and pass the output of get_names() as names.
import pandas as pd
import csv

# preprocessing
def get_names():
    with open('test/file.txt', 'r') as csvfile:
        reader = csv.reader(csvfile)
        num = []
        for i, row in enumerate(reader):
            if i == 0:
                firstline = ''.join(row).split()
                lenfirstline = len(firstline)
                # print(firstline, lenfirstline)
            num.append(len(''.join(row).split()))
        m = max(num)
        rng = range(1, m - lenfirstline + 2)
        # remove ) and append the numbered names (range must be a list in Python 3)
        rng = firstline[:-1] + list(rng)
        return rng

# names is the list returned from the function
df = pd.read_csv('test/file.txt', sep=r"\s+", names=get_names(), index_col=[0], skiprows=1)

# temporarily display 10 rows and 30 columns
with pd.option_context('display.max_rows', 10, 'display.max_columns', 30):
    print(df)
car_type entry_gate entry_time(ms) exit_gate exit_time(ms) \
car_id
24 Bus 25 4300 26 48520
25 Car 25 20 26 45900
26 Car - - - -
27 Car - - - -
28 Car - - - -
traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] \
car_id
24 118.47 2.678999 509552.78 5039855.59
25 113.91 2.482746 509583.70 5039848.78
26 109.68 8.859805 509572.75 5039862.75
27 119.84 3.075936 509582.73 5039862.78
28 129.64 4.347466 509591.07 5039862.90
speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] \
car_id
24 10.0740 0.4290 0.2012 0
25 4.5344 -0.1649 0.2398 0
26 4.0734 -0.7164 -0.1066 0
27 1.1910 0.5247 0.0005 0
28 1.6473 0.1987 -0.0033 0
1 2 3 4 5 6 7 \
car_id
24 509552.97 5039855.57 10.0821 0.3853 0.2183 20 NaN
25 509583.77 5039848.71 NaN NaN NaN NaN NaN
26 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17
27 509582.71 5039862.78 1.2015 0.5322 NaN NaN NaN
28 509591.04 5039862.89 1.6513 0.2015 -0.0036 20 NaN
8 9 10 11 12
car_id
24 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN
26 5039855.55 10.0886 0.2636 0.2356 40
27 NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN
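The same idea can be tried without a file on disk. The following is only a minimal sketch: the text in raw is a shortened, hypothetical stand-in for the sample, and build_names is an illustrative helper name, not part of the answer above. It drops the trailing ) from the header and numbers the remaining columns up to the width of the longest row.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the file: a header ending in a stray ")"
# and rows of different lengths (values shortened from the sample).
raw = """car_id car_type x[m] y[m] )
24 Bus 509552.78 5039855.59 509552.97 5039855.57
25 Car 509583.7 5039848.78
26 Car 509572.75 5039862.75 509572.67 5039862.76 509553.17 5039855.55
"""

def build_names(text):
    """Return the header names (without the trailing ')') plus 1..k for extra columns."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = rows[0][:-1]                 # drop the stray ")"
    widest = max(len(r) for r in rows)    # the longest row decides the column count
    extra = list(range(1, widest - len(header) + 1))
    return header + extra

names = build_names(raw)
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", names=names, index_col=0, skiprows=1)
print(df)
```

Rows shorter than the longest one are padded with NaN, which matches the output shown above.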
Postprocessing
First you have to estimate the maximum number of columns N. I know the real number is 26, so I set N = 30. read_csv with the parameter names=range(N) returns all-NaN columns for the difference between the estimated and the real number of columns.
After dropping those columns, you can select the column names from the first row where the values are not NaN (I remove the last value, ), with [:-1]): df1.loc[0].dropna()[:-1]. Then you can append consecutive numbers, starting from 1, for the columns that have no name in the first row. Finally, the first row is removed by taking a subset of df1.
# set N larger than the estimated number of columns
N = 30
df1 = pd.read_csv('test/file.txt', sep=r"\s+", names=range(N))
df1 = df1.dropna(axis=1, how='all')  # drop columns with all NaN
# names from the first row, without the trailing )
header = df1.loc[0].dropna().iloc[:-1]
# Series.append and .ix were removed from pandas, so build a plain list and use .loc
df1.columns = list(header) + list(range(1, len(df1.columns) - len(header) + 1))
# remove the first line with incomplete column names
df1 = df1.loc[1:]
print(df1.head())
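The question also asks to take the header-less values in fixed-size groups until a row is exhausted. A rough sketch of that step follows. It assumes the numbered columns carry integer labels, as in the frames above. The small frame and the helper name chunk_row are hypothetical, and the group size is a parameter (2 here only to keep the demo short).

```python
import pandas as pd

def chunk_row(values, size):
    """Split one row's non-missing values into consecutive groups of `size`."""
    vals = [v for v in values if pd.notna(v)]
    return [vals[i:i + size] for i in range(0, len(vals), size)]

# Hypothetical frame: two named columns followed by numbered, header-less ones.
df = pd.DataFrame(
    [[24, "Bus", 1.0, 2.0, 3.0, 4.0, None, None],
     [26, "Car", 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]],
    columns=["car_id", "car_type", 1, 2, 3, 4, 5, 6],
)

# Numbered columns are the ones whose label is an integer.
numbered = [c for c in df.columns if isinstance(c, int)]

# Map each car_id to its list of fixed-size groups; NaN padding is skipped.
groups = {
    row["car_id"]: chunk_row(row.loc[numbered], size=2)
    for _, row in df.iterrows()
}
print(groups)
```

If a row's length is not a multiple of the group size, the last group is simply shorter. Whether to keep or discard such a partial group depends on the analysis.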