
How to read unstructured csv in pandas

I have a messy CSV file (only the extension is csv). When I open it in MS Excel with ; as the delimiter, it looks like the dummy sample below:

I investigated the file and found the following:

  1. Some columns have names and others do not.
  2. The row length is variable, but each row ends with a newline character that starts the next row.

Question:

How can I read this table in pandas so that all existing column headers are kept and the blank column headers are filled with consecutive numbers, taking the variable row length into account?

In fact, I want to take 8 cell values at a time from the header-less columns, again and again until the row is exhausted, for analysis.

NB: I have tried usecols, names, skiprows, sep etc. in read_csv, but with no success.


EDIT

Added sample input and expected output (the formatting is poor, but pandas.read_clipboard() should work).

INPUT

car_id   car_type    entry_gate  entry_time(ms)  exit_gate   exit_time(ms)   traveled_dist(m)    avg_speed(m/s)  trajectory(x[m]    y[m]    speed[m/s]  a_tangential[ms-2]  a_lateral[ms-2] timestamp[ms]   )                                           
24   Bus    25  4300    26  48520   118.47  2.678999    509552.78   5039855.59  10.074  0.429   0.2012  0   509552.97   5039855.57  10.0821 0.3853  0.2183  20                      
25   Car    25  20  26  45900   113.91  2.482746    509583.7    5039848.78  4.5344  -0.1649 0.2398  0   509583.77   5039848.71                                      
26   Car     -   -   -   -  109.68  8.859805    509572.75   5039862.75  4.0734  -0.7164 -0.1066 0   509572.67   5039862.76  4.0593  -0.7021 -0.1141 20  509553.17   5039855.55  10.0886 0.2636  0.2356  40
27   Car     -   -   -   -  119.84  3.075936    509582.73   5039862.78  1.191   0.5247  0.0005  0   509582.71   5039862.78  1.2015  0.5322                              
28   Car     -   -   -   -  129.64  4.347466    509591.07   5039862.9   1.6473  0.1987  -0.0033 0   509591.04   5039862.89  1.6513  0.2015  -0.0036 20  

Expected OUTPUT (dataframe)

car_id   car_type    entry_gate  entry_time(ms)  exit_gate   exit_time(ms)   traveled_dist(m)    avg_speed(m/s)  trajectory(x[m]    y[m]    speed[m/s]  a_tangential[ms-2]  a_lateral[ms-2] timestamp[ms]   1   2   3   4   5   6   7   8   9   10  11  12
24   Bus    25  4300    26  48520   118.47  2.678999    509552.78   5039855.59  10.074  0.429   0.2012  0   509552.97   5039855.57  10.0821 0.3853  0.2183  20                      
25   Car    25  20  26  45900   113.91  2.482746    509583.7    5039848.78  4.5344  -0.1649 0.2398  0   509583.77   5039848.71                                      
26   Car     -   -   -   -  109.68  8.859805    509572.75   5039862.75  4.0734  -0.7164 -0.1066 0   509572.67   5039862.76  4.0593  -0.7021 -0.1141 20  509553.17   5039855.55  10.0886 0.2636  0.2356  40
27   Car     -   -   -   -  119.84  3.075936    509582.73   5039862.78  1.191   0.5247  0.0005  0   509582.71   5039862.78  1.2015  0.5322                              
28   Car     -   -   -   -  129.64  4.347466    509591.07   5039862.9   1.6473  0.1987  -0.0033 0   509591.04   5039862.89  1.6513  0.2015  -0.0036 20      

Preprocessing

The function get_names() opens the file and finds the maximum number of fields over all whitespace-split rows. It then takes the names from the first row and appends numbers for the columns that have no name, up to that maximum length.

The last value of the first row is ), so I remove it with firstline[:-1]. The missing column names are then generated by rng = range(1, m - lenfirstline + 2). The + 2 is needed because the range starts at 1, so its exclusive end must be one higher, and because the dropped ) no longer counts as a column name.
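The arithmetic behind that range can be checked by hand against the sample input. The sketch below hard-codes the two counts taken from the sample (15 header tokens including the trailing ), and 26 fields in the widest row, the one for car_id 26); the variable names named and missing are only illustrative.

```python
lenfirstline = 15   # header tokens in the sample, including the trailing ')'
m = 26              # fields in the widest data row (car_id 26)

named = lenfirstline - 1     # 14 real column names once ')' is dropped
missing = m - named          # 12 columns that have no name

# Same expression as in get_names(): the stop value is exclusive,
# so it has to be missing + 1 == m - lenfirstline + 2.
rng = list(range(1, m - lenfirstline + 2))
print(rng)
```

With these counts the range yields one number per unnamed column, which matches the 1 to 12 headers in the expected output.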

Then you can use read_csv, skip the first line with skiprows=1, and pass the output of get_names() as names.

import csv

import pandas as pd


# preprocessing
def get_names():
    with open('test/file.txt', 'r') as csvfile:
        reader = csv.reader(csvfile)
        num = []
        for i, row in enumerate(reader):
            if i == 0:
                firstline = ''.join(row).split()
                lenfirstline = len(firstline)
            # number of whitespace-separated fields in this row
            num.append(len(''.join(row).split()))
        m = max(num)
        # numbers for the columns that have no name
        rng = list(range(1, m - lenfirstline + 2))
        # drop the trailing ')' and append the numbered names
        return firstline[:-1] + rng

# names is the list returned by the function
df = pd.read_csv('test/file.txt', sep=r"\s+", names=get_names(), index_col=[0], skiprows=1)
# temporarily display 10 rows and 30 columns
with pd.option_context('display.max_rows', 10, 'display.max_columns', 30):
    print(df)

       car_type entry_gate entry_time(ms) exit_gate exit_time(ms)  \
car_id                                                              
24          Bus         25           4300        26         48520   
25          Car         25             20        26         45900   
26          Car          -              -         -             -   
27          Car          -              -         -             -   
28          Car          -              -         -             -   

        traveled_dist(m)  avg_speed(m/s)  trajectory(x[m]        y[m]  \
car_id                                                                  
24                118.47        2.678999        509552.78  5039855.59   
25                113.91        2.482746        509583.70  5039848.78   
26                109.68        8.859805        509572.75  5039862.75   
27                119.84        3.075936        509582.73  5039862.78   
28                129.64        4.347466        509591.07  5039862.90   

        speed[m/s]  a_tangential[ms-2]  a_lateral[ms-2]  timestamp[ms]  \
car_id                                                                   
24         10.0740              0.4290           0.2012              0   
25          4.5344             -0.1649           0.2398              0   
26          4.0734             -0.7164          -0.1066              0   
27          1.1910              0.5247           0.0005              0   
28          1.6473              0.1987          -0.0033              0   

                1           2        3       4       5   6          7  \
car_id                                                                  
24      509552.97  5039855.57  10.0821  0.3853  0.2183  20        NaN   
25      509583.77  5039848.71      NaN     NaN     NaN NaN        NaN   
26      509572.67  5039862.76   4.0593 -0.7021 -0.1141  20  509553.17   
27      509582.71  5039862.78   1.2015  0.5322     NaN NaN        NaN   
28      509591.04  5039862.89   1.6513  0.2015 -0.0036  20        NaN   

                 8        9      10      11  12  
car_id                                           
24             NaN      NaN     NaN     NaN NaN  
25             NaN      NaN     NaN     NaN NaN  
26      5039855.55  10.0886  0.2636  0.2356  40  
27             NaN      NaN     NaN     NaN NaN  
28             NaN      NaN     NaN     NaN NaN  
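To try the preprocessing idea without a file on disk, here is a minimal sketch that reads from an in-memory string. The SAMPLE text is a shortened, made-up version of the input above, and build_names is a hypothetical helper that mirrors get_names() but splits lines directly instead of going through csv.reader.

```python
import io

import pandas as pd

# Shortened, made-up sample in the same layout as the question's input.
SAMPLE = """\
car_id car_type avg_speed(m/s) trajectory(x[m] y[m] timestamp[ms] )
24 Bus 2.67 509552.78 5039855.59 0 509552.97 5039855.57 20
25 Car 2.48 509583.70 5039848.78 0
26 Car 8.85 509572.75 5039862.75 0 509572.67 5039862.76 20 509553.17 5039855.55 40
"""


def build_names(text):
    """Return the header names plus 1, 2, 3, ... for the unnamed columns."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = rows[0][:-1]                       # drop the trailing ')'
    widest = max(len(row) for row in rows)      # field count of the longest row
    extra = list(range(1, widest - len(header) + 1))
    return header + extra


names = build_names(SAMPLE)
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", names=names,
                 index_col=0, skiprows=1)
print(df)
```

Rows shorter than the name list are padded with NaN by read_csv, which is what makes the variable row length harmless here.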

Postprocessing

First you have to estimate the maximum number of columns N. I know the real number is 26, so I estimate N = 30. read_csv with the parameter names=range(N) then returns all-NaN columns for the difference between the estimated and the real number of columns.

After dropping those columns, you can select the non-NaN values of the first row as column names (I remove the last value, ), with [:-1]): df1.loc[0].dropna()[:-1]. Then you can append a new Series numbered from 1 up to the count of columns that are left without a name. Finally, the first row is removed by taking a subset of df1.

# set N higher than the estimated number of columns
N = 30

df1 = pd.read_csv('test/file.txt', sep=r"\s+", names=range(N))

df1 = df1.dropna(axis=1, how='all')  # drop columns that are entirely NaN

# names from the first row without the trailing ')', then 1, 2, ... for the rest
header = df1.loc[0].dropna().iloc[:-1].tolist()
df1.columns = header + list(range(1, len(df1.columns) - len(header) + 1))

# remove the first row, which holds the incomplete column names
df1 = df1.loc[1:]
print(df1.head())
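If guessing N is inconvenient, one possible alternative (not part of the approach above) is to split the lines yourself and let the DataFrame constructor pad the short rows. The sketch below assumes whitespace-separated fields, a header ending in a lone ), and at least one data row that is as wide as the header; read_ragged and SAMPLE are hypothetical names used only for illustration.

```python
import io

import pandas as pd

# Shortened, made-up sample in the same layout as the question's input.
SAMPLE = """\
car_id car_type avg_speed(m/s) trajectory(x[m] y[m] timestamp[ms] )
24 Bus 2.67 509552.78 5039855.59 0 509552.97 5039855.57 20
25 Car 2.48 509583.70 5039848.78 0
26 Car 8.85 509572.75 5039862.75 0 509572.67 5039862.76 20 509553.17 5039855.55 40
"""


def read_ragged(handle):
    """Read whitespace-separated rows of unequal length into a DataFrame."""
    rows = [line.split() for line in handle if line.strip()]
    header = rows[0][:-1]                  # drop the trailing ')'
    body = pd.DataFrame(rows[1:])          # short rows are padded with None
    # Assumes the widest data row has at least len(header) fields.
    extra = list(range(1, body.shape[1] - len(header) + 1))
    body.columns = header + extra
    # All values are still strings; convert only the numbered columns.
    for col in extra:
        body[col] = pd.to_numeric(body[col])
    return body


df1 = read_ragged(io.StringIO(SAMPLE))
print(df1)
```

The named columns are left as strings in this sketch because some of them hold placeholders such as - in the real data; they can be converted separately once it is clear how those placeholders should be treated.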
