Python: Cleaning the data from the csv file that is mismatched

I am new to programming. I am trying to clean the data from a csv file for a further project extension. The csv file given as input is really messy, and I only need particular portions of it.

Input file is as follows: [image]

Required format: [image]

So far I am trying to extract the values for 'OBSERVATION_MODE', 'LON' and 'LAT', but I am not sure how to append the values that come later in the file. This is what I have tried so far:

import csv
import re

file = csv.reader(open('1mvn_kp_iuvs_2018_01_r01.tab.csv','r'))
mode = []
lat = []
for row in file:
    for values in row:
        if 'OBSERVATION_MODE' in values:
            print("\n")
            mode.append(row)

        if re.search('LAT', values):
            lat.append(row)

print(mode)
print(lat)

I am pretty sure the logic I am working on is not useful at all. Can someone please give me a better overview of how to do this? I tried searching online too, but I found nothing about cleaning data when both the rows and the columns are mismatched. Any help is appreciated!

Thank You

Link to the input csv file and expected output: https://drive.google.com/open?id=1LJxxbDcplSCPVWKnOC3usx7kZE8dS32H

Please note that the expected output 'Cleaned_sample.xlsx' is something I have manually generated and I want a similar output using python programming.

You should try the read_csv function from pandas. It has multiple keywords, such as header, skiprows or usecols, that let you set where your data starts in the file, skip a number of rows, use only specific columns, and so on. The returned object is a DataFrame, which is similar to an array and gives you easy access to your data.

Example based on the file you provided:

import pandas
data = pandas.read_csv(path_to_file, skiprows=44, skipfooter=378, engine='python', dtype='float')

This call will read the first set of data that you have in your file. To access the fifth value in the ALTITUDE column, you can for example do

data['ALTITUDE'][4]

Then you would have to use a similar read_csv call with different values of skiprows and skipfooter to access the other sets of data. Once you have them all, a call to concatenate from numpy should allow you to have all your data as one array. Be careful with the headers.
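A minimal sketch of that block-by-block approach, assuming a hypothetical miniature file held in a string (SAMPLE) rather than the real one. The offsets in BLOCKS are made up for this sample and would have to be found by inspecting the actual file. It uses nrows instead of skipfooter to bound each block, and pandas.concat rather than numpy's concatenate, because concat keeps the column headers:

```python
import io
import pandas as pd

# Hypothetical miniature file: two data blocks, each preceded by a
# comment line and its own header row. The real file's layout differs.
SAMPLE = """# block 1
ALTITUDE,CO2
80,1.0
90,2.0
# block 2
ALTITUDE,CO2
100,3.0
110,4.0
"""

# (skiprows, nrows) for each block, determined by hand for this sample.
BLOCKS = [(1, 2), (5, 2)]

frames = [
    pd.read_csv(io.StringIO(SAMPLE), skiprows=skip, nrows=n)
    for skip, n in BLOCKS
]

# ignore_index=True renumbers the rows 0..N-1 across all blocks.
data = pd.concat(frames, ignore_index=True)
print(data)
```

Because every block is read with its own header row, the frames share the same column names and stack cleanly.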

Note that skiprows also accepts a callable such as a lambda expression. This may let you call read_csv() only once, if you can find a pattern that identifies the rows you do not want.
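As a rough illustration of the callable form, here is a sketch on a hypothetical miniature file (SAMPLE, not the real data). The callable passed to skiprows only receives the row index, so this sketch first scans the text once to collect the indices of the unwanted rows; the "#" and "ALTITUDE" markers are assumptions about this sample only:

```python
import io
import pandas as pd

# Hypothetical miniature file: comment lines and a repeated header row.
SAMPLE = """# block 1
ALTITUDE,CO2
80,1.0
90,2.0
# block 2
ALTITUDE,CO2
100,3.0
110,4.0
"""

# First pass: mark comment lines and every header row after the first.
header_seen = False
unwanted = set()
for i, line in enumerate(SAMPLE.splitlines()):
    if line.startswith("#"):
        unwanted.add(i)
    elif line.startswith("ALTITUDE"):
        if header_seen:
            unwanted.add(i)
        header_seen = True

# Second pass: a single read_csv call; the lambda is called with each row index.
data = pd.read_csv(io.StringIO(SAMPLE), skiprows=lambda i: i in unwanted)
print(data)
```

The first row that is not skipped is used as the header, so the result has one header and all data rows in a single frame.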

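Try this:

import pandas as pd

path = '1mvn_kp_iuvs_2018_01_r01.tab (1).csv'

# The first 18 rows hold key/value pairs: key in column 0, value in column 2.
df1 = pd.read_csv(path, header=None, nrows=18)
# Wrap each value in a list so the dict becomes a one-row DataFrame.
dic = {key: [value] for key, value in df1.set_index(0)[2].to_dict().items()}
df1 = pd.DataFrame(dic)

# The data table, with its own header row, starts after the first 19 lines.
df2 = pd.read_csv(path, skiprows=19)

# Repeat the one-row frame once per data row, then join the two side by side.
df1 = pd.concat([df1] * len(df2), ignore_index=True)
df3 = pd.concat([df1, df2], axis=1)
print(df3.head())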

Note: I removed a few rows from the original file so that it matches your sample.

Input:

[image]

Output:

         LAT  LAT_MSO  LOCAL_TIME       LON  LON_MSO  MARS_SEASON_LS  \
0 -19.512522      NaN    8.083779  6.757075      NaN       108.81089   
1 -19.512522      NaN    8.083779  6.757075      NaN       108.81089   
2 -19.512522      NaN    8.083779  6.757075      NaN       108.81089   
3 -19.512522      NaN    8.083779  6.757075      NaN       108.81089   
4 -19.512522      NaN    8.083779  6.757075      NaN       108.81089   

   MARS_SUN_DIST  ORBIT_NUMBER      SC_ALT  SC_GEO_LAT     ...       \
0       1.630965        6330.0  203.680405  -17.815445     ...        
1       1.630965        6330.0  203.680405  -17.815445     ...        
2       1.630965        6330.0  203.680405  -17.815445     ...        
3       1.630965        6330.0  203.680405  -17.815445     ...        
4       1.630965        6330.0  203.680405  -17.815445     ...        

   SUBSOL_GEO_LON        SZA  ALTITUDE          CO2         CO2+            O  \
0         65.4571  71.790688        80  -9999999000  -9999999000  -9999999000   
1         65.4571  71.790688        90  -9999999000  -9999999000  -9999999000   
2         65.4571  71.790688       100  -9999999000  -9999999000  -9999999000   
3         65.4571  71.790688       110  -9999999000  -9999999000  -9999999000   
4         65.4571  71.790688       120  -9999999000  -9999999000    551467460   

            N2            C            N            H  
0  -9999999000  -9999999000  -9999999000  -9999999000  
1  -9999999000  -9999999000  -9999999000  -9999999000  
2  -9999999000  -9999999000  -9999999000  -9999999000  
3  -9999999000  -9999999000  -9999999000  -9999999000  
4    710188930  -9999999000  -9999999000  -9999999000  
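The idea of repeating the header values on every data row can also be tried without the original file. The following is only a sketch on a hypothetical miniature input (SAMPLE); the "key,=,value" layout, the key names and N_META are assumptions made for illustration. It uses DataFrame.assign, which broadcasts a scalar to every row, as an alternative to concatenating repeated copies of a one-row frame:

```python
import io
import pandas as pd

# Hypothetical miniature of the assumed layout: key/value rows first
# (key in column 0, value in column 2), then a header row and data rows.
SAMPLE = """ORBIT_NUMBER,=,6330
LAT,=,-19.5
LON,=,6.75
ALTITUDE,CO2
80,1.0
90,2.0
"""
N_META = 3  # number of key/value rows in this sample

# Read only the key/value rows and turn them into a {key: value} dict.
meta = pd.read_csv(io.StringIO(SAMPLE), header=None, nrows=N_META)
meta_values = meta.set_index(0)[2].to_dict()

# Read the data table, which starts right after the key/value rows.
table = pd.read_csv(io.StringIO(SAMPLE), skiprows=N_META)

# assign() adds one column per key, with the value repeated on every row;
# the column selection then puts the metadata columns first.
combined = table.assign(**meta_values)
combined = combined[list(meta_values) + list(table.columns)]
print(combined)
```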
