I am new to programming. I am trying to clean the data from a CSV file for a later extension of a project. The input CSV file is really messy and I only need particular portions of it.
I am trying to extract the values for 'OBSERVATION_MODE', 'LON' and 'LAT', but I am not sure how to append the subsequent values. This is what I have tried so far:
import csv
import re

file = csv.reader(open('1mvn_kp_iuvs_2018_01_r01.tab.csv','r'))
mode = []
lat = []
for row in file:
    for values in row:
        if 'OBSERVATION_MODE' in values:
            print("\n")
            mode.append(row)
        if re.search('LAT', values):
            lat.append(row)
print(mode)
print(lat)
I am pretty sure the logic I am working on is not useful at all. Can someone please give me a better approach? I tried searching online too, but I found nothing about cleaning data when both the rows and the columns are mismatched. Any help is appreciated!
Thank You
The link to the input CSV file and the expected output is https://drive.google.com/open?id=1LJxxbDcplSCPVWKnOC3usx7kZE8dS32H
Please note that the expected output, 'Cleaned_sample.xlsx', is something I generated manually, and I want to produce a similar output using Python.
You should try the read_csv function from pandas. It has multiple keywords, such as header, skiprows and usecols, that let you set where your data starts in the file, skip a number of rows, use only specific columns, and so on. The returned object is a DataFrame, which behaves much like a labelled array, so you can easily access your data.
Example based on the file you provided:
import pandas
data = pandas.read_csv(path_to_file, skiprows=44, skipfooter=378, engine='python', dtype='float')
This call will read the first set of data in your file. To access the fifth value in the ALTITUDE column, you can, for example, do:
data['ALTITUDE'][4]
Then you would have to use a similar read_csv call with different values of skiprows and skipfooter to access the other sets of data. Once you have them all, a call to concatenate from numpy (or pandas.concat, which keeps the column names) should allow you to have all your data as one array. Be careful with the headers, so that a header line is not read as a data row.
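As a rough illustration of reading each set with its own skiprows and skipfooter and then stitching the pieces together, here is a minimal sketch. The SAMPLE text and the BLOCKS offsets are made up for the example and are not taken from the real file, and pandas.concat is used in place of numpy's concatenate so that the column names survive.

```python
import io
import pandas as pd

# Hypothetical miniature file: two blocks, each with two metadata
# lines, one header line and two data rows (not the real MAVEN file).
SAMPLE = """\
OBSERVATION_MODE,,periapse
LAT,,-19.5
ALTITUDE,CO2,O
80,1.0,2.0
90,3.0,4.0
OBSERVATION_MODE,,periapse
LAT,,-20.1
ALTITUDE,CO2,O
80,5.0,6.0
90,7.0,8.0
"""

# (skiprows, skipfooter) for each block, counted by hand for SAMPLE.
BLOCKS = [(2, 5), (7, 0)]

# One read_csv call per block; each call treats the first line that is
# not skipped as the header, so header lines never end up as data rows.
frames = [
    pd.read_csv(io.StringIO(SAMPLE), skiprows=skip, skipfooter=foot, engine="python")
    for skip, foot in BLOCKS
]

# Stack the blocks vertically and renumber the rows.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

For the real file the offsets would have to be worked out from its actual layout, which is the tedious part of this approach.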
Note that skiprows also accepts a callable, such as a lambda expression, which is evaluated on each row index. This may allow you to call read_csv() only once if you find a pattern that identifies the rows you do not want.
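One possible way to use a callable in skiprows is sketched below. Because the callable only receives the row index and not the row content, the sketch first scans the text to decide which line numbers to keep. The SAMPLE text and the is_number helper are hypothetical, and the rule "keep the first ALTITUDE header plus every line whose first field is numeric" is an assumption about the pattern, not something taken from the real file.

```python
import io
import pandas as pd

# Hypothetical miniature file with two blocks (not the real MAVEN file).
SAMPLE = """\
OBSERVATION_MODE,,periapse
LON,,6.7
ALTITUDE,CO2,O
80,1.0,2.0
90,3.0,4.0
OBSERVATION_MODE,,periapse
LON,,7.2
ALTITUDE,CO2,O
80,5.0,6.0
90,7.0,8.0
"""

def is_number(text):
    """Return True if text parses as a float (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

lines = SAMPLE.splitlines()

# Keep the first header line and every line whose first field is numeric.
header_index = next(i for i, line in enumerate(lines) if line.startswith("ALTITUDE"))
keep = {header_index} | {
    i for i, line in enumerate(lines) if is_number(line.split(",")[0])
}

# The lambda is called with each row index; True means "skip this row".
data = pd.read_csv(io.StringIO(SAMPLE), skiprows=lambda i: i not in keep)
print(data)
```

This trades the hand-counted offsets for one extra pass over the file.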
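Try this:
import pandas as pd

# Read the first 18 rows: keys are in column 0, values in column 2
df1 = pd.read_csv('1mvn_kp_iuvs_2018_01_r01.tab (1).csv', header=None, nrows=18)
dic = df1.set_index(0)[2].to_dict()
# Wrap each value in a list so the dict becomes a one-row DataFrame
for u, v in dic.items():
    dic[u] = [v]
df1 = pd.DataFrame(dic)
# Read the data table that follows
df2 = pd.read_csv('1mvn_kp_iuvs_2018_01_r01.tab (1).csv', skiprows=19)
# Repeat the one-row frame once per data row, then join side by side
df1 = pd.concat([df1] * len(df2), ignore_index=True)
df3 = pd.concat([df1, df2], axis=1)
print(df3.head())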
Note: I removed a few rows from the original file so that it matches your sample.
Output:
          LAT  LAT_MSO  LOCAL_TIME       LON  LON_MSO  MARS_SEASON_LS  \
0  -19.512522      NaN    8.083779  6.757075      NaN       108.81089
1  -19.512522      NaN    8.083779  6.757075      NaN       108.81089
2  -19.512522      NaN    8.083779  6.757075      NaN       108.81089
3  -19.512522      NaN    8.083779  6.757075      NaN       108.81089
4  -19.512522      NaN    8.083779  6.757075      NaN       108.81089

   MARS_SUN_DIST  ORBIT_NUMBER      SC_ALT  SC_GEO_LAT  ...  \
0       1.630965        6330.0  203.680405  -17.815445  ...
1       1.630965        6330.0  203.680405  -17.815445  ...
2       1.630965        6330.0  203.680405  -17.815445  ...
3       1.630965        6330.0  203.680405  -17.815445  ...
4       1.630965        6330.0  203.680405  -17.815445  ...

   SUBSOL_GEO_LON        SZA  ALTITUDE          CO2         CO2+            O  \
0         65.4571  71.790688        80  -9999999000  -9999999000  -9999999000
1         65.4571  71.790688        90  -9999999000  -9999999000  -9999999000
2         65.4571  71.790688       100  -9999999000  -9999999000  -9999999000
3         65.4571  71.790688       110  -9999999000  -9999999000  -9999999000
4         65.4571  71.790688       120  -9999999000  -9999999000    551467460

            N2            C            N            H
0  -9999999000  -9999999000  -9999999000  -9999999000
1  -9999999000  -9999999000  -9999999000  -9999999000
2  -9999999000  -9999999000  -9999999000  -9999999000
3  -9999999000  -9999999000  -9999999000  -9999999000
4    710188930  -9999999000  -9999999000  -9999999000
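For anyone who wants to try the idea without the original file, here is a small self-contained sketch of the same technique: read the key/value rows, turn them into constant columns, and attach them to the data table. The SAMPLE text, the N_META count and the values in it are invented for illustration. Instead of repeating a one-row frame, it passes the table's index to the DataFrame constructor so that each scalar is repeated for every row, which is a slight variation on the code above.

```python
import io
import pandas as pd

# Hypothetical miniature file: three key/value lines, then a table.
SAMPLE = """\
LAT,,-19.5
LON,,6.7
ORBIT_NUMBER,,6330
ALTITUDE,CO2,O
80,1.0,2.0
90,3.0,4.0
"""
N_META = 3  # number of key/value lines in SAMPLE (assumed known)

# Key/value part: key in column 0, value in column 2.
meta = pd.read_csv(io.StringIO(SAMPLE), header=None, nrows=N_META)
meta_values = meta.set_index(0)[2].to_dict()

# Table part: the first line after the metadata is the header.
table = pd.read_csv(io.StringIO(SAMPLE), skiprows=N_META)

# A dict of scalars plus an explicit index yields one constant column
# per key, with as many rows as the table.
meta_frame = pd.DataFrame(meta_values, index=table.index)
result = pd.concat([meta_frame, table], axis=1)
print(result)
```

If the metadata values are not all numeric, the value column is read as strings, so a conversion step may be needed afterwards.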