This is my Python code for reading a CSV file into a list of strings, from which I want to extract specific fields.
def readHdFile(filename):
    with hdfs.open_input_file(filename) as inf:
        read_data = inf.read().decode('utf-8').splitlines()
        print("output #1 {}".format(read_data))
    return read_data

list_data = readHdFile('test.csv')
for data in list_data:
    print("output #2 {}".format(data))
The code runs without errors and prints the following:
output #1 ['date,values,realtime_start,realtime_end,state,id,title,frequency_short,units_short,seasonal_adjustment_short', '2007-01-01,6.3,2021-02-16,2021-02-16,Alaska,LAUST020000000000003A,Unemployment Rate in Alaska,A,%,NSA', '2008-01-01,6.7,2021-02-16,2021-02-16,Alaska,LAUST020000000000003A,Unemployment Rate in Alaska,A,%,NSA']
output #2 date,values,realtime_start,realtime_end,state,id,title,frequency_short,units_short,seasonal_adjustment_short
output #2 2007-01-01,6.3,2021-02-16,2021-02-16,Alaska,LAUST020000000000003A,Unemployment Rate in Alaska,A,%,NSA
output #2 2008-01-01,6.7,2021-02-16,2021-02-16,Alaska,LAUST020000000000003A,Unemployment Rate in Alaska,A,%,NSA
But I need to remove two specific columns, realtime_start and realtime_end, from the read_data object. As output #1 shows, each string in the read_data list is comma-separated, but I have no idea how to remove the realtime_start and realtime_end columns from each string.
I am not 100% sure of the data format you are using, but you could try replacing the final loop (your last 2 lines of code) with this:
for line in list_data:
    outline = line.split(',')
    new_line = ','.join(outline[:2]) + ',' + ','.join(outline[4:])
    print("output #2 {}".format(new_line))
realtime_start and realtime_end are the 3rd and 4th columns of your CSV (list indices 2 and 3), so you can just build a new line without those fields.
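If the column positions might change, one option is to look the columns up by name in the header row instead of hard-coding the indices. The following is only a sketch using the standard-library csv module; the helper name drop_columns is made up for illustration, and the sample list_data simply mirrors the output shown in the question rather than reading from HDFS.

```python
import csv
import io

def drop_columns(lines, unwanted):
    """Return CSV lines with the named columns removed (hypothetical helper)."""
    reader = csv.reader(lines)          # csv.reader accepts any iterable of strings
    header = next(reader)               # first line is assumed to be the header
    # Indices of the columns we want to keep, in their original order
    keep = [i for i, name in enumerate(header) if name not in unwanted]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])
    return out.getvalue().splitlines()

# Sample data copied from the question's output; in practice this would be
# the list returned by readHdFile('test.csv').
list_data = [
    "date,values,realtime_start,realtime_end,state,id,title,frequency_short,units_short,seasonal_adjustment_short",
    "2007-01-01,6.3,2021-02-16,2021-02-16,Alaska,LAUST020000000000003A,Unemployment Rate in Alaska,A,%,NSA",
]

cleaned = drop_columns(list_data, {"realtime_start", "realtime_end"})
for line in cleaned:
    print("output #2 {}".format(line))
```

Compared with a plain split(','), the csv module also copes with quoted fields that contain commas, which may matter if a title ever includes one.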
Of course, this is a quick-and-dirty solution; using pandas may be cleaner and more robust to new datasets.
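For reference, a pandas version might look roughly like the sketch below. It assumes pandas is installed and that the lines have already been read into a list as in the question; the function name drop_columns_pandas is hypothetical. Reading everything as strings (dtype=str, keep_default_na=False) is a choice made here so that values such as 6.3 are written back unchanged.

```python
import io

import pandas as pd

def drop_columns_pandas(lines, unwanted):
    """Drop the named columns from a list of CSV lines (hypothetical helper)."""
    # Re-join the lines so pandas can parse them as one CSV document.
    # dtype=str and keep_default_na=False keep every field as plain text.
    df = pd.read_csv(io.StringIO("\n".join(lines)), dtype=str, keep_default_na=False)
    df = df.drop(columns=list(unwanted))
    # index=False avoids writing the row index as an extra column
    return df.to_csv(index=False).splitlines()

# Sample data copied from the question's output
list_data = [
    "date,values,realtime_start,realtime_end,state,id,title,frequency_short,units_short,seasonal_adjustment_short",
    "2007-01-01,6.3,2021-02-16,2021-02-16,Alaska,LAUST020000000000003A,Unemployment Rate in Alaska,A,%,NSA",
    "2008-01-01,6.7,2021-02-16,2021-02-16,Alaska,LAUST020000000000003A,Unemployment Rate in Alaska,A,%,NSA",
]

result = drop_columns_pandas(list_data, ["realtime_start", "realtime_end"])
for line in result:
    print("output #2 {}".format(line))
```

If the data is going to be analysed further, it is probably simpler to keep working with the DataFrame itself rather than converting back to a list of strings.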