I'm using pandas to combine some CSV files.
I need to create multiple new columns based on one of the columns, in this case network. Currently, as you can see, I have a bunch of apply calls to create the columns, and this is hurting performance. Is there a way to create multiple columns with just one apply, or a more performant way to achieve the same result?
import time
import pandas as pd

# size, countries, getLowIp, getHighIp, getIpInt and toElasticJson are defined elsewhere (not shown)
dataReader = pd.read_csv('file.csv', usecols=['geoname_id', 'country_iso_code', 'country_name', 'subdivision_1_name', 'subdivision_2_name', 'city_name', 'time_zone'])
rangeReader = pd.read_csv('file2.csv', chunksize=size, usecols=['geoname_id', 'network'])
start_time = time.time()
output = open("result.csv", 'w')
# remove countries we don't care about
dataReader = dataReader[dataReader.country_iso_code.isin(countries)]
addHeader = True
for i, chunk in enumerate(rangeReader):
    print("Loop", i, "took %s seconds" % (time.time() - start_time))
    chunk = pd.merge(chunk, dataReader, on="geoname_id", how="inner")
    chunk['low_ip'] = chunk.apply(lambda row: getLowIp(row), axis=1)
    chunk['high_ip'] = chunk.apply(lambda row: getHighIp(row), axis=1)
    chunk['low_ip_int'] = chunk.apply(lambda row: getIpInt(row['low_ip']), axis=1)
    chunk['high_ip_int'] = chunk.apply(lambda row: getIpInt(row['high_ip']), axis=1)
    chunk['json'] = chunk.apply(lambda row: toElasticJson(row), axis=1)
    # write the header only for the first chunk
    chunk.to_csv(output, header=addHeader, sep='|')
    addHeader = False
output.close()
After some digging I found out that the function passed to apply should return a pd.Series, like
return pd.Series((low, high, int(IPAddress(low)), int(IPAddress(high))))
and the assignment would be
chunk[['low_ip', 'high_ip', 'low_ip_int', 'high_ip_int']] = chunk.apply(lambda row: getAllIpFields(row['network']), axis=1)
This way I merged the four IP applies into one, which saved some time.
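Since getAllIpFields itself is not shown, here is a rough sketch of how the whole thing could look end to end. It rests on a few assumptions: the network column holds IPv4 CIDR strings such as "10.0.0.0/24", the low and high IPs are the first and last addresses of that range, and the standard-library ipaddress module is swapped in for the IPAddress class used above. The name get_all_ip_fields and the sample data are made up for illustration. Because only one column is needed, the sketch also drops the row-wise apply and builds the four columns in a single pass over network, which is usually cheaper than creating a Series per row.

```python
import ipaddress

import pandas as pd


def get_all_ip_fields(network):
    """Return (low_ip, high_ip, low_ip_int, high_ip_int) for a CIDR string.

    Hypothetical stand-in for getAllIpFields; assumes low/high are the
    first and last addresses of the network.
    """
    net = ipaddress.ip_network(network, strict=False)
    low, high = net[0], net[-1]
    return str(low), str(high), int(low), int(high)


# Made-up sample standing in for one merged chunk.
chunk = pd.DataFrame(
    {"geoname_id": [1, 2], "network": ["10.0.0.0/24", "192.168.1.0/30"]}
)

new_cols = ["low_ip", "high_ip", "low_ip_int", "high_ip_int"]

# One pass over the 'network' column; each tuple becomes one row of the
# new frame, aligned to the chunk's index.
fields = pd.DataFrame(
    [get_all_ip_fields(n) for n in chunk["network"]],
    columns=new_cols,
    index=chunk.index,
)
chunk = pd.concat([chunk, fields], axis=1)
print(chunk)
```

If you prefer to stay with apply, returning a plain tuple and passing result_type="expand" to chunk.apply(..., axis=1) should give the same four columns without constructing a pd.Series inside the function.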