I have R code that uses data.table to merge rows with the same FirstName and LastName, selecting the max value for specified columns (e.g. Score1, Score2, Score3). The input/output is as follows:
Input:
FirstName LastName Score1 Score2 Score3
fn1       ln1      41     88     50
fn1       ln1      72     66     77
fn1       ln1      69     72     90
fn2       ln2      80     81     73
fn2       ln2      59     91     66
fn3       ln3      75     80     66
Output:
FirstName LastName Score1 Score2 Score3
fn1       ln1      72     88     90
fn2       ln2      80     91     73
fn3       ln3      75     80     66
Now I want to migrate the R program to Spark. How can I do this in Python?
As suggested by durbachit, you'll want to use pandas.
import pandas as pd
df = pd.read_csv(**your file here**)
max_df = df.groupby(by=['FirstName','LastName']).max()
And max_df will be your desired output (call .reset_index() if you want FirstName and LastName back as ordinary columns rather than an index). Docs for pandas groupby.
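To make that concrete, here is a minimal self-contained sketch using the sample data from the question, with the DataFrame built inline instead of read from a file:

```python
import pandas as pd

# Sample data from the question, constructed inline instead of read from a CSV
df = pd.DataFrame({
    'FirstName': ['fn1', 'fn1', 'fn1', 'fn2', 'fn2', 'fn3'],
    'LastName':  ['ln1', 'ln1', 'ln1', 'ln2', 'ln2', 'ln3'],
    'Score1':    [41, 72, 69, 80, 59, 75],
    'Score2':    [88, 66, 72, 81, 91, 80],
    'Score3':    [50, 77, 90, 73, 66, 66],
})

# Group duplicate names and keep the per-column maximum;
# reset_index() turns the group keys back into ordinary columns
max_df = df.groupby(['FirstName', 'LastName']).max().reset_index()
print(max_df)
```

This prints one row per (FirstName, LastName) pair with the maximum of each score column, matching the desired output above.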
Here is a way to do it with Python's built-in packages:
import csv
from collections import OrderedDict

newdata = OrderedDict()
with open('test.csv', 'r', newline='') as testr:
    testreader = csv.reader(testr)
    header = next(testreader)  # set the header row aside
    for row in testreader:
        name = row[0] + '-' + row[1]
        # Convert scores to int so max() compares numerically, not as strings
        scores = [int(s) for s in row[2:]]
        if name in newdata:
            newdata[name] = [max(existdata, readdata)
                             for existdata, readdata in zip(newdata[name], scores)]
        else:
            newdata[name] = scores
with open('newdata.csv', 'w', newline='') as testw:
    testwriter = csv.writer(testw)
    testwriter.writerow(header)
    for name, data in newdata.items():
        testwriter.writerow(name.split('-') + data)
Note that the scores must be converted to numbers before comparing; taking max() of the raw CSV strings would compare lexicographically (e.g. '9' > '72').
The best way to do it is with pandas, will post in a while.
Here is the pandas code:
import pandas
readfile = pandas.read_csv('test.csv')  # assuming your CSV is in the same directory as the program
print(readfile)
max_readfile = readfile.groupby(['FirstName', 'LastName']).max()
print(max_readfile)
Output:
@user2241910 quickly posted the pandas solution :)
import pandas
rows = pandas.read_csv('rows.csv', delim_whitespace=True)
rows.groupby(['FirstName', 'LastName']).max()
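Since the question actually asks about Spark, the same groupby-max can be expressed with the PySpark DataFrame API. A minimal sketch, assuming a local SparkSession and that test.csv has a header row (the app name is arbitrary):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('max-scores').getOrCreate()

# Read the CSV with a header row and let Spark infer numeric column types
df = spark.read.csv('test.csv', header=True, inferSchema=True)

# Group by name and take the per-column maximum,
# aliasing each aggregate back to the original column name
max_df = df.groupBy('FirstName', 'LastName').agg(
    F.max('Score1').alias('Score1'),
    F.max('Score2').alias('Score2'),
    F.max('Score3').alias('Score3'),
)
max_df.show()
```

The logic mirrors the pandas version one-to-one: groupBy replaces groupby, and F.max with alias replaces the implicit column-wise max.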