
How to get rows with the max value by using Python?

I have R code that uses data.table to merge the rows with the same FirstName and LastName, keeping the max value for specified columns (e.g. Score1, Score2, Score3). The input and output are as follows:

Input:

FirstName LastName Score1 Score2 Score3
fn1       ln1      41      88     50
fn1       ln1      72      66     77
fn1       ln1      69      72     90
fn2       ln2      80      81     73
fn2       ln2      59      91     66
fn3       ln3      75      80     66

Output:

FirstName LastName Score1 Score2 Score3
fn1       ln1      72      88     90
fn2       ln2      80      91     73
fn3       ln3      75      80     66

Now I want to migrate the R program to Spark. How can I do this by using Python?

As suggested by durbachit, you'll want to use pandas.

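import pandas as pd
df = pd.read_csv('your_file.csv')  # placeholder: put the path to your file here
# as_index=False keeps FirstName and LastName as regular columns instead of the index
max_df = df.groupby(by=['FirstName', 'LastName'], as_index=False).max()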

And max_df will be your desired output. Docs for pandas groupby.
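To check this against the sample in the question, here is a minimal sketch that reads the table from an in-memory string instead of a file. The names SAMPLE and score_cols are only for illustration, and it assumes the data is whitespace-delimited as shown in the question. Selecting score_cols explicitly restricts the max to the specified columns.

```python
import io

import pandas as pd

# The sample input from the question, whitespace-delimited.
SAMPLE = """\
FirstName LastName Score1 Score2 Score3
fn1       ln1      41      88     50
fn1       ln1      72      66     77
fn1       ln1      69      72     90
fn2       ln2      80      81     73
fn2       ln2      59      91     66
fn3       ln3      75      80     66
"""

# sep=r"\s+" splits each line on runs of whitespace.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")

# Illustrative name: the columns whose per-group max we want.
score_cols = ["Score1", "Score2", "Score3"]

# as_index=False keeps FirstName/LastName as ordinary columns.
max_df = df.groupby(["FirstName", "LastName"], as_index=False)[score_cols].max()
print(max_df)
```

If there are other columns in the real file that should not be aggregated, leaving them out of score_cols drops them from the result.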

Here is a way to do it with Python's built-in packages:

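import csv
from collections import OrderedDict

newdata = OrderedDict()
# In Python 3 the csv module expects text mode with newline=''
with open('test.csv', newline='') as testr:
    testreader = csv.reader(testr)
    header = next(testreader)  # keep the header row out of the comparison
    for row in testreader:
        name = (row[0], row[1])  # tuple key, safe even if a name contains '-'
        # assumes the scores are integers; compare numbers, not strings
        scores = [int(value) for value in row[2:]]
        if name in newdata:
            newdata[name] = [max(existdata, readdata) for existdata, readdata in zip(newdata[name], scores)]
        else:
            newdata[name] = scores

with open('newdata.csv', 'w', newline='') as testw:
    testwriter = csv.writer(testw)
    testwriter.writerow(header)
    for name, data in newdata.items():
        testwriter.writerow(list(name) + data)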
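The same idea can be tried without any files. Below is a small sketch on in-memory rows using the sample from the question. The function name max_per_group and its n_keys parameter are made up for this example, and it assumes every score is an integer. The scores are converted with int() because values read by csv are strings, and strings compare lexicographically (for example, max('9', '72') gives '9').

```python
from collections import OrderedDict


def max_per_group(rows, n_keys=2):
    """Collapse rows that share their first n_keys fields,
    keeping the maximum of every remaining (integer) column."""
    best = OrderedDict()
    for row in rows:
        key = tuple(row[:n_keys])
        scores = [int(value) for value in row[n_keys:]]  # numeric comparison
        if key in best:
            best[key] = [max(old, new) for old, new in zip(best[key], scores)]
        else:
            best[key] = scores
    return [list(key) + scores for key, scores in best.items()]


# Sample rows from the question, as csv.reader would yield them (all strings).
rows = [
    ["fn1", "ln1", "41", "88", "50"],
    ["fn1", "ln1", "72", "66", "77"],
    ["fn1", "ln1", "69", "72", "90"],
    ["fn2", "ln2", "80", "81", "73"],
    ["fn2", "ln2", "59", "91", "66"],
    ["fn3", "ln3", "75", "80", "66"],
]

result = max_per_group(rows)
for line in result:
    print(line)
```

Groups come out in the order their key first appears in the input.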

The best way to do it is with pandas; I will post that in a while.

EDIT:

Here is the pandas code:

import pandas
readfile = pandas.read_csv('test.csv')  # assuming your CSV is in the same directory as the program
print(readfile)

[image: actual data]

max_readfile = readfile.groupby(['FirstName', 'LastName']).max()
print(max_readfile)

output:

[image: pandas output]

** @user2241910 quickly posted the pandas solution :)

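import pandas
# sep=r'\s+' splits on runs of whitespace; it replaces delim_whitespace=True,
# which is deprecated in recent pandas versions
rows = pandas.read_csv('rows.csv', sep=r'\s+')
rows.groupby(['FirstName', 'LastName']).max()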

