python pandas merge/vlookup tables

I wrote the Python code below to merge two tables. This could be done in Excel with VLOOKUP, but I want to automate the process for a larger data set. However, the output is too big and contains all columns from both tables. I only want to use the second table, df_pos, to look up some columns. Could you take a look and tell me whether my code is an efficient or feasible way to perform this task?

Thank you!

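import pandas as pd

def weighted(mwa="mwa.csv",mwa2="mwa.csv",output="WeightedMWA.csv"):
    df=pd.read_csv(mwa, thousands=",")
    # .str.replace strips the '+' inside each keyword; Series.replace only matches whole values
    df['Keyword']=df['Keyword'].str.replace('+','',regex=False)
    # read the second table from the mwa2 argument rather than a hard-coded path
    df_pos=pd.read_csv(mwa2, thousands=",")
    df_pos['Keyword']=df_pos['Keyword'].str.replace('+','',regex=False)
    sumImp=df_pos['Impr.'].sum()
    # transform('sum') keeps the row index of df_pos, so the per-keyword sums align row by row
    sumPos=df_pos.groupby(by=['Keyword'])['Avg. Pos.'].transform('sum')
    df_pos['WeightedPos']=sumPos/sumImp
    # left join on Keyword; this still brings in every column of df_pos
    mergedDF=pd.merge(left=df, right=df_pos, how="left", left_on="Keyword",right_on="Keyword")
    mergedDF.to_csv(output)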

You haven't given us enough information. You are outputting the merged dataframe, but you have not told us which columns are necessary in the output. Ideally, you'd want to keep only the columns that are needed in the output plus the columns needed for the merge.
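A minimal sketch of that idea, assuming the only column wanted from the lookup table is WeightedPos. The sample frames and their values are made up for illustration and stand in for the two CSV files:

```python
import pandas as pd

# Made-up sample data standing in for the two CSV files.
df = pd.DataFrame({"Keyword": ["shoes", "boots", "hats"],
                   "Clicks": [10, 5, 7]})
df_pos = pd.DataFrame({"Keyword": ["shoes", "boots"],
                       "Impr.": [100, 50],
                       "Avg. Pos.": [1.5, 2.0],
                       "WeightedPos": [0.010, 0.013]})

# Keep the merge key plus only the columns to look up, and one row per key,
# so the left join cannot duplicate rows of df.
lookup = df_pos[["Keyword", "WeightedPos"]].drop_duplicates("Keyword")

# A left join keeps every row of df; keywords missing from lookup get NaN,
# much like a VLOOKUP that finds no match.
merged = df.merge(lookup, how="left", on="Keyword")
print(merged)
```

The result has only the columns of df plus WeightedPos, rather than every column from both tables.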

You can limit the columns that you import via the read_csv function and its usecols parameter. The documentation says:

 usecols : array-like, default None Return a subset of the columns. All elements in this array must either be positional (ie integer indices into the document columns) or strings that correspond to column names provided either by the user in `names` or inferred from the document header row(s). For example, a valid `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter results in much faster parsing time and lower memory usage. 
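For example, here is a small sketch of usecols with an in-memory CSV. The column names follow the question, while the data and the extra Clicks column are invented for illustration:

```python
import io
import pandas as pd

# Invented CSV content; in practice this would be the path to mwa.csv.
csv_text = '''Keyword,Impr.,Avg. Pos.,Clicks
shoes,"1,200",1.5,10
boots,300,2.0,5
'''

# Only the listed columns are parsed; Clicks is never loaded.
df_pos = pd.read_csv(io.StringIO(csv_text), thousands=",",
                     usecols=["Keyword", "Impr.", "Avg. Pos."])
print(df_pos.columns.tolist())
```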

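If you are only using df_pos to look up data from another frame, use the field in df_pos as an index into the frame you're looking up data from, i.e. datasourcematrix.loc[df_pos.LOOKUPCOLUMNNAME] (with datasourcematrix indexed by the lookup key), or, if you don't have a column name, select the column by position, e.g. datasourcematrix.loc[df_pos.iloc[:, 5]]. Note that the .ix indexer has been removed from current pandas, so use .loc or .iloc instead. This can be much easier and faster than a merge.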
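One way this index-based lookup could look, as a sketch: the frame name datasource and all of its values are hypothetical, and the approach assumes the lookup keys are unique in the source frame.

```python
import pandas as pd

# Hypothetical source frame, indexed by the lookup key.
datasource = pd.DataFrame({"Keyword": ["shoes", "boots", "hats"],
                           "WeightedPos": [0.010, 0.013, 0.020]}).set_index("Keyword")
df_pos = pd.DataFrame({"Keyword": ["boots", "shoes", "boots"]})

# .loc with a list of labels returns the matching rows in that order;
# it raises KeyError if a label is missing from the index.
looked_up = datasource.loc[df_pos["Keyword"]]

# Series.map looks each keyword up in the index of the mapped Series and
# yields NaN for missing keys instead of raising.
df_pos["WeightedPos"] = df_pos["Keyword"].map(datasource["WeightedPos"])
print(df_pos)
```

Series.map is the closer analogue of VLOOKUP here, since it adds one looked-up column without bringing along the rest of the source frame.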
