Optimal filling of pandas DataFrame column by matching values in another DataFrame

Basically I have two DataFrames and want to re-populate a column of the second by matching three row elements of the second with the first. To give an example, I have columns "Period" and "Hub" in both DataFrames. For each row in the second DataFrame, I want to take the value of Index (which is a date) and "Product"/"Hub" (which are strings) and find the row in the first DataFrame that has these same values (in the corresponding columns) and return the value of "Period" from that row. I can then populate my row in the second DataFrame with this value.

I have a working solution, but it's really slow. Perhaps this is just due to the size of the DataFrames (approx. 100k rows) but it's taking over an hour to process!

Anyway, this is my working solution - any tips on how to speed it up would be really appreciated!

def selectData(hub, product):
    qry = "Hub=='"+hub+"' and Product=='"+product+"'"
    return data_1.query(qry)

data_2["Period"] = data_2.apply(lambda row: selectData(row["Hub"], row["Product"]).ix[row.index, "Period"], axis=1)

EDIT: I should note that the first DataFrame is guaranteed to have a unique result to my query but contains a larger set of data than that required to populate data_2

EDIT2: I just realised this is not in fact a working solution...

If I understand your problem correctly, you want to merge these two DataFrames on the index (the date), Product and Hub, and take Period from data_1.

I don't have your data, so I tested it on random integers. It should be very fast with 100k rows in data_1.

import numpy as np
import pandas as pd

# data_1 is the larger DataFrame
n = 100000
data_1 = pd.DataFrame(np.random.randint(1,100,(n,3)), 
                      index=pd.date_range('2012-01-01',periods=n, freq='1Min').date,
                      columns=['Product', 'Hub', 'Period']).drop_duplicates()
data_1.index.name='Date'

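# data_2 is a random subset of data_1's rows, without the Period column
# (.iloc selects rows by position; the old .ix indexer has been removed from pandas)
data_2 = data_1.iloc[np.random.randint(0, len(data_1), 1000)][['Product', 'Hub']]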

To join on index + some columns, you can do this:

data_3 = data_2.reset_index().merge(data_1.reset_index(), on=['Date','Product','Hub'], how='left')
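To get the matched values back into data_2 itself, something along these lines may work. This is only a sketch on made-up toy data: the product, hub and period values are hypothetical, and the `validate="many_to_one"` argument is my own addition. It makes pandas raise an error if data_1 has more than one row for a given (Date, Product, Hub), which is the uniqueness the question says is guaranteed.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real data_1 / data_2.
dates_1 = pd.to_datetime(["2012-01-01", "2012-01-01", "2012-01-02", "2012-01-02"])
data_1 = pd.DataFrame(
    {
        "Product": ["gas", "gas", "power", "power"],
        "Hub": ["NBP", "TTF", "NBP", "TTF"],
        "Period": ["Q1", "Q2", "Q3", "Q4"],
    },
    index=pd.Index(dates_1, name="Date"),
)

dates_2 = pd.to_datetime(["2012-01-02", "2012-01-01", "2012-01-01"])
data_2 = pd.DataFrame(
    {"Product": ["power", "gas", "gas"], "Hub": ["TTF", "NBP", "XXX"]},
    index=pd.Index(dates_2, name="Date"),
)

keys = ["Date", "Product", "Hub"]

# Left merge: keeps every row of data_2, in its original order.
# validate="many_to_one" raises if the keys are not unique in data_1.
merged = data_2.reset_index().merge(
    data_1.reset_index()[keys + ["Period"]],
    on=keys,
    how="left",
    validate="many_to_one",
)

# The merge result has a fresh integer index, so assign by position
# rather than relying on index alignment.
data_2["Period"] = merged["Period"].to_numpy()
print(data_2)
```

Rows of data_2 with no match in data_1 (the last row in the toy data) end up with a missing value in Period rather than being dropped.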
