Pandas index match multiple dataframes with multiple criteria

Question

I am trying to make Python read an Excel file, create dataframes from .csv files that are named after rows in the Excel file, look up data in those .csv files, and paste it into the Excel file.

The Excel file has been loaded into a dataframe, which has the following layout:

     Name  Location      Date Check_2  ...  Volume  VWAP  $Volume  Trades
0  Orange  New York  20200501       X  ...     NaN   NaN      NaN     NaN
1   Apple     Minsk  20200504       X  ...     NaN   NaN      NaN     NaN

The empty cells should be filled with data looked up in the .csv files, which have been loaded into a dataframe that looks like this:

      Name      Date      Time  Open  High   Low  Close  Volume  VWAP  Trades
4   Orange  20200501  15:30:00  5.50  5.85  5.45   5.70    1500  5.73      95
5   Orange  20200501  17:00:00  5.65  5.70  5.50   5.60    1600  5.65      54
6   Orange  20200501  20:00:00  5.80  5.85  5.45   5.81    1700  5.73      41
7   Orange  20200501  22:00:00  5.60  5.84  5.45   5.65    1800  5.75      62
8   Orange  20200504  15:30:00  5.40  5.87  5.45   5.75    1900  5.83      84
9   Orange  20200504  17:00:00  5.50  5.75  5.40   5.60    2000  5.72      94
10  Orange  20200504  20:00:00  5.80  5.83  5.44   5.50    2100  5.40      55
11  Orange  20200504  22:00:00  5.40  5.58  5.37   5.80    2200  5.35      87
0    Apple  20200504  15:30:00  3.70  3.97  3.65   3.75    1000  3.60      55
1    Apple  20200504  17:00:00  3.65  3.95  3.50   3.80    1200  3.65      68
2    Apple  20200504  20:00:00  3.50  3.83  3.44   3.60    1300  3.73      71
3    Apple  20200504  22:00:00  3.55  3.58  3.35   3.57    1400  3.78      81
4    Apple  20200505  15:30:00  3.50  3.85  3.45   3.70    1500  3.73      95
5    Apple  20200505  17:00:00  3.65  3.70  3.50   3.60    1600  3.65      54
6    Apple  20200505  20:00:00  3.80  3.85  3.45   3.81    1700  3.73      41
7    Apple  20200505  22:00:00  3.60  3.84  3.45   3.65    1800  3.75      62

I have been struggling with filling these empty cells, because I haven't been able to find a way to properly index match across these 2 dataframes.

For example, trying:

intradayho = rdf2[(rdf2['Time']=='15:30:00')]
indexopen = pd.DataFrame(intradayho['Open'])

rdf1['Open'] = rdf1.Date.map(intradayho.set_index('Date')['Open'].to_dict())
print("Open prices rdf1")
print(rdf1['Open'])

produces:

Open prices rdf1
0    5.5
1    3.7

However, this only takes 'Date' into account, so it copies the open value that matches the 'Date' column alone rather than 'Name' and 'Date' together. That is a problem, because both of those values need to be matched.
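As a minimal sketch of matching on both keys while staying close to the dictionary lookup above: indexing the 15:30 rows on ['Name', 'Date'] gives a dictionary keyed by (Name, Date) tuples. The frames below are small stand-ins built from the sample data in the question, and open_lookup is a name introduced only for this illustration.

```python
import pandas as pd

# Stand-ins for rdf1 (the Excel rows) and rdf2 (the concatenated .csv data),
# reduced to the columns needed for the lookup.
rdf1 = pd.DataFrame({'Name': ['Orange', 'Apple'],
                     'Date': [20200501, 20200504]})
rdf2 = pd.DataFrame({
    'Name': ['Orange', 'Orange', 'Apple', 'Apple'],
    'Date': [20200501, 20200504, 20200504, 20200505],
    'Time': ['15:30:00'] * 4,
    'Open': [5.50, 5.40, 3.70, 3.50],
})

intradayho = rdf2[rdf2['Time'] == '15:30:00']

# Setting both columns as the index makes the dictionary keys (Name, Date) tuples.
open_lookup = intradayho.set_index(['Name', 'Date'])['Open'].to_dict()

# Look up each row of rdf1 by its (Name, Date) pair; unmatched pairs become NaN.
rdf1['Open'] = [open_lookup.get(key, float('nan'))
                for key in zip(rdf1['Name'], rdf1['Date'])]
print(rdf1)
```

With a Date-only dictionary, Orange and Apple share the key 20200504 and the later row silently overwrites the earlier one, which is why the output can look right by accident.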

This code also produces the following warning:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

but when I try to fix that with

rdf1.loc[rdf1['Open']] = rdf1.Date.map(intradayho.set_index('Date')['Open'].to_dict())

I get an error:

KeyError: "None of [Float64Index([nan, nan], dtype='float64')] are in the [index]"

Which doesn't make sense to me, because the whole goal is to fill these 'NaN' values.
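For context, .loc[...] with a single argument treats the values it receives as row labels, so passing rdf1['Open'] asks pandas for rows labelled NaN, hence the KeyError. The sketch below shows the shape .loc expects (a row selector plus a column name), and takes an explicit .copy() of the filtered frame, which is the usual way to avoid the warning. The 'Pear' row and the new_open values are hypothetical and exist only to make the example self-contained.

```python
import numpy as np
import pandas as pd

# Stand-in for the Excel data; the 'Pear' row is hypothetical.
entries = pd.DataFrame({
    'Name': ['Orange', 'Apple', 'Pear'],
    'Date': [20200501, 20200504, 20200504],
    'Check_2': ['X', 'X', '-'],
    'Open': [np.nan, np.nan, np.nan],
})

# .copy() makes rdf1 an independent frame instead of a slice of entries,
# so later assignments do not raise SettingWithCopyWarning.
rdf1 = entries[entries['Check_2'].str.contains('X')].copy()

# Hypothetical looked-up values, aligned on rdf1's index.
new_open = pd.Series([5.5, 3.7], index=rdf1.index)

# Row selector (boolean mask) first, column label second.
missing = rdf1['Open'].isna()
rdf1.loc[missing, 'Open'] = new_open[missing]
print(rdf1)
```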

Can someone here help me out with making something that can index match data from these dataframes and write it to the Excel file?

Thanks!

EDIT: Forgot to post my full code, here it is:

import pandas as pd
import os

#Opening 'Test Tracker.xlsx' to find entities to download
TEST = pd.ExcelFile(r"Trackers\TEST Tracker.xlsx")
df1 = TEST.parse("Entries")

values1 = df1[['Name', 'Location', 'Date', 'Check_2',
               'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', '$Volume',
               'Trades']]

#Searching for every row that contains the value 'X' in the column 'Check_2'
rdf1 = values1[values1.Check_2.str.contains("X")]

#Printing dataframe to check
print("First Dataframe")
print(rdf1)

#creating a list for the class objects
Fruits = []

#Generating dataframes from classobjects
for idx, rows in rdf1.iterrows():
    fle = os.path.join('Entities', rows.Location, rows.Name, 'TwoHours.csv')
    col_list = ['Name', 'Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'Trades']
    df3 = pd.read_csv(fle, usecols=col_list, sep=";")
    Fruits.append(df3)

rdf2 = pd.concat(Fruits)
print("Printing Full Data Frame")
print(rdf2)

intradayh = rdf2[(rdf2['Time']>'15:30:00') & (rdf2['Time']<'22:00:00')]
intradayho = rdf2[(rdf2['Time']=='15:30:00')]
indexopen = pd.DataFrame(intradayho['Open'])
intradayhc = rdf2[(rdf2['Time']=='22:00:00')]
indexclose = pd.DataFrame(intradayhc['Close'])

rdf1.loc[rdf1['Open']] = rdf1.Date.map(intradayho.set_index('Date')['Open'].to_dict())
print("Open prices rdf1")
print(rdf1['Open'])

EDIT: Desired output as requested in the comments:

  Name  Location      Date    Open   High   Low    close  volume  VWAP ...
0  Orange  New York  20200501  5.5    5.95  5.45    5.65   6600   5.71  ...
1   Apple     Minsk  20200504  3.7    3.83  3.35    3.57   4900   3.69 ...

I am going for a 1-to-1 match in 'Open', the max value in 'High', the min value in 'Low', a 1-to-1 match in 'Close', a sum for 'Volume' and 'Trades', an average for 'VWAP', and the value of 'Volume * VWAP' in '$Volume'.

Answer 1

In the code below, df is your dataframe with the NaN columns, df1 is your bigger dataframe with all the data, and df2 holds the aggregated result.

Use groupby together with .agg() to compute several aggregations over several columns:

df2 = df1.groupby(['Name', 'Date']).agg(
    Open=('Open', 'first'),
    Close=('Close', 'last'),
    High=('High', 'max'),
    Low=('Low', 'min'),
    Volume=('Volume', 'sum'),
    VWAP=('VWAP', 'mean'),
).reset_index()
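The question also asks for a sum of 'Trades' and for '$Volume' as Volume * VWAP, which the aggregation above leaves out. A sketch of how it might be extended, using two days of the question's sample data; the names bars and daily are introduced only for this example, and sorting by 'Time' first is an assumption so that 'first' and 'last' really pick the 15:30 and 22:00 rows.

```python
import pandas as pd

# Orange on 20200501 and Apple on 20200504, copied from the question.
bars = pd.DataFrame({
    'Name': ['Orange'] * 4 + ['Apple'] * 4,
    'Date': [20200501] * 4 + [20200504] * 4,
    'Time': ['15:30:00', '17:00:00', '20:00:00', '22:00:00'] * 2,
    'Open': [5.50, 5.65, 5.80, 5.60, 3.70, 3.65, 3.50, 3.55],
    'High': [5.85, 5.70, 5.85, 5.84, 3.97, 3.95, 3.83, 3.58],
    'Low': [5.45, 5.50, 5.45, 5.45, 3.65, 3.50, 3.44, 3.35],
    'Close': [5.70, 5.60, 5.81, 5.65, 3.75, 3.80, 3.60, 3.57],
    'Volume': [1500, 1600, 1700, 1800, 1000, 1200, 1300, 1400],
    'VWAP': [5.73, 5.65, 5.73, 5.75, 3.60, 3.65, 3.73, 3.78],
    'Trades': [95, 54, 41, 62, 55, 68, 71, 81],
})

daily = (
    bars.sort_values(['Name', 'Date', 'Time'])   # so first/last follow the clock
        .groupby(['Name', 'Date'], as_index=False)
        .agg(Open=('Open', 'first'),
             High=('High', 'max'),
             Low=('Low', 'min'),
             Close=('Close', 'last'),
             Volume=('Volume', 'sum'),
             VWAP=('VWAP', 'mean'),
             Trades=('Trades', 'sum'))
)

# '$Volume' as defined in the question: summed volume times average VWAP.
daily['$Volume'] = daily['Volume'] * daily['VWAP']
print(daily)
```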

One way is then to do an inner merge and slice the updated columns

result = pd.merge(df2, df, how='inner', on=['Name', 'Date']).iloc[:,:-4]
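The positional slice .iloc[:, :-4] in the merge above depends on the exact column order, so it may need adjusting for your sheet. A possible alternative, sketched under the assumption that the placeholder columns in df can simply be discarded, is to drop them by name before merging. The frames here are cut-down stand-ins with values from the question; agg and value_cols are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Stand-in for the Excel rows, with two of the empty placeholder columns.
df = pd.DataFrame({
    'Name': ['Orange', 'Apple'],
    'Location': ['New York', 'Minsk'],
    'Date': [20200501, 20200504],
    'Open': [np.nan, np.nan],
    'Volume': [np.nan, np.nan],
})

# Stand-in for the aggregated frame: one row per (Name, Date).
agg = pd.DataFrame({
    'Name': ['Apple', 'Apple', 'Orange', 'Orange'],
    'Date': [20200504, 20200505, 20200501, 20200504],
    'Open': [3.70, 3.50, 5.50, 5.40],
    'Volume': [4900, 6600, 6600, 8200],
})

# Columns that the aggregated frame will supply.
value_cols = [c for c in agg.columns if c not in ('Name', 'Date')]

# Drop the empty placeholders, then bring in the aggregated values.
# how='left' keeps every Excel row; validate guards against duplicate keys in agg.
result = (
    df.drop(columns=value_cols, errors='ignore')
      .merge(agg, how='left', on=['Name', 'Date'], validate='many_to_one')
)
print(result)
```

A left merge keeps the rows of df in their original order, which makes it easier to write the result back next to the existing sheet.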

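Or, after the aggregation, use combine_first on both keys and drop the rows that exist only in the aggregated data:

# Index on both keys so that rows are matched on Name and Date together
result = (df.set_index(['Name', 'Date']).combine_first(df2.set_index(['Name', 'Date'])).reset_index())
# Rows that came only from df2 have no Location, so filter on that column
result = result[result['Location'].notna()]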

result

solution1
2 ACCEPTED 2020-05-26 21:23:26