I am trying to make Python read an Excel file, create dataframes from .csv files that are named after rows in the Excel file, look up data in those .csv files, and paste it into the Excel file.
The Excel file has been loaded into a dataframe, which has the following layout:
Name Location Date Check_2 ... Volume VWAP $Volume Trades
0 Orange New York 20200501 X ... NaN NaN NaN NaN
1 Apple Minsk 20200504 X ... NaN NaN NaN NaN
The empty cells should be filled with data looked up from the .csv files, which have been loaded into a dataframe that looks like this:
Name Date Time Open High Low Close Volume VWAP Trades
4 Orange 20200501 15:30:00 5.50 5.85 5.45 5.70 1500 5.73 95
5 Orange 20200501 17:00:00 5.65 5.70 5.50 5.60 1600 5.65 54
6 Orange 20200501 20:00:00 5.80 5.85 5.45 5.81 1700 5.73 41
7 Orange 20200501 22:00:00 5.60 5.84 5.45 5.65 1800 5.75 62
8 Orange 20200504 15:30:00 5.40 5.87 5.45 5.75 1900 5.83 84
9 Orange 20200504 17:00:00 5.50 5.75 5.40 5.60 2000 5.72 94
10 Orange 20200504 20:00:00 5.80 5.83 5.44 5.50 2100 5.40 55
11 Orange 20200504 22:00:00 5.40 5.58 5.37 5.80 2200 5.35 87
0 Apple 20200504 15:30:00 3.70 3.97 3.65 3.75 1000 3.60 55
1 Apple 20200504 17:00:00 3.65 3.95 3.50 3.80 1200 3.65 68
2 Apple 20200504 20:00:00 3.50 3.83 3.44 3.60 1300 3.73 71
3 Apple 20200504 22:00:00 3.55 3.58 3.35 3.57 1400 3.78 81
4 Apple 20200505 15:30:00 3.50 3.85 3.45 3.70 1500 3.73 95
5 Apple 20200505 17:00:00 3.65 3.70 3.50 3.60 1600 3.65 54
6 Apple 20200505 20:00:00 3.80 3.85 3.45 3.81 1700 3.73 41
7 Apple 20200505 22:00:00 3.60 3.84 3.45 3.65 1800 3.75 62
I have been struggling with filling these empty cells, because I haven't been able to find a way to properly index match across these 2 dataframes.
For example, trying:
intradayho = rdf2[(rdf2['Time']=='15:30:00')]
indexopen = pd.DataFrame(intradayho['Open'])
rdf1['Open'] = rdf1.Date.map(intradayho.set_index('Date')['Open'].to_dict())
print("Open prices rdf1")
print(rdf1['Open'])
produces:
Open prices rdf1
0 5.5
1 3.7
but it only takes the date into account, so it copies the open value that matches the 'Date' column rather than matching on both 'Name' and 'Date', which is a problem because those are the 2 values that need to be matched.
Also, this code produces the following warning:
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
but when I try to fix that with
rdf1.loc[rdf1['Open']] = rdf1.Date.map(intradayho.set_index('Date')['Open'].to_dict())
I get an error:
KeyError: "None of [Float64Index([nan, nan], dtype='float64')] are in the [index]"
Which doesn't make sense to me, because the whole goal is to fill these 'NaN' values.
Can someone here help me out with making something that can index match data from these dataframes and write it to the Excel file?
Thanks!
EDIT: Forgot to post my full code, here it is:
import pandas as pd
import os
#Opening 'Test Tracker.xlsx' to find entities to download
TEST = pd.ExcelFile(r"Trackers\TEST Tracker.xlsx")
df1 = TEST.parse("Entries")
values1 = df1[['Name', 'Location', 'Date', 'Check_2',
'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', '$Volume',
'Trades']]
#Searching for every row that contains the value 'X' in the column 'Check_2'
rdf1 = values1[values1.Check_2.str.contains("X")]
#Printing dataframe to check
print("First Dataframe")
print(rdf1)
#creating a list for the class objects
Fruits = []
#Generating dataframes from classobjects
for idx, rows in rdf1.iterrows():
    fle = os.path.join('Entities', rows.Location, rows.Name, 'TwoHours.csv')
    col_list = ['Name', 'Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'Trades']
    df3 = pd.read_csv(fle, usecols=col_list, sep=";")
    Fruits.append(df3)
rdf2 = pd.concat(Fruits)
print("Printing Full Data Frame")
print(rdf2)
intradayh = rdf2[(rdf2['Time']>'15:30:00') & (rdf2['Time']<'22:00:00')]
intradayho = rdf2[(rdf2['Time']=='15:30:00')]
indexopen = pd.DataFrame(intradayho['Open'])
intradayhc = rdf2[(rdf2['Time']=='22:00:00')]
indexclose = pd.DataFrame(intradayhc['Close'])
rdf1.loc[rdf1['Open']] = rdf1.Date.map(intradayho.set_index('Date')['Open'].to_dict())
print("Open prices rdf1")
print(rdf1['Open'])
EDIT: Desired output as requested in the comments:
Name Location Date Open High Low close volume VWAP ...
0 Orange New York 20200501 5.5 5.95 5.45 5.65 6600 5.71 ...
1 Apple Minsk 20200504 3.7 3.83 3.35 3.57 4900 3.69 ...
I am going for a 1 to 1 match in 'Open', a max value in 'High', a min value in 'Low', a 1 to 1 match in 'Close', a sum for 'Volume' and 'Trades', an average for 'VWAP', and the value of 'Volume * VWAP' in '$Volume'.
Let df be your NaN dataframe and df2 your bigger dataframe with all the data.
Use groupby together with .agg() to compute multiple aggregations on multiple columns:
df2 = df2.groupby(['Name','Date']).agg(Open=('Open','first'), Close=('Close','last'), High=('High','max'), Low=('Low','min'), Volume=('Volume','sum'), VWAP=('VWAP','mean')).reset_index()
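As a minimal sketch of that aggregation, here it is run on a few rows typed in from the question (Orange on 20200501 and Apple on 20200504). The Trades sum and the '$Volume' column are additions based on the desired output described in the question; computing '$Volume' as summed Volume times averaged VWAP is an assumption about what is wanted, and the name agg is just illustrative.

```python
import pandas as pd

# Sample rows copied from the question (Orange on 20200501, Apple on 20200504).
df2 = pd.DataFrame({
    'Name':   ['Orange'] * 4 + ['Apple'] * 4,
    'Date':   [20200501] * 4 + [20200504] * 4,
    'Time':   ['15:30:00', '17:00:00', '20:00:00', '22:00:00'] * 2,
    'Open':   [5.50, 5.65, 5.80, 5.60, 3.70, 3.65, 3.50, 3.55],
    'High':   [5.85, 5.70, 5.85, 5.84, 3.97, 3.95, 3.83, 3.58],
    'Low':    [5.45, 5.50, 5.45, 5.45, 3.65, 3.50, 3.44, 3.35],
    'Close':  [5.70, 5.60, 5.81, 5.65, 3.75, 3.80, 3.60, 3.57],
    'Volume': [1500, 1600, 1700, 1800, 1000, 1200, 1300, 1400],
    'VWAP':   [5.73, 5.65, 5.73, 5.75, 3.60, 3.65, 3.73, 3.78],
    'Trades': [95, 54, 41, 62, 55, 68, 71, 81],
})

# Sort by time so 'first' and 'last' really are the opening and closing bars.
agg = (df2.sort_values(['Name', 'Date', 'Time'])
          .groupby(['Name', 'Date'])
          .agg(Open=('Open', 'first'), High=('High', 'max'),
               Low=('Low', 'min'), Close=('Close', 'last'),
               Volume=('Volume', 'sum'), VWAP=('VWAP', 'mean'),
               Trades=('Trades', 'sum'))
          .reset_index())

# Assumption: '$Volume' is the summed volume times the averaged VWAP.
agg['$Volume'] = agg['Volume'] * agg['VWAP']
print(agg)
```

Sorting before grouping matters because 'first' and 'last' simply take the first and last row of each group, so they only correspond to the 15:30 open and the 22:00 close if the rows are in time order.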
One way is then to do an inner merge and slice the updated columns
result = pd.merge(df2, df, how='inner', on=['Name', 'Date']).iloc[:,:-4]
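A variant of the merge that does not rely on a positional slice: drop the empty columns from df first, then left-merge on both keys, so no _x/_y suffixed duplicates appear. This is only a sketch; the two small frames below are hand-made stand-ins for df and the aggregated df2, and keys and filled_cols are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in for the tracker sheet: the price columns are still empty.
df = pd.DataFrame({
    'Name': ['Orange', 'Apple'],
    'Location': ['New York', 'Minsk'],
    'Date': [20200501, 20200504],
    'Check_2': ['X', 'X'],
    'Open': [np.nan, np.nan],
    'Close': [np.nan, np.nan],
})

# Stand-in for the aggregated frame (one row per Name/Date pair).
df2 = pd.DataFrame({
    'Name': ['Apple', 'Apple', 'Orange', 'Orange'],
    'Date': [20200504, 20200505, 20200501, 20200504],
    'Open': [3.70, 3.50, 5.50, 5.40],
    'Close': [3.57, 3.65, 5.65, 5.80],
})

# Drop the empty columns that df2 will supply, then match on both keys.
keys = ['Name', 'Date']
filled_cols = [c for c in df2.columns if c not in keys]
result = (df.drop(columns=filled_cols, errors='ignore')
            .merge(df2, how='left', on=keys))
print(result)
```

A left merge keeps every row of df in its original order, and rows without a match in df2 simply stay NaN instead of disappearing as they would with an inner merge.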
Alternatively, after the aggregation, use combine_first and drop all the NaNs:
result= (df.set_index('Date').combine_first(df2.set_index('Date')).reset_index())
result=result[k.notna()]
result
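On the two messages in the question: the SettingWithCopyWarning appears because rdf1 is a filtered slice of values1, and taking an explicit .copy() avoids it. The KeyError comes from rdf1.loc[rdf1['Open']], which passes the NaN values in 'Open' to .loc as row labels; assigning to the column by name is enough. To look up a single value on both 'Name' and 'Date', one option is to index the source by both columns and reindex it with the pairs from rdf1. A sketch with made-up miniature frames (the Pear row and the names opens and lookup are only there for illustration):

```python
import numpy as np
import pandas as pd

# Miniature stand-ins for the question's values1 and rdf2.
values1 = pd.DataFrame({
    'Name': ['Orange', 'Apple', 'Pear'],
    'Date': [20200501, 20200504, 20200504],
    'Check_2': ['X', 'X', np.nan],
    'Open': [np.nan, np.nan, np.nan],
})
rdf2 = pd.DataFrame({
    'Name': ['Orange', 'Orange', 'Apple', 'Apple'],
    'Date': [20200501, 20200504, 20200504, 20200505],
    'Time': ['15:30:00'] * 4,
    'Open': [5.50, 5.40, 3.70, 3.50],
})

# .copy() makes rdf1 an independent frame, so assigning a column is safe.
rdf1 = values1[values1['Check_2'].str.contains('X', na=False)].copy()

# Opening prices keyed by (Name, Date); each pair must be unique here.
opens = rdf2[rdf2['Time'] == '15:30:00'].set_index(['Name', 'Date'])['Open']

# Look up every (Name, Date) pair of rdf1; missing pairs would give NaN.
lookup = pd.MultiIndex.from_frame(rdf1[['Name', 'Date']])
rdf1['Open'] = opens.reindex(lookup).to_numpy()
print(rdf1)
```

The .to_numpy() call strips the MultiIndex from the looked-up values so they are assigned by position rather than aligned on rdf1's own row labels.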