
How to group by a multi-index (the initial numeric index plus other columns) in a pandas DataFrame?

I am working with groupby on a pandas DataFrame. I want to group the data so that, no matter how many times I query the data and write it to MySQL, the repeated writes do not mess with my raw data.

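import pandas as pd

df1 = pd.DataFrame(df)  # new rows; one 'Symbol' can have several different 'Open' values
df2 = pd.read_sql('select * from 6openposition', con=conn)  # rows already stored in MySQL
df2 = pd.concat([df2, df1])  # DataFrame.append was removed in pandas 2.0
df2 = df2.groupby(['Symbol']).agg({'Open': 'first'})  # keeps only one row per 'Symbol'
df2.to_sql(name='6openposition', con=conn, if_exists='replace', index=False)  # the old 'flavor' argument no longer exists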

#Example Raw Data:
   Symbol   Open
0    A       10
1    AA      20
2    AA      30
3    AAA     40
4    AAA     50
5    AAA     50

#After I query the data multiple times (each result is appended):
   Symbol   Open
0    A       10
1    AA      20
2    AA      30
3    AAA     40
4    AAA     50
5    AAA     50
0    AA      30
1    AAA     40
2    AAA     50
3    AAA     50
4    AAA     60

#What my code ended up with:
   Symbol   Open
0    A       10
1    AA      20
2    AAA     40

#What I want:
   Symbol   Open
0    A       10
1    AA      20
2    AA      30
3    AAA     40
4    AAA     50
5    AAA     50
6    AAA     60

My raw data can have multiple values in column 'Open' for the same 'Symbol'. When I remove the effect of writing to MySQL multiple times, the raw data is affected as well.

My idea for solving this is to group by the initial index and 'Symbol' at the same time, because after the append the initial indices can serve as another group-by key. The initial indices are [0, 1, 2, ...]. If both the 'Symbol' and the initial index are the same, I can take the first value of 'Open' in that group. To group by the initial indices I could write:

df2 = df2.groupby(level=0).agg({'Open': 'first'})
# this combines the rows that share an index label and keeps the first 'Open' value of each

But I have no idea how to combine 'level=0' with 'Symbol'. Could you show me how to group by both the initial index and another column? Or suggest another way to keep repeated writes from messing with my raw data?

Starting with df, including your index, which seems to indicate whether rows are repeated:

  Symbol  Open
0      A    10
1     AA    20
2     AA    30
3    AAA    40
4    AAA    50
5    AAA    50
2     AA    30
3    AAA    40
4    AAA    50
5    AAA    50

Use

df.reset_index().drop_duplicates().drop('index', axis=1)

(drop_duplicates keeps the first occurrence by default) to get:

  Symbol  Open
0      A    10
1     AA    20
2     AA    30
3    AAA    40
4    AAA    50
5    AAA    50
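
As a rough, self-contained sketch of the above, the snippet below rebuilds the example frame and applies the same call. The names stored, repeated, deduped and grouped are made up for illustration, and the in-memory frames stand in for the MySQL table. It also shows one possible way to do what the question literally asks, grouping by the index level and 'Symbol' together with pd.Grouper(level=0); that second part is an alternative, not part of the answer above.

```python
import pandas as pd

# Hypothetical stand-in for the rows already stored in the MySQL table.
stored = pd.DataFrame(
    {
        "Symbol": ["A", "AA", "AA", "AAA", "AAA", "AAA"],
        "Open": [10, 20, 30, 40, 50, 50],
    }
)

# The same rows fetched a second time, still carrying their original index labels.
repeated = stored.iloc[2:]

# Combined frame with repeated index labels 2..5, as in the answer's example.
df = pd.concat([stored, repeated])

# Approach from the answer: turn the index into a column so it takes part in
# the duplicate check, drop exact repeats, then discard the helper column.
deduped = df.reset_index().drop_duplicates().drop(columns="index")

# Alternative: group by index level 0 and the 'Symbol' column at the same time,
# keeping the first 'Open' value of every (index, Symbol) pair.
grouped = (
    df.groupby([pd.Grouper(level=0), "Symbol"])["Open"]
    .first()
    .reset_index(level="Symbol")
)

print(deduped)
print(grouped)
```

Both results keep the two AAA rows with Open 50, because those rows have different index labels (4 and 5). Either approach relies on a repeated row carrying the same index label as its original; if the second batch is renumbered from 0, the labels no longer line up and rows will not be matched correctly.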

