
Append data into an existing pandas dataframe at a specific location

I have found separate solutions for parts of what I want to do, but nothing that works together.

  1. I am searching a set of files (call them set1) and creating a pandas dataframe (df) with a date (as yyyy.doy), a dump number (a 2-digit number), and a start time (VC5 Start, in GPS seconds) of the data inside each file. The date and dump number are taken from the file name. The dataframe also has a blank column for the data from the next step. I have no issues here with the following code.
import netCDF4
import pandas as pd

df = pd.DataFrame(columns=['Date', 'Dump', 'VC5 Start', 'VC2 Start'])

for files in VC5filelist:
    #print(files)
    filedate_df = files[18:26]   # save date of file to a variable
    filedump_df = files[31:33]   # save dump number of file to a variable

    ds = netCDF4.Dataset(files, 'r')         # read each netCDF file into a dataset
    VC5gpsstart = ds.variables['TIME'][0]    # save first GPS timestamp of the VC5 file
    ds.close()
    # append file data to main dataframe; 'VC2 Start' stays empty until the next step
    df.loc[len(df)] = [filedate_df, filedump_df, VC5gpsstart, float('nan')]
  2. I am searching a second set of files (call them set2) for the same info; however, the start time will be VC2time.
VC2df = pd.DataFrame(columns=['Date', 'Dump', 'VC2 Start'])

for vc2files in VC2filelist:
    print(vc2files)
    vc2filedate_df = vc2files[18:26]   # save date of file to a variable
    vc2filedump_df = vc2files[31:33]   # save dump number of file to a variable
    print(vc2filedate_df + ':' + vc2filedump_df)

    dsvc2 = netCDF4.Dataset(vc2files, 'r')       # read each netCDF file into a dataset
    VC2gpsstart = dsvc2.variables['time'][0]     # save first GPS timestamp of the VC2 file
    dsvc2.close()
    # append file data to the VC2 dataframe
    VC2df.loc[len(VC2df)] = [vc2filedate_df, vc2filedump_df, VC2gpsstart]

I want to append/insert the VC2time data into the last column (VC2 Start) and use the date and dump number of each file in the second set to designate where in the dataframe the start time should go. Example:

Date       Dump       vc5start       vc2start
2022.001   05         121651215      ***456447156***

The bold and italic data is the only thing I cannot produce right now. I have been trying to find the correct row to insert my data with

row=df.index.get_loc(df.query('Date' == vc2filedate_df) and ('Dump'==vc2filedump_df).index[0])

to no avail. My next step was going to be

df.loc[row:'VC2 Start']=VC2gpsstart

What I want to know is:

A: Given the date and dump number of a file from my set2, how do I find the row of the dataframe with the same date and dump number?

B: How do I then add the VC2 start data into the VC2 Start column of the dataframe on the row found in question A?
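
Taken literally, A and B can be handled with a boolean mask and a `.loc` assignment. The following is only a minimal sketch on made-up data, assuming `Date` and `Dump` are stored as strings; the values and the names `vc2_date`, `vc2_dump`, and `vc2_start` are illustrative and not from the original files.

```python
import pandas as pd

# Toy stand-in for the dataframe built from set1 (values are made up).
df = pd.DataFrame({
    "Date": ["2022.001", "2022.001", "2022.002"],
    "Dump": ["01", "05", "01"],
    "VC5 Start": [125.0, 260.0, 35.0],
    "VC2 Start": [float("nan")] * 3,
})

# Hypothetical values parsed from one set2 file.
vc2_date, vc2_dump, vc2_start = "2022.001", "05", 261.0

# A: boolean mask selecting the row(s) with the same date and dump number.
mask = (df["Date"] == vc2_date) & (df["Dump"] == vc2_dump)

# B: write the VC2 start time into the 'VC2 Start' column of those rows.
df.loc[mask, "VC2 Start"] = vc2_start

print(df)
```

Note the parentheses around each comparison: `&` binds more tightly than `==`, so they are required. If no row matches, the mask is all `False` and nothing is written.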

@Larrybird

VC5df                         VC2df
Date     Dump   VC5time       Date    Dump  VC2time
2022.001   01    125         2022.001  01     125
2022.001   02    128         2022.001  02     130
2022.001   05    260         2022.001  05     261
2022.002   01    035         2022.002  01     035

@LarryBird, after researching merge I found a solution.

Creating dataframes:

VC5df=pd.DataFrame(columns=['Date','Dump','VC5 Start'])
VC2df=pd.DataFrame(columns=['Date','Dump','VC2 Start'])

appending data to them within loops (as above), then using

merged_df=pd.merge(VC5df,VC2df,on=["Date","Dump"])

creates the following (looking at the first and second days of 2022):

         Date Dump     VC5 Start     VC2 Start
0    2022.001   01  1325029429.0  1325029440.0
1    2022.001   02  1325030705.0  1325030760.0
2    2022.001   03  1325034031.0  1325034060.0
3    2022.001   04  1325035511.0  1325035560.0
4    2022.001   05  1325036791.0  1325036879.0
..        ...  ...           ...           ...
103  2022.002   48  1325188946.0  1325188980.0
104  2022.002   49  1325191628.0  1325191680.0
105  2022.002   50  1325192627.0  1325192640.0
106  2022.002   51  1325195052.0  1325195100.0
107  2022.002   52  1325198890.0  1325198940.0

It sounds like you might be better off doing a .join() / .merge() instead of trying to explicitly find the row index yourself. For example, if both dataframes are indexed by Date and dump, you could do df1.merge(df2, on=['Date', 'dump']) (or something to that effect).
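
As a rough illustration of the merge approach, here is a small sketch on made-up data. The column names follow the question (`Date`, `Dump`, `VC5 Start`, `VC2 Start`); the values are invented, and `how="left"` is an assumption about the desired behaviour (keep every VC5 row even when no VC2 file matches).

```python
import pandas as pd

# Made-up stand-ins for the two dataframes built from the file loops.
VC5df = pd.DataFrame({
    "Date": ["2022.001", "2022.001", "2022.002"],
    "Dump": ["01", "02", "01"],
    "VC5 Start": [125.0, 128.0, 35.0],
})
VC2df = pd.DataFrame({
    "Date": ["2022.001", "2022.002"],
    "Dump": ["01", "01"],
    "VC2 Start": [125.0, 35.0],
})

# Match rows on the (Date, Dump) pair. how="left" keeps every VC5 row;
# rows with no matching VC2 entry get NaN in 'VC2 Start'.
merged = VC5df.merge(VC2df, on=["Date", "Dump"], how="left")

print(merged)
```

With the default `how="inner"`, the unmatched row (dump 02 in this toy data) would be dropped instead of being kept with a missing value.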

If you are interested, there is an excellent summary of join() and merge() in this answer. Basically, if both dataframes have a matching index and you wish to join on the index, you can use df1.join(df2) to save typing. merge() is more flexible in that you can specify various combinations of index or columns to do the join on.
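
To make the index-based variant concrete, the sketch below sets `Date` and `Dump` as the index on both dataframes and then calls `join()`. The data is made up and the short names `vc5` and `vc2` are illustrative only.

```python
import pandas as pd

# Made-up data, indexed by the (Date, Dump) pair on both sides.
vc5 = pd.DataFrame({
    "Date": ["2022.001", "2022.001", "2022.002"],
    "Dump": ["01", "02", "01"],
    "VC5 Start": [125.0, 128.0, 35.0],
}).set_index(["Date", "Dump"])

vc2 = pd.DataFrame({
    "Date": ["2022.001", "2022.002"],
    "Dump": ["01", "01"],
    "VC2 Start": [125.0, 35.0],
}).set_index(["Date", "Dump"])

# join() aligns on the index and is a left join by default,
# so every row of vc5 is kept.
joined = vc5.join(vc2)

print(joined)
```

Calling `joined.reset_index()` turns `Date` and `Dump` back into ordinary columns if that layout is preferred.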

Also worth knowing is pd.concat (see the docs here), which is another useful function when you are combining data. In particular, it can be more efficient (and more readable) if you need to join many dataframes, since you can call it on a list of dataframes in one line instead of having to loop through and join multiple times.

Hope this helps.
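
A minimal sketch of that pattern, assuming one small dataframe per file collected in a list (the `frames` list and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical per-file dataframes, e.g. one built for each netCDF file.
frames = [
    pd.DataFrame({"Date": ["2022.001"], "Dump": ["01"], "VC5 Start": [125.0]}),
    pd.DataFrame({"Date": ["2022.001"], "Dump": ["02"], "VC5 Start": [128.0]}),
    pd.DataFrame({"Date": ["2022.002"], "Dump": ["01"], "VC5 Start": [35.0]}),
]

# Stack all frames in a single call; ignore_index=True renumbers rows 0..n-1.
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Building the list first and concatenating once avoids growing a dataframe row by row inside the loop.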
