
Pandas merge or join in smaller dataframe

I have one long dataframe and one short dataframe, and I want to merge them so that the shorter dataframe repeats itself to fill the length of the longer (left) df.

df1:

| Index  | Wafer | Chip | Value |
---------------------------------
| 0      | 1     | 32   | 0.99  |
| 1      | 1     | 33   | 0.89  |
| 2      | 1     | 39   | 0.96  |
| 3      | 2     | 32   | 0.81  |
| 4      | 2     | 33   | 0.87  |

df2:

| Index  |   x   |   y  |
-------------------------
| 0      |   1   |   3  |
| 1      |   2   |   2  |
| 2      |   1   |   6  |


df_combined:

| Index  | Wafer | Chip | Value |   x   |   y   |
-------------------------------------------------
| 0      | 1     | 32   | 0.99  |   1   |   3   |
| 1      | 1     | 33   | 0.89  |   2   |   2   |
| 2      | 1     | 39   | 0.96  |   1   |   6   |
| 3      | 2     | 32   | 0.81  |   1   |   3   |  <--- auto-repeats...
| 4      | 2     | 33   | 0.87  |   2   |   2   |

Is this a built-in join/merge type, or does it require a loop of some sort?

{This is just dummy data; the real dfs are over 1000 rows...}

My current code is a simple outer merge, but it doesn't fill/repeat to the end:

df = main.merge(df_coords, left_index=True, right_index=True, how='outer')

This just gives NaNs for the rows beyond the end of the shorter dataframe.
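To make the symptom concrete, here is a minimal reproduction. It assumes `main` and `df_coords` look like df1 and df2 above and both carry a default 0..n-1 index; the exact frames are reconstructed from the tables, not taken from the real data.

```python
import pandas as pd

# Assumed stand-ins for the real frames, rebuilt from the tables above
main = pd.DataFrame({
    "Wafer": [1, 1, 1, 2, 2],
    "Chip": [32, 33, 39, 32, 33],
    "Value": [0.99, 0.89, 0.96, 0.81, 0.87],
})
df_coords = pd.DataFrame({"x": [1, 2, 1], "y": [3, 2, 6]})

# Index-on-index outer merge: only index labels 0, 1, 2 exist in df_coords,
# so labels 3 and 4 find no partner and get NaN in x and y
df = main.merge(df_coords, left_index=True, right_index=True, how="outer")
print(df)
```

The merge aligns on index labels only, so nothing in it tells pandas to wrap around to the start of the shorter frame.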

I've checked around, including these questions:

Merge two python pandas data frames of different length but keep all rows in output data frame
pandas: duplicate rows from small dataframe to large based on cell value

It feels like this could be an argument somewhere in a merge function, but I can't find it. Any help gratefully received.

Thanks

You can repeat df2 until it's at least as long as df1, then reset the index, trim to the length of df1, and merge:

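import math
import pandas as pd

# Number of copies of df2 needed to cover df1; round up so we never fall short
n_copies = math.ceil(len(df1) / len(df2))
repeated = (pd.concat([df2] * n_copies)
              .reset_index(drop=True)   # fresh 0..n-1 index, old labels discarded
              .iloc[:len(df1)])         # trim to df1's length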

repeated
   x  y
0  1  3
1  2  2
2  1  6
3  1  3
4  2  2

df1.merge(repeated, how="outer", left_index=True, right_index=True)
   Wafer  Chip  Value  x  y
0      1    32   0.99  1  3
1      1    33   0.89  2  2
2      1    39   0.96  1  6
3      2    32   0.81  1  3
4      2    33   0.87  2  2

A little hacky, but it should work.

Note: I'm assuming your Index column is not actually a column, but is intended to represent the data frame index. I'm making this assumption because you use the left_index / right_index args in your merge() code. If Index is actually its own column, this code will basically still work; you'll just need to drop Index as well if you don't want it in the final df.
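If Index does turn out to be a real column, one way to handle it is to promote it to the index on both frames before repeating. This is only a sketch under that assumption: the data is rebuilt from the tables in the question, and the names `left`, `right`, and `combined` are illustrative.

```python
import math
import pandas as pd

# Assumed layout: "Index" is an ordinary column in both frames
df1 = pd.DataFrame({
    "Index": [0, 1, 2, 3, 4],
    "Wafer": [1, 1, 1, 2, 2],
    "Chip": [32, 33, 39, 32, 33],
    "Value": [0.99, 0.89, 0.96, 0.81, 0.87],
})
df2 = pd.DataFrame({"Index": [0, 1, 2], "x": [1, 2, 1], "y": [3, 2, 6]})

# Promote the column to the actual index so it does not end up duplicated
left = df1.set_index("Index")
right = df2.set_index("Index")

# Repeat the short frame enough times, then trim to the long frame's length
reps = math.ceil(len(left) / len(right))
repeated = pd.concat([right] * reps, ignore_index=True).iloc[:len(left)]

# Reuse the long frame's index labels so the join lines up row by row
repeated.index = left.index
combined = left.join(repeated)
print(combined)
```

Assigning `left.index` to the repeated frame makes the join safe even when the long frame's labels are not a plain 0..n-1 range.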

You can achieve this with a left join on the value of df1["Index"] modulo the length of df2 (this treats Index as a regular column in both dataframes):

# Create modular index values on df1 (assumes "Index" is an integer column
# in both frames, so the merge keys share a dtype)
n = df2.shape[0]
df1["Modular Index"] = df1["Index"] % n

# Merging dataframes
df_combined = df1.merge(df2, how="left", left_on="Modular Index", right_on="Index")

# Dropping unnecessary columns
df_combined = df_combined.drop(["Modular Index", "Index_y"], axis=1)

print(df_combined)

   Index_x  Wafer  Chip  Value  x  y
0        0      1    32   0.99  1  3
1        1      1    33   0.89  2  2
2        2      1    39   0.96  1  6
3        3      2    32   0.81  1  3
4        4      2    33   0.87  2  2
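For comparison, the same modulo idea can be applied by row position instead of through a helper column, which avoids needing an Index column at all. This is a sketch rather than part of the answer above; it assumes the two frames are rebuilt from the question's tables, and `positions` and `repeated` are illustrative names.

```python
import numpy as np
import pandas as pd

# Assumed frames, rebuilt from the tables in the question
df1 = pd.DataFrame({
    "Wafer": [1, 1, 1, 2, 2],
    "Chip": [32, 33, 39, 32, 33],
    "Value": [0.99, 0.89, 0.96, 0.81, 0.87],
})
df2 = pd.DataFrame({"x": [1, 2, 1], "y": [3, 2, 6]})

# Row positions into df2 that cycle 0, 1, 2, 0, 1, ... for the length of df1
positions = np.arange(len(df1)) % len(df2)

# Pick those rows by position, then give them df1's index labels
repeated = df2.iloc[positions].reset_index(drop=True)
repeated.index = df1.index

# Side-by-side concatenation aligns on the (now identical) index
df_combined = pd.concat([df1, repeated], axis=1)
print(df_combined)
```

Because the selection is positional, this works regardless of what labels either frame carries, and no temporary columns need to be dropped afterwards.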
