
Merge two datasets in Python

I have two sets of xy data whose x values should be merged. To illustrate, the first set looks like this:

0.5;3.4
0.8;3.8
0.9;1.2
1.3;1.1
1.9;2.3

The second set looks like this:

0.3;-0.2
0.8;-0.9
1.0;0.1
1.5;1.2
1.6;6.3

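The data sit in two separate csv files. I would like to merge both files into one so that the x values are in order and the y values appear in two columns, with the missing entries (y1 and y2 below) filled in by linear interpolation. The second column contains the y values of the first dataset (plus the interpolated values), and the third column contains the y values of the second dataset.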

0.3;y1;-0.2
0.5;3.4;y2
0.8;3.8;-0.9
0.9;1.2;y2
1.0;y1;0.1
1.3;1.1;y2
1.5;y1;1.2
1.6;y1;6.3
1.9;2.3;y2

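My only idea so far is to read the data into numpy arrays, concatenate them, sort the values, and, wherever a value is empty, calculate the average of the preceding and following values.

Is there a more elegant way to do this in Python?

Edit: Here is my attempt. The script is rather long, but it works and gives the result I had in mind.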

#-*- coding: utf-8 -*-

import numpy as np
from matplotlib import pyplot as plt
from scipy.interpolate import interp1d
import csv

# Read data files and turn them into numpy array for further processing
def read_datafile(file_name):
    data = np.loadtxt(file_name, delimiter=";")
    return data

data1 = read_datafile("testcsv1.csv")
data2 = read_datafile("testcsv2.csv")

# Add empty column at the appropriate position
emptycol1 = np.empty((len(data1), 3))
emptycol1[:] = np.nan
emptycol2 = np.empty((len(data2), 3))
emptycol2[:] = np.nan
emptycol1[:,:-1] = data1
emptycol2[:,[0, 2]] = data2

# Merge and sort the data sets. Create empty array to add final results
merged_temp = np.concatenate((emptycol1, emptycol2))
merged_temp = np.array(sorted(merged_temp, key = lambda x: float(x[0])))
merged = np.empty((1, 3))

# Check for entries where the x values already match. Merge those into one row
i = 0
while i < len(merged_temp)-1:
    if merged_temp[i, 0] == merged_temp[i+1, 0]:
        newrow = np.array([merged_temp[i, 0], merged_temp[i, 1], merged_temp[i+1, 2]])
        merged = np.vstack((merged, newrow))
        i += 2
    else:
        newrow = np.array([merged_temp[i, 0], merged_temp[i, 1], merged_temp[i, 2]])
        merged = np.vstack((merged, newrow))
        i += 1

# Check for so far undefined values (gaps in the data). Interpolate between them (linearly)
for i in range(len(merged)-1):
    # First y column
    if np.isnan(merged[i, 1]) == True:
        # If only one value is missing (maybe not necessary to separate this case)
        if (np.isnan(merged[i-1, 1]) == False) and (np.isnan(merged[i+1, 1]) == False):
            merged[i, 1] = (merged[i-1, 1] + merged[i+1, 1])/2
        # If two or more values are missing
        elif np.isnan(merged[i, 1]) == True:
            l = 0
            while (np.isnan(merged[i+l, 1]) == True) and (i+l != len(merged)-1):
                l += 1
            x1 = np.array([i-1, i+l])                       # endpoints
            x = np.linspace(i, i+l-1, l, endpoint=True)     # missing points
            y = np.array([merged[i-1, 1], merged[i+l, 1]])  # values at endpoints
            f = interp1d(x1, y)                             # linear interpolation
            for k in x:
                # np.linspace yields floats; array indices must be integers
                merged[int(round(k)), 1] = f(k)
    # Second y column
    if np.isnan(merged[i, 2]) == True:
        # If only one value is missing
        if (np.isnan(merged[i-1, 2]) == False) and (np.isnan(merged[i+1, 2]) == False):
            merged[i, 2] = (merged[i-1, 2] + merged[i+1, 2])/2
        # If two or more values are missing
        elif np.isnan(merged[i, 2]) == True:
            l = 0
            while (np.isnan(merged[i+l, 2]) == True) and (i+l != len(merged)-1):
                l += 1
            x1 = np.array([i-1, i+l])                       # endpoints
            x = np.linspace(i, i+l-1, l, endpoint=True)     # missing points
            y = np.array([merged[i-1, 2], merged[i+l, 2]])  # values at endpoints
            f = interp1d(x1, y)                             # linear interpolation
            for k in x:
                # np.linspace yields floats; array indices must be integers
                merged[int(round(k)), 2] = f(k)

# Remove lines which still have "nan" values (beginning and end). This could be prevented by an extrapolation
merged = merged[~np.isnan(merged).any(axis=1)]
merged = np.delete(merged, (0), axis=0)

# Write table to new csv file in the same directory
with open("testcsv_merged.csv", "w", newline="") as mergedfile:
    writer = csv.writer(mergedfile)
    writer.writerows(merged)
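The two gap-filling loops in the script above can probably be condensed. What follows is only a minimal sketch, not the asker's method: the helper name fill_gaps is hypothetical, and it assumes the x column is already sorted in ascending order. Unlike the script, it interpolates against the x values rather than the row positions, and it leaves gaps at the beginning and end as NaN.

```python
import numpy as np

def fill_gaps(x, y):
    """Return a copy of y with NaN entries linearly interpolated against x.

    Hypothetical helper: assumes x is sorted in ascending order. NaNs that lie
    outside the range of the known points stay NaN (no extrapolation).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    known = ~np.isnan(y)
    filled = y.copy()
    # np.interp evaluates the piecewise-linear function through the known
    # points; left/right=np.nan keeps points outside that range undefined.
    filled[~known] = np.interp(x[~known], x[known], y[known],
                               left=np.nan, right=np.nan)
    return filled

# First six x values of the merged example, with the first dataset's y values
x = np.array([0.3, 0.5, 0.8, 0.9, 1.0, 1.3])
y1 = np.array([np.nan, 3.4, 3.8, 1.2, np.nan, 1.1])
y1_filled = fill_gaps(x, y1)
```

Applied to each y column of the merged array, this would replace both "First y column" and "Second y column" branches; the rows that still contain NaN can then be dropped as in the script.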

I would use pandas for this kind of processing:

import pandas as pd
# Assumes the data files have no header row
df1 = pd.read_csv('./dataset1.txt', sep=';', header=None)
df2 = pd.read_csv('./dataset2.txt', sep=';', header=None)
# Full outer join of the two datasets on the first column
df_merged = df1.merge(df2, on=0, how='outer')
# Fill the nulls with the desired values, in this case the mean of each column.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
df_merged['1_x'] = df_merged['1_x'].fillna(df_merged['1_x'].mean())
df_merged['1_y'] = df_merged['1_y'].fillna(df_merged['1_y'].mean())

Output:

print(df_merged)
    0   1_x 1_y
0   0.5 3.4 y2
1   0.8 3.8 -0.9
2   0.9 1.2 y2
3   1.3 1.1 y2
4   1.9 2.3 y2
5   0.3 y1  -0.2
6   1.0 y1  0.1
7   1.5 y1  1.2
8   1.6 y1  6.3

You can easily change the column names:

df_merged.columns = ['col1','col2','col3']

You can also easily sort the values with the sort_values method:

df_merged.sort_values('col1')

Finally, you can easily convert the final DataFrame to a numpy array with:

import numpy as np
np.array(df_merged)
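The renaming, sorting and conversion steps can be chained into one runnable sketch. The sample data from the question is inlined through io.StringIO so that no files are needed; the names csv1, csv2, df_sorted and arr are illustrative, not part of the answer. Note that sort_values returns a new DataFrame, so its result has to be assigned.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch runs without files
csv1 = "0.5;3.4\n0.8;3.8\n0.9;1.2\n1.3;1.1\n1.9;2.3\n"
csv2 = "0.3;-0.2\n0.8;-0.9\n1.0;0.1\n1.5;1.2\n1.6;6.3\n"
df1 = pd.read_csv(io.StringIO(csv1), sep=";", header=None)
df2 = pd.read_csv(io.StringIO(csv2), sep=";", header=None)

# Full outer join on the x column, then give the columns readable names
df_merged = df1.merge(df2, on=0, how="outer")
df_merged.columns = ["col1", "col2", "col3"]

# sort_values does not modify df_merged; keep the returned DataFrame
df_sorted = df_merged.sort_values("col1").reset_index(drop=True)

# 2-D float array with one row per distinct x value (NaN where a y is missing)
arr = df_sorted.to_numpy()
```

At this point arr still holds NaN wherever only one dataset has a value for that x; the interpolation is a separate step.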

A one-liner: dfi = pd.merge(df1, df2, how='outer', on=0).set_index(0).sort_index().interpolate()

In [383]: dfi
Out[383]: 
      1_x   1_y
0              
0.3   NaN -0.20
0.5  3.40 -0.55
0.8  3.80 -0.90
0.9  1.20 -0.40
1.0  1.15  0.10
1.3  1.10  0.65
1.5  1.50  1.20
1.6  1.90  6.30
1.9  2.30  6.30
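In the output above, interpolate() uses its default method, which is linear in row position: y1 at x = 1.0 comes out as 1.15, the plain average of its neighbours 1.2 and 1.1, regardless of how far apart the x values are. If the interpolation should respect the x spacing, method="index" is one option. The following is a sketch on the question's sample data; the column names y1 and y2 and the variables by_position and by_x are illustrative, and limit_area="inside" is my own choice so that the edges stay NaN instead of being filled.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch runs without files
csv1 = "0.5;3.4\n0.8;3.8\n0.9;1.2\n1.3;1.1\n1.9;2.3\n"
csv2 = "0.3;-0.2\n0.8;-0.9\n1.0;0.1\n1.5;1.2\n1.6;6.3\n"
df1 = pd.read_csv(io.StringIO(csv1), sep=";", header=None)
df2 = pd.read_csv(io.StringIO(csv2), sep=";", header=None)

merged = pd.merge(df1, df2, how="outer", on=0).set_index(0).sort_index()
merged.columns = ["y1", "y2"]

# Default method: linear in row position (rows treated as equally spaced)
by_position = merged.interpolate()

# method="index": linear in the x values stored in the index;
# limit_area="inside" fills only gaps surrounded by valid values
by_x = merged.interpolate(method="index", limit_area="inside")
```

With the x spacing taken into account, y1 at x = 1.0 becomes 1.175 instead of 1.15, and the first y1 and last y2 entries remain NaN.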

A full pandas version, with SciPy interpolation for better handling at the edges:

#df1 = pd.read_clipboard(header=None,sep=';')
#df2 = pd.read_clipboard(header=None,sep=';')

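import pandas as pd
import pylab as pl
import scipy.interpolate

df = pd.merge(df1, df2, how='outer', on=0).sort_values(0)
# interp1d(x, y) builds an interpolating function from each dataset;
# fill_value='extrapolate' allows evaluating it outside the original x range
df['y1'] = scipy.interpolate.interp1d(*df1.values.T, fill_value='extrapolate')(df[0])
df['y2'] = scipy.interpolate.interp1d(*df2.values.T, fill_value='extrapolate')(df[0])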

ax=pl.gca()
df1.set_index(0).plot(lw=0,marker='o',ax=ax)
df2.set_index(0).plot(lw=0,marker='o',ax=ax)
df.set_index(0).loc[:,['y1','y2']].plot(ax=ax)    
pl.show()

Plot:

[image: the points of both datasets as markers, with the interpolated y1 and y2 curves]

Data:

In [344]: df1
Out[344]: 
     0    1
0  0.5  3.4
1  0.8  3.8
2  0.9  1.2
3  1.3  1.1
4  1.9  2.3

In [345]: df2
Out[345]: 
     0    1
0  0.3 -0.2
1  0.8 -0.9
2  1.0  0.1
3  1.5  1.2
4  1.6  6.3

In [346]: df
Out[346]: 
     0  1_x  1_y         y1         y2
5  0.3  NaN -0.2 -20.713281  -0.200000
0  0.5  3.4  NaN   3.400000  -3.021563
1  0.8  3.8 -0.9   3.800000  -0.900000
2  0.9  1.2  NaN   1.200000  -0.092830
6  1.0  NaN  0.1  -0.265527   0.100000
3  1.3  1.1  NaN   1.100000  -1.960323
7  1.5  NaN  1.2   3.760937   1.200000
8  1.6  NaN  6.3   4.701230   6.300000
4  1.9  2.3  NaN   2.300000  44.318059
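The y1 and y2 values in the table above do not lie on straight lines between neighbouring points (for example, y1 = -0.265527 at x = 1.0, between 1.2 and 1.1), so they differ from a piecewise-linear result. For the linear interpolation the question asks for, a numpy-only sketch is possible as well. It assumes each dataset is sorted by x, and it uses np.interp, which holds the first and last y value outside the data range rather than extrapolating. The output file name is taken from the question's script, and the write is left commented out.

```python
import numpy as np

# Sample data from the question as (x, y) rows
data1 = np.array([[0.5, 3.4], [0.8, 3.8], [0.9, 1.2], [1.3, 1.1], [1.9, 2.3]])
data2 = np.array([[0.3, -0.2], [0.8, -0.9], [1.0, 0.1], [1.5, 1.2], [1.6, 6.3]])

# Sorted union of both x columns; shared values such as 0.8 appear once
x = np.union1d(data1[:, 0], data2[:, 0])

# Piecewise-linear interpolation of each dataset onto the common x grid.
# Outside a dataset's x range np.interp repeats its first/last y value.
y1 = np.interp(x, data1[:, 0], data1[:, 1])
y2 = np.interp(x, data2[:, 0], data2[:, 1])

merged = np.column_stack((x, y1, y2))

# Uncomment to write the result with the same ';' delimiter as the input
# np.savetxt("testcsv_merged.csv", merged, delimiter=";", fmt="%g")
```

Here y1 at x = 1.0 is 1.175 and y2 at x = 0.5 is -0.48; the edge rows (y1 at x = 0.3, y2 at x = 1.9) simply carry the nearest known value.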
