Merging csv files using python and pandas (overlapping rows)
I try to update stock data in one csv file with new rows from another. Because of the way I retrieve this data, the rows partly overlap. The basic stock file contains (simplified example):
Mar 08, 2016	9692.82	9688.47	9785.05	9617.69	95.75M	-0.88%
Mar 07, 2016	9778.93	9764.08	9803.73	9690.00	78.15M	-0.46%
Mar 04, 2016	9824.17	9800.86	9899.11	9742.76	93.45M	0.74%
Mar 03, 2016	9751.92	9807.06	9808.52	9709.68	85.25M	-0.25%
Mar 02, 2016	9776.62	9780.84	9837.11	9695.98	106.45M	0.61%
Mar 01, 2016	9717.16	9482.66	9719.02	9471.09	99.54M	2.34%
Feb 29, 2016	9495.40	9424.93	9498.57	9332.42	93.79M	-0.19%
This file should be updated with the data from a second file:
Mar 11, 2016	9831.13	9672.05	9833.90	9642.79	118.96M	3.51%
Mar 10, 2016	9498.15	9697.64	9995.84	9498.15	177.50M	-2.31%
Mar 09, 2016	9723.09	9700.16	9838.95	9679.19	100.90M	0.31%
Mar 08, 2016	9692.82	9688.47	9785.05	9617.69	95.75M	-0.88%
Mar 07, 2016	9778.93	9764.08	9803.73	9690.00	78.15M	-0.46%
The code I use to try to achieve the update looks like this:
existingquotes = pd.read_csv(filenames_quotes[i], parse_dates=[0], infer_datetime_format=True, header=None, delimiter='\t')
newquotes = pd.read_csv(filenames_upd[i], parse_dates=[0], infer_datetime_format=True, header=None, delimiter='\t')
existingquotes.update(newquotes)
mergedquotes=existingquotes
print mergedquotes
The output looks like this:
           0        1        2        3        4        5       6
0 2016-03-11  9831.13  9672.05  9833.90  9642.79  118.96M   3.51%
1 2016-03-10  9498.15  9697.64  9995.84  9498.15  177.50M  -2.31%
2 2016-03-09  9723.09  9700.16  9838.95  9679.19  100.90M   0.31%
3 2016-03-08  9692.82  9688.47  9785.05  9617.69   95.75M  -0.88%
4 2016-03-07  9778.93  9764.08  9803.73  9690.00   78.15M  -0.46%
5 2016-03-01  9717.16  9482.66  9719.02  9471.09   99.54M   2.34%
6 2016-02-29  9495.40  9424.93  9498.57  9332.42   93.79M  -0.19%
There is a gap between 2016-03-01 and 2016-03-07. If I use
existingquotes.update(newquotes, overwrite=False)
the result looks like the original csv. Appreciate any help!
You can first add the parameter index_col=[0] to read_csv to set the first column as a DatetimeIndex, then reindex by the union of both indexes, and last use the function combine_first to fill the NaN values from the DataFrame newquotes:
print existingquotes
1 2 3 4 5 6
0
2016-03-08 9692.82 9688.47 9785.05 9617.69 95.75M -0.88%
2016-03-07 9778.93 9764.08 9803.73 9690.00 78.15M -0.46%
2016-03-04 9824.17 9800.86 9899.11 9742.76 93.45M 0.74%
2016-03-03 9751.92 9807.06 9808.52 9709.68 85.25M -0.25%
2016-03-02 9776.62 9780.84 9837.11 9695.98 106.45M 0.61%
2016-03-01 9717.16 9482.66 9719.02 9471.09 99.54M 2.34%
2016-02-29 9495.40 9424.93 9498.57 9332.42 93.79M -0.19%
print newquotes
1 2 3 4 5 6
0
2016-03-11 9831.13 9672.05 9833.90 9642.79 118.96M 3.51%
2016-03-10 9498.15 9697.64 9995.84 9498.15 177.50M -2.31%
2016-03-09 9723.09 9700.16 9838.95 9679.19 100.90M 0.31%
2016-03-08 9692.82 9688.47 9785.05 9617.69 95.75M -0.88%
2016-03-07 9778.93 9764.08 9803.73 9690.00 78.15M -0.46%
existingquotes = existingquotes.reindex(existingquotes.index.union(newquotes.index))
print existingquotes
1 2 3 4 5 6
0
2016-02-29 9495.40 9424.93 9498.57 9332.42 93.79M -0.19%
2016-03-01 9717.16 9482.66 9719.02 9471.09 99.54M 2.34%
2016-03-02 9776.62 9780.84 9837.11 9695.98 106.45M 0.61%
2016-03-03 9751.92 9807.06 9808.52 9709.68 85.25M -0.25%
2016-03-04 9824.17 9800.86 9899.11 9742.76 93.45M 0.74%
2016-03-07 9778.93 9764.08 9803.73 9690.00 78.15M -0.46%
2016-03-08 9692.82 9688.47 9785.05 9617.69 95.75M -0.88%
2016-03-09 NaN NaN NaN NaN NaN NaN
2016-03-10 NaN NaN NaN NaN NaN NaN
2016-03-11 NaN NaN NaN NaN NaN NaN
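The effect of that reindex can be reproduced on tiny hypothetical frames (the `close` column name is invented for illustration):

```python
import pandas as pd

# Tiny hypothetical frames standing in for the two quote files.
old = pd.DataFrame({'close': [1.0, 2.0]},
                   index=pd.to_datetime(['2016-03-01', '2016-03-02']))
new = pd.DataFrame({'close': [2.0, 3.0]},
                   index=pd.to_datetime(['2016-03-02', '2016-03-03']))

# Rows that exist only in `new` show up in `old` as NaN after
# reindexing on the union of both DatetimeIndexes.
old = old.reindex(old.index.union(new.index))
print(old)
```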
If overlapping values differ between the two DataFrames, you can add:
existingquotes.loc[existingquotes.index.intersection(newquotes.index),:] = np.nan
But in this sample they are the same, so it can be omitted.
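When the overlap does disagree, the effect of that np.nan line can be checked on small hypothetical frames:

```python
import pandas as pd
import numpy as np

# Hypothetical frames whose overlapping row (2016-03-02) disagrees.
a = pd.DataFrame({'x': [1.0, 5.0]},
                 index=pd.to_datetime(['2016-03-01', '2016-03-02']))
b = pd.DataFrame({'x': [9.0, 3.0]},
                 index=pd.to_datetime(['2016-03-02', '2016-03-03']))

a = a.reindex(a.index.union(b.index))
# Blank out the overlap so combine_first takes those rows from `b`.
a.loc[a.index.intersection(b.index), :] = np.nan
merged = a.combine_first(b)
print(merged)
```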
print existingquotes.combine_first(newquotes)
1 2 3 4 5 6
0
2016-02-29 9495.40 9424.93 9498.57 9332.42 93.79M -0.19%
2016-03-01 9717.16 9482.66 9719.02 9471.09 99.54M 2.34%
2016-03-02 9776.62 9780.84 9837.11 9695.98 106.45M 0.61%
2016-03-03 9751.92 9807.06 9808.52 9709.68 85.25M -0.25%
2016-03-04 9824.17 9800.86 9899.11 9742.76 93.45M 0.74%
2016-03-07 9778.93 9764.08 9803.73 9690.00 78.15M -0.46%
2016-03-08 9692.82 9688.47 9785.05 9617.69 95.75M -0.88%
2016-03-09 9723.09 9700.16 9838.95 9679.19 100.90M 0.31%
2016-03-10 9498.15 9697.64 9995.84 9498.15 177.50M -2.31%
2016-03-11 9831.13 9672.05 9833.90 9642.79 118.96M 3.51%
Instead of combine_first you can use fillna:
print existingquotes.fillna(newquotes)
1 2 3 4 5 6
0
2016-02-29 9495.40 9424.93 9498.57 9332.42 93.79M -0.19%
2016-03-01 9717.16 9482.66 9719.02 9471.09 99.54M 2.34%
2016-03-02 9776.62 9780.84 9837.11 9695.98 106.45M 0.61%
2016-03-03 9751.92 9807.06 9808.52 9709.68 85.25M -0.25%
2016-03-04 9824.17 9800.86 9899.11 9742.76 93.45M 0.74%
2016-03-07 9778.93 9764.08 9803.73 9690.00 78.15M -0.46%
2016-03-08 9692.82 9688.47 9785.05 9617.69 95.75M -0.88%
2016-03-09 9723.09 9700.16 9838.95 9679.19 100.90M 0.31%
2016-03-10 9498.15 9697.64 9995.84 9498.15 177.50M -2.31%
2016-03-11 9831.13 9672.05 9833.90 9642.79 118.96M 3.51%
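The two variants agree here only because the reindex already aligned the indexes; on their own they differ: combine_first aligns on the union of both indexes by itself, while fillna never adds rows. A small sketch with hypothetical frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1.0]}, index=pd.to_datetime(['2016-03-01']))
b = pd.DataFrame({'x': [2.0]}, index=pd.to_datetime(['2016-03-02']))

# combine_first aligns on the union of both indexes by itself,
# so the explicit reindex is not strictly needed before it.
print(a.combine_first(b))

# fillna only fills NaN cells in rows that `a` already has and
# never adds rows, which is why the reindex step comes first.
print(a.fillna(b))
```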
Thank you all, it worked like a charm. The final code looks like this:
existingquotes = pd.read_csv(filenames_quotes[i], index_col=[0], parse_dates=[0], infer_datetime_format=True, header=None, delimiter='\t')
newquotes = pd.read_csv(filenames_upd[i], index_col=[0], parse_dates=[0], infer_datetime_format=True, header=None, delimiter='\t')
existingquotes = existingquotes.reindex(existingquotes.index.union(newquotes.index))
existingquotes = existingquotes.fillna(newquotes)
print existingquotes
and leads to the intended result (the same as jezrael posted).
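For reference, the whole pipeline can be exercised without files by reading the tab-delimited text from StringIO. This is a minimal sketch with synthetic two-row samples mirroring the data above, written for Python 3 and without infer_datetime_format (which newer pandas deprecates):

```python
import pandas as pd
from io import StringIO

# Synthetic stand-ins for the two tab-delimited quote files.
old_txt = ("Mar 08, 2016\t9692.82\t9688.47\t9785.05\t9617.69\t95.75M\t-0.88%\n"
           "Mar 07, 2016\t9778.93\t9764.08\t9803.73\t9690.00\t78.15M\t-0.46%\n")
new_txt = ("Mar 09, 2016\t9723.09\t9700.16\t9838.95\t9679.19\t100.90M\t0.31%\n"
           "Mar 08, 2016\t9692.82\t9688.47\t9785.05\t9617.69\t95.75M\t-0.88%\n")

def read_quotes(text):
    # Same read_csv options as the final code above, minus the
    # deprecated infer_datetime_format.
    return pd.read_csv(StringIO(text), index_col=[0], parse_dates=[0],
                       header=None, delimiter='\t')

existing, new = read_quotes(old_txt), read_quotes(new_txt)
merged = existing.reindex(existing.index.union(new.index)).fillna(new)
print(merged)
```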