
Plot only unique rows from large pandas dataframe

I have a pandas dataframe of 434300 rows with the following structure:

       x    y        p1  p2 
1      8.0  1.23e-6  10  12
2      7.9  4.93e-6  10  12
3      7.8  7.10e-6  10  12
...
4576   8.0  8.85e-6  5   16
4577   7.9  2.95e-6  5   16
4778   7.8  3.66e-6  5   16
...
...
...
434300 ...

The key point is that every block of varying x, y data has values of p1 and p2 that do not vary. Note that these blocks of constant p1, p2 are of varying length, so it is not simply a matter of slicing the data every n rows.
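For illustration only, a frame with this shape can be sketched as follows. The (p1, p2) values, the block lengths, and the random y values are made up for the example and are not taken from the real dataset; grouping on the two constant columns then shows how many rows each block contains.

```python
import numpy as np
import pandas as pd

# Hypothetical (p1, p2) settings and block lengths, chosen only for illustration.
blocks = [((10, 12), 4), ((5, 16), 2), ((7, 3), 5)]

rng = np.random.default_rng(0)
frames = []
for (p1, p2), n in blocks:
    frames.append(pd.DataFrame({
        "x": np.linspace(8.0, 7.0, n),   # x varies within a block
        "y": rng.random(n) * 1e-5,       # y varies within a block
        "p1": p1,                        # constant within a block
        "p2": p2,                        # constant within a block
    }))
df = pd.concat(frames, ignore_index=True)

# Rows per (p1, p2) pair; sort=False keeps the order of first appearance.
block_sizes = df.groupby(["p1", "p2"], sort=False).size()
print(block_sizes)
```

Because the block lengths differ, selecting every n-th row would not pick exactly one row per block, which is why a duplicate-based approach is needed.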

I would like to plot the values p1 vs p2 in a graph, but would only like to plot the unique points.

If I do plot p1 vs p2 using:

In [1]: fig=plt.figure()
In [2]: ax=plt.subplot(111)
In [3]: ax.plot(df['p1'],df['p2'])
In [4]: len(ax.lines[0].get_xdata())
Out[4]: 434300

I see that matplotlib is plotting every individual row of data, which is to be expected.

What is the neatest way to plot only the unique points from columns p1 and p2?

Here is a csv of a small example dataset that has all of the important features of my dataset.

Just drop the duplicates and plot:

df.drop_duplicates(subset=['p1', 'p2']).plot(x='p1', y='p2')

You can slice the p1 and p2 columns from the data frame and then drop duplicates before plotting.

sub_df = df[['p1','p2']].drop_duplicates()
fig, ax = plt.subplots(1,1)
ax.plot(sub_df['p1'],sub_df['p2'])

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('exampleData.csv')

d = data[['p1', 'p2']].drop_duplicates()

plt.plot(d['p1'], d['p2'], 'o')
plt.show()


After looking at this answer to a similar question in R (which is what the pandas dataframes are based on), I found the pandas function pandas.DataFrame.drop_duplicates. If we modify my example code as follows:

In [1]: fig=plt.figure()
In [2]: ax=plt.subplot(111)
In [3]: df.drop_duplicates(subset=['p1','p2'],inplace=True)
In [4]: ax.plot(df['p1'],df['p2'])
In [5]: len(ax.lines[0].get_xdata())
Out[5]: 15

We see that this restricts df to only the unique points to be plotted. An important point is that you must pass a subset to drop_duplicates so that it only uses those columns to determine duplicate rows.
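The effect of subset can be checked on a tiny frame. This is a minimal sketch: the five rows below are a hypothetical miniature of the data in the question, and the variable names all_columns and unique_points are made up for the example. It also omits inplace=True, so the original df is left untouched.

```python
import pandas as pd

# Hypothetical miniature of the question's data: two (p1, p2) blocks.
df = pd.DataFrame({
    "x":  [8.0, 7.9, 7.8, 8.0, 7.9],
    "y":  [1.23e-6, 4.93e-6, 7.10e-6, 8.85e-6, 2.95e-6],
    "p1": [10, 10, 10, 5, 5],
    "p2": [12, 12, 12, 16, 16],
})

# Without subset, all columns are compared, so every row counts as unique here.
all_columns = df.drop_duplicates()

# With subset, only p1 and p2 are compared; the first row of each block is kept.
unique_points = df.drop_duplicates(subset=["p1", "p2"])

print(len(all_columns), len(unique_points))  # prints: 5 2
```

Returning a new frame instead of modifying df in place keeps the x and y data of the other rows available for later use.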

