
How to filter this Python DataFrame

Greetings. I am trying to get the smallest DataFrame that contains only valid rows.

import random

import pandas as pd

# Base frame: 30 rows of zeros
columns = ['x0', 'y0']
df_ = pd.DataFrame(0, index=range(0, 30), columns=columns)

# First example frame: 11 rows (index 0 to 10)
columns1 = ['x1', 'y1']
df = pd.DataFrame(index=range(0, 11), columns=columns1)

for index, row in df.iterrows():
    df.loc[index, "x1"] = random.randint(1, 100)
    df.loc[index, "y1"] = random.randint(1, 100)

df_ = df_.combine_first(df)

# Second example frame: 17 rows (index 0 to 16)
columns2 = ['x2', 'y2']
df = pd.DataFrame(index=range(0, 17), columns=columns2)

for index, row in df.iterrows():
    df.loc[index, "x2"] = random.randint(1, 100)
    df.loc[index, "y2"] = random.randint(1, 100)

df_ = df_.combine_first(df)

From the example, the DataFrame should output rows 0 to 10, with the rest filtered out. I have thought of keeping a counter to track the minimum row count, or using pandasql, or perhaps there is a trick to get this information from the DataFrame itself.

Actually, I will be appending 500+ files of various sizes and using the result for some analysis, so performance is a consideration.

-student of python

If you want to drop the rows that contain NaNs, use dropna (here, that leaves the first eleven rows, 0 to 10):

In [11]: df_.dropna()
Out[11]:
    x0  x1  x2  y0  y1  y2
0    0  49  58   0  68   2
1    0   2  37   0  19  71
2    0  26  95   0  12  17
3    0  87   5   0  70  69
4    0  84  77   0  70  92
5    0  71  98   0  22   5
6    0  28  95   0  70  15
7    0  31  19   0  24  31
8    0   9  37   0  55  29
9    0  30  53   0  15  45
10   0   8  61   0  74  41
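
As a smaller illustration of the same idea, here is a minimal sketch with made-up data. The frames `short`, `longer`, and `base` are hypothetical stand-ins for the asker's files, not part of the original question:

```python
import pandas as pd

# Hypothetical stand-ins for two input files of different lengths
short = pd.DataFrame({"x1": [1, 2, 3]})              # rows 0 to 2
longer = pd.DataFrame({"x2": [10, 20, 30, 40, 50]})  # rows 0 to 4

# Base frame of zeros, longer than both inputs, as in the question
base = pd.DataFrame(0, index=range(6), columns=["x0"])

combined = base.combine_first(short).combine_first(longer)

# dropna() defaults to how="any": a row is dropped if any column is NaN,
# so only the rows covered by every input (0 to 2) remain.
valid = combined.dropna()
print(valid.index.tolist())
```

Note that dropna removes any row with a missing value, so this only matches "rows up to the shortest file" if the files themselves contain no NaNs.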

However, a cleaner and faster way to do this whole process is to update just those first rows (I'm assuming the random integers are just you generating some example DataFrames).

Let's store your DataFrames in a list:

In [21]: import numpy as np
         df1 = pd.DataFrame([[1, 2], [np.nan, 4]], columns=['a', 'b'])

In [22]: df2 = pd.DataFrame([[1, 2], [5, 6], [7, 8]], columns=['a', 'c'])

In [23]: dfs = [df1, df2]

Take the minimum length:

In [24]: m = min(len(df) for df in dfs)

First, create an empty DataFrame with the desired rows and columns:

In [25]: from functools import reduce  # needed on Python 3, where reduce is no longer a builtin
         columns = reduce(lambda x, y: y.columns.union(x), dfs, [])

In [26]: res = pd.DataFrame(index=np.arange(m), columns=columns)

To do this efficiently, we use update, which makes the changes in place on just this one DataFrame*:

In [27]: for df in dfs:
             res.update(df)

In [28]: res
Out[28]:
   a  b  c
0  1  2  2
1  5  4  6

*If we didn't do this, or were using combine_first or similar, we'd most likely have lots of copying (new DataFrames being created), which would slow things down.
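
For convenience, the steps above can be collected into one script. This is a sketch rather than a definitive implementation: it assumes every frame uses the default 0-based integer index, and it builds the column union with `reduce` over the frames' columns without an initial value, a small variation on the session above:

```python
from functools import reduce  # reduce is not a builtin on Python 3

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2], [np.nan, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 2], [5, 6], [7, 8]], columns=["a", "c"])
dfs = [df1, df2]

# Number of rows in the shortest frame
m = min(len(df) for df in dfs)

# Union of all column labels across the frames
columns = reduce(lambda acc, cols: acc.union(cols), (df.columns for df in dfs))

# Empty result frame with exactly the rows and columns we want
res = pd.DataFrame(index=np.arange(m), columns=columns)

# update() modifies res in place, aligning on index and column labels
# and copying over only the non-NaN values of each frame
for df in dfs:
    res.update(df)

print(res)
```

Because update aligns on labels, rows beyond the first `m` in the longer frames are ignored rather than appended.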

Note: combine_first doesn't offer an inplace flag. You could use combine, but that is more complicated (as well as less efficient). It's also quite straightforward to use where (and update manually), which IIRC is what combine does under the hood.
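
A rough sketch of the where-based manual update mentioned in the note, under the assumption that the goal is only to fill gaps. The frames `res` and `other` are hypothetical examples. Unlike update, this keeps existing values and returns a new DataFrame instead of modifying `res` in place:

```python
import numpy as np
import pandas as pd

res = pd.DataFrame({"a": [1.0, np.nan], "b": [2.0, 4.0]})
other = pd.DataFrame({"a": [1.0, 5.0, 7.0], "c": [2.0, 6.0, 8.0]})

# Align the other frame to the rows and columns of res explicitly
aligned = other.reindex(index=res.index, columns=res.columns)

# Keep values of res where they are present; elsewhere take the aligned value
filled = res.where(res.notna(), aligned)
print(filled)
```

Since where returns a new object, the result has to be reassigned on every step, which is the copying cost that update avoids.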
