Filtering specific data points from a dataframe based on a given condition

I have a DataFrame like the one below:

+----------+-------+-------+-------+-------+-------+
|   Date   | Loc 1 | Loc 2 | Loc 3 | Loc 4 | Loc 5 |
+----------+-------+-------+-------+-------+-------+
| 1-Jan-19 |    50 |     0 |    40 |    80 |    60 |
| 2-Jan-19 |    60 |    80 |    60 |    80 |    90 |
| 3-Jan-19 |    80 |    20 |     0 |    50 |    30 |
| 4-Jan-19 |    90 |    20 |    10 |    90 |    20 |
| 5-Jan-19 |    80 |     0 |    10 |    10 |     0 |
| 6-Jan-19 |   100 |    90 |   100 |     0 |    10 |
| 7-Jan-19 |    20 |    10 |    30 |    20 |     0 |
+----------+-------+-------+-------+-------+-------+

I want to extract every data point (row label and column label) whose value is zero and produce a new dataframe.

My desired output is as follows:

+--------------+----------------+
| Missing Date | Missing column |
+--------------+----------------+
| 1-Jan-19     | Loc 2          |
| 3-Jan-19     | Loc 3          |
| 5-Jan-19     | Loc 2          |
| 5-Jan-19     | Loc 5          |
| 6-Jan-19     | Loc 4          |
| 7-Jan-19     | Loc 5          |
+--------------+----------------+

Note that for 5-Jan-19 there are two entries: Loc 2 and Loc 5.

I know how to do this in Excel VBA, but I'm looking for a more scalable solution with python-pandas.

So far I have attempted the code below:

import pandas as pd

df = pd.read_csv('data.csv')

new_df = pd.DataFrame(columns=['Missing Date','Missing Column'])

for c in df.columns:
    if c != 'Date':
        if df[df[c] == 0]:
            new_df.append(df[c].index, c)

I'm new to pandas, so please guide me on how to solve this.

melt + query

(df.melt(id_vars='Date', var_name='Missing column')
   .query('value == 0')
   .drop(columns='value')
)

        Date Missing column
7   1-Jan-19          Loc 2
11  5-Jan-19          Loc 2
16  3-Jan-19          Loc 3
26  6-Jan-19          Loc 4
32  5-Jan-19          Loc 5
34  7-Jan-19          Loc 5
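
To see why this works, it may help to look at what melt returns before the filter is applied. The following is a minimal sketch, assuming the question's table is rebuilt inline with string dates; the df construction and the names long_df and zeros are illustrative stand-ins, since data.csv itself is not shown.

```python
import pandas as pd

# Assumed stand-in for data.csv: the table from the question, dates kept as strings.
df = pd.DataFrame({
    'Date': ['1-Jan-19', '2-Jan-19', '3-Jan-19', '4-Jan-19',
             '5-Jan-19', '6-Jan-19', '7-Jan-19'],
    'Loc 1': [50, 60, 80, 90, 80, 100, 20],
    'Loc 2': [0, 80, 20, 20, 0, 90, 10],
    'Loc 3': [40, 60, 0, 10, 10, 100, 30],
    'Loc 4': [80, 80, 50, 90, 10, 0, 20],
    'Loc 5': [60, 90, 30, 20, 0, 10, 0],
})

# melt turns the wide table into one row per (Date, column) pair:
# 7 dates x 5 location columns = 35 rows, with the cell stored in 'value'.
long_df = df.melt(id_vars='Date', var_name='Missing column')
print(long_df.shape)           # (35, 3)
print(list(long_df.columns))   # ['Date', 'Missing column', 'value']

# Keeping only the zero cells leaves the (date, column) pairs of interest.
zeros = long_df.query('value == 0').drop(columns='value')
print(zeros)
```

The surviving index labels are positions in the melted frame, which is why they are not consecutive; reset_index(drop=True) would renumber them if needed.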

Melt the dataframe using the date column as id_vars, then filter where the value is zero (e.g. using .loc[lambda x: x['value'] == 0]). Now it is just clean-up:

  • sort values on Date and Missing column
  • drop the value column (it contains only zeros)
  • rename Date to Missing Date
  • reset the index, dropping the original

df = pd.DataFrame({
    'Date': pd.date_range('2019-1-1', '2019-1-7'),
    'Loc 1': [50, 60, 80, 90, 80, 100, 20],
    'Loc 2': [0, 80, 20, 20, 0, 90, 10],
    'Loc 3': [40, 60, 0, 10, 10, 100, 30],
    'Loc 4': [80, 80, 50, 90, 10, 0, 20],
    'Loc 5': [60, 90, 30, 20, 0, 10, 0],
})

df2 = (
    df
    .melt(id_vars='Date', var_name='Missing column')
    .loc[lambda x: x['value'] == 0]
    .sort_values(['Date', 'Missing column'])
    .drop('value', axis='columns')
    # rename needs columns=...; a bare dict would target index labels instead
    .rename(columns={'Date': 'Missing Date'})
    .reset_index(drop=True)
)
>>> df2
  Missing Date Missing column
0   2019-01-01          Loc 2
1   2019-01-03          Loc 3
2   2019-01-05          Loc 2
3   2019-01-05          Loc 5
4   2019-01-06          Loc 4
5   2019-01-07          Loc 5

I'm the crazy answer.

You can use this for the dates:

# pd.np was removed in pandas 2.0, so import NumPy directly
import numpy as np

new_dates = np.repeat(df.index, df.eq(0).sum(axis=1).values)

Replace df.index with df['Date'] if necessary.

And for the values:

# np.NaN was removed in NumPy 2.0; np.nan works in all versions
cols = np.where(df.eq(0), df.columns, np.nan)
new_cols = cols[pd.notnull(cols)]

Finally,

new_df = pd.DataFrame(new_cols, index=new_dates, columns=['Missing column'])

Alternatively, you can create a new column instead of an index.

Now, how does that work?

new_dates takes the index and repeats each value as many times as there are True values in that row of df.eq(0), that is, once per zero. Summing the mask over each row gives that count, because each True counts as 1.

Next, I call a filter that gives the column name if the value is zero, and NaN otherwise.

Finally, we keep only the non-NaN values and put them in an array, which we end up using to build your answer.
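
The intermediate values can be inspected on a tiny frame. This is only a sketch under assumed data: three dates taken from the question used as the index, two of the location columns, and helper names is_zero and counts that are not part of the answer.

```python
import numpy as np
import pandas as pd

# Assumed small example: dates as the index, two value columns.
df = pd.DataFrame(
    {'Loc 2': [0, 80, 0], 'Loc 5': [60, 90, 0]},
    index=['1-Jan-19', '2-Jan-19', '5-Jan-19'],
)

is_zero = df.eq(0)                        # boolean mask, True where the cell is 0
counts = is_zero.sum(axis=1).values       # number of zeros in each row
new_dates = np.repeat(df.index, counts)   # each date repeated once per zero

# Column name where the cell is zero, NaN elsewhere (an object array).
cols = np.where(is_zero, df.columns, np.nan)
new_cols = cols[pd.notnull(cols)]         # flattened row by row, NaN dropped

new_df = pd.DataFrame(new_cols, index=new_dates, columns=['Missing column'])
print(counts)      # [1 0 2]
print(new_df)
```

Because the boolean mask flattens cols row by row, the column names come out in the same order as the repeated dates, which is what lets the two arrays be paired up.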

NB: I used this toy data as an example (with NumPy imported as np):

df = pd.DataFrame(
    {
        "A": np.random.randint(0, 3, 20),
        "B": np.random.randint(0, 3, 20),
        "C": np.random.randint(0, 3, 20),
        "D": np.random.randint(0, 3, 20),
    },
    index=pd.date_range("2019-01-01", periods=20, freq="D"),
)

I managed to solve this with iterrows().

import pandas as pd
df = pd.read_csv('data.csv')

cols = ['Missing Date','Missing Column']
data_points = []

for index, row in df.iterrows():
    for c in df.columns:
        if row[c] == 0:
            data_points.append([row['Date'],c])

df_final = pd.DataFrame(data_points, columns=cols)
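
Because data.csv is not shown, the loop can be tried on the question's table built inline. The sketch below assumes that table with string dates and, as a small variation, skips the Date column explicitly with df.columns.drop('Date') so that only location readings are compared with zero.

```python
import pandas as pd

# Assumed stand-in for data.csv: the table from the question.
df = pd.DataFrame({
    'Date': ['1-Jan-19', '2-Jan-19', '3-Jan-19', '4-Jan-19',
             '5-Jan-19', '6-Jan-19', '7-Jan-19'],
    'Loc 1': [50, 60, 80, 90, 80, 100, 20],
    'Loc 2': [0, 80, 20, 20, 0, 90, 10],
    'Loc 3': [40, 60, 0, 10, 10, 100, 30],
    'Loc 4': [80, 80, 50, 90, 10, 0, 20],
    'Loc 5': [60, 90, 30, 20, 0, 10, 0],
})

cols = ['Missing Date', 'Missing Column']
data_points = []

for _, row in df.iterrows():
    # Only the location columns can hold a zero reading.
    for c in df.columns.drop('Date'):
        if row[c] == 0:
            data_points.append([row['Date'], c])

df_final = pd.DataFrame(data_points, columns=cols)
print(df_final)
```

Rows are visited date by date, so the result is already ordered like the desired output, at the cost of a Python-level loop over every cell.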
