Filtering specific data points from a dataframe based on given conditions
I have a DataFrame like the one below:
+----------+-------+-------+-------+-------+-------+
| Date | Loc 1 | Loc 2 | Loc 3 | Loc 4 | Loc 5 |
+----------+-------+-------+-------+-------+-------+
| 1-Jan-19 | 50 | 0 | 40 | 80 | 60 |
| 2-Jan-19 | 60 | 80 | 60 | 80 | 90 |
| 3-Jan-19 | 80 | 20 | 0 | 50 | 30 |
| 4-Jan-19 | 90 | 20 | 10 | 90 | 20 |
| 5-Jan-19 | 80 | 0 | 10 | 10 | 0 |
| 6-Jan-19 | 100 | 90 | 100 | 0 | 10 |
| 7-Jan-19 | 20 | 10 | 30 | 20 | 0 |
+----------+-------+-------+-------+-------+-------+
I want to extract every data point (row label and column label) whose value is zero and produce a new DataFrame.
My desired output is as below:
+--------------+----------------+
| Missing Date | Missing column |
+--------------+----------------+
| 1-Jan-19 | Loc 2 |
| 3-Jan-19 | Loc 3 |
| 5-Jan-19 | Loc 2 |
| 5-Jan-19 | Loc 5 |
| 6-Jan-19 | Loc 4 |
| 7-Jan-19 | Loc 5 |
+--------------+----------------+
Note that on 5-Jan-19 there are two entries, Loc 2 and Loc 5.
I know how to do this in Excel VBA, but I'm looking for a more scalable solution with python-pandas.
So far I have attempted the code below:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = pd.DataFrame(columns=['Missing Date','Missing Column'])
for c in df.columns:
    if c != 'Date':
        if df[df[c] == 0]:
            new_df.append(df[c].index, c)
I'm new to pandas, so please guide me on how to solve this issue.
melt + query
(df.melt(id_vars='Date', var_name='Missing column')
.query('value == 0')
.drop(columns='value')
)
Date Missing column
7 1-Jan-19 Loc 2
11 5-Jan-19 Loc 2
16 3-Jan-19 Loc 3
26 6-Jan-19 Loc 4
32 5-Jan-19 Loc 5
34 7-Jan-19 Loc 5
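To make the intermediate step visible, here is a minimal sketch on a three-row subset of the question's data (the subset and the names long_df and zeros are assumptions for illustration). melt turns the wide table into one row per (Date, column) pair, which is why the surviving index labels in the output above are not consecutive.

```python
import pandas as pd

# Three-row, three-location subset of the question's table (illustrative only).
df = pd.DataFrame({
    'Date': ['1-Jan-19', '2-Jan-19', '3-Jan-19'],
    'Loc 1': [50, 60, 80],
    'Loc 2': [0, 80, 20],
    'Loc 3': [40, 60, 0],
})

# Wide -> long: one row per (Date, location) pair, with the cell in 'value'.
long_df = df.melt(id_vars='Date', var_name='Missing column')

# Keep only the zero cells, then drop the now-constant 'value' column.
zeros = long_df.query('value == 0').drop(columns='value')

print(long_df)
print(zeros)
```

The index of zeros still refers to positions in the melted frame; add .reset_index(drop=True) if a clean 0..n-1 index is wanted.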
Melt the dataframe using the date column as id_vars, then filter where the value is zero (e.g. using .loc[lambda x: x['value'] == 0]). Now it is just clean-up: sort the values on Date and Missing column, drop the value column (its entries are all zero), and rename Date to Missing Date.
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('2019-1-1', '2019-1-7'),
    'Loc 1': [50, 60, 80, 90, 80, 100, 20],
    'Loc 2': [0, 80, 20, 20, 0, 90, 10],
    'Loc 3': [40, 60, 0, 10, 10, 100, 30],
    'Loc 4': [80, 80, 50, 90, 10, 0, 20],
    'Loc 5': [60, 90, 30, 20, 0, 10, 0],
})
df2 = (
    df
    .melt(id_vars='Date', var_name='Missing column')
    .loc[lambda x: x['value'] == 0]
    .sort_values(['Date', 'Missing column'])
    .drop('value', axis='columns')
    # rename() targets index labels by default, so name the axis explicitly
    .rename(columns={'Date': 'Missing Date'})
    .reset_index(drop=True)
)
>>> df2
  Missing Date Missing column
0   2019-01-01          Loc 2
1   2019-01-03          Loc 3
2   2019-01-05          Loc 2
3   2019-01-05          Loc 5
4   2019-01-06          Loc 4
5   2019-01-07          Loc 5
Here is a slightly crazy answer.
For the dates, you can use:
# pd.np was removed in pandas 2.0, so import NumPy directly
import numpy as np
new_dates = np.repeat(df.index, df.eq(0).sum(axis=1).values)
Replace df.index with df['Date'] if necessary.
And for the values:
cols = np.where(df.eq(0), df.columns, np.nan)
new_cols = cols[pd.notnull(cols)]
Finally,
new_df = pd.DataFrame(new_cols, index=new_dates, columns=['Missing column'])
Alternatively, you can create a new column instead of an index.
Now, how does that work?
new_dates takes the index and repeats each value as many times as there are True values in that row. A cell is True wherever df.eq(0) holds, and since each True counts as 1, summing over each row gives the number of zeros in it.
Next, I call a filter that gives the column name if the value is zero and NaN otherwise.
Finally, we keep only the non-NaN values and put them in an array, which we then use to build the answer.
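As a worked example, here is a sketch of the same three steps applied to the question's data, assuming the dates have been moved into the index (the names is_zero and zero_counts are introduced here for illustration). The masked array is flattened row by row, so the column names line up with the repeated dates.

```python
import numpy as np
import pandas as pd

# The question's table, with the dates as the index (an assumption of this sketch).
df = pd.DataFrame(
    {
        'Loc 1': [50, 60, 80, 90, 80, 100, 20],
        'Loc 2': [0, 80, 20, 20, 0, 90, 10],
        'Loc 3': [40, 60, 0, 10, 10, 100, 30],
        'Loc 4': [80, 80, 50, 90, 10, 0, 20],
        'Loc 5': [60, 90, 30, 20, 0, 10, 0],
    },
    index=pd.date_range('2019-01-01', periods=7, freq='D'),
)

is_zero = df.eq(0)

# Step 1: number of zeros per row, used to repeat each date that many times.
zero_counts = is_zero.sum(axis=1).to_numpy()
new_dates = np.repeat(df.index, zero_counts)

# Step 2: column name where the cell is zero, NaN elsewhere.
cols = np.where(is_zero.to_numpy(), df.columns.to_numpy(), np.nan)

# Step 3: keep the non-NaN entries; boolean indexing flattens row by row.
new_cols = cols[pd.notnull(cols)]

new_df = pd.DataFrame({'Missing column': new_cols}, index=new_dates)
print(new_df)
```

Rows without any zero get a repeat count of 0 and simply vanish from new_dates, which is what keeps the two arrays the same length.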
NB: I used this toy data as an example:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": np.random.randint(0, 3, 20),
        "B": np.random.randint(0, 3, 20),
        "C": np.random.randint(0, 3, 20),
        "D": np.random.randint(0, 3, 20),
    },
    index=pd.date_range("2019-01-01", periods=20, freq="D"),
)
I managed to solve this with iterrows().
import pandas as pd

df = pd.read_csv('data.csv')
cols = ['Missing Date', 'Missing Column']
data_points = []
# Visit every cell and record (date, column name) wherever the value is zero
for index, row in df.iterrows():
    for c in df.columns:
        if row[c] == 0:
            data_points.append([row['Date'], c])
df_final = pd.DataFrame(data_points, columns=cols)
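As a sanity check, the sketch below runs the loop on the question's table built in memory (instead of data.csv, which is not available here) and compares it with a melt-based pipeline. The names loop_df, melt_df and same are made up for this example. The two approaches emit rows in a different order, so the comparison is done on sets of (date, column) pairs.

```python
import pandas as pd

# The question's table, built in memory instead of read from data.csv.
df = pd.DataFrame({
    'Date': ['1-Jan-19', '2-Jan-19', '3-Jan-19', '4-Jan-19',
             '5-Jan-19', '6-Jan-19', '7-Jan-19'],
    'Loc 1': [50, 60, 80, 90, 80, 100, 20],
    'Loc 2': [0, 80, 20, 20, 0, 90, 10],
    'Loc 3': [40, 60, 0, 10, 10, 100, 30],
    'Loc 4': [80, 80, 50, 90, 10, 0, 20],
    'Loc 5': [60, 90, 30, 20, 0, 10, 0],
})
cols = ['Missing Date', 'Missing Column']

# Loop version: row by row, skipping the Date column explicitly.
data_points = []
for _, row in df.iterrows():
    for c in df.columns.drop('Date'):
        if row[c] == 0:
            data_points.append([row['Date'], c])
loop_df = pd.DataFrame(data_points, columns=cols)

# Vectorised version: melt to long format, then keep the zero cells.
melt_df = (
    df.melt(id_vars='Date', var_name='Missing Column')
    .query('value == 0')
    .drop(columns='value')
    .rename(columns={'Date': 'Missing Date'})
)

# Row order differs (row-major vs column-major), so compare as sets of pairs.
same = set(map(tuple, loop_df.to_numpy())) == set(map(tuple, melt_df.to_numpy()))
print(loop_df)
print(same)
```

The loop is fine for a table this small; on larger frames the melt version is usually the better fit because it avoids per-cell Python work.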