Keep only row right after subsequent row meets criteria
I'd like to know how I can keep only the rows for which a subsequent row in the same group meets a certain criterion. The following data illustrates what I am trying to achieve:
The data is sorted by ID in ascending order and by DATE in descending order.
The same ID has at most one row where Purchased = 'N', but can have zero, one, or more than one row where Purchased = 'Y'.
I want to track the dates on which the EMPTY status changes:
ID EMPTY DATE
1 Y 03/01/2017
1 Y 02/01/2017
1 N 01/01/2017
2 Y 03/01/2017
3 N 03/01/2017
4 Y 03/01/2017
4 N 03/01/2017
4 Y 03/01/2017
4 Y 03/01/2017
Output:
I want to keep all the rows with EMPTY = 'N':
ID EMPTY DATE
1 Y 02/01/2017
1 N 01/01/2017
2 Y 03/01/2017
3 N 03/01/2017
4 Y 03/01/2017
4 N 03/01/2017
I can use either sql or python to do this, so solutions in either or both languages are welcome!
In case you are actually interested in using R:
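library(dplyr)
df %>%
  # next.empty holds the EMPTY value of the following row (NA for the last row)
  mutate(next.empty = lead(EMPTY, 1)) %>%
  # keep rows whose status differs from the next row; the NA comparison drops the last row
  filter(next.empty != EMPTY) %>%
  select(-next.empty)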
# ID EMPTY DATE
#1 1 Y 02/01/2017
#2 1 N 01/01/2017
#3 2 Y 03/01/2017
#4 3 N 03/01/2017
#5 4 Y 03/01/2017
#6 4 N 03/01/2017
Data:
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 4L, 4L, 4L, 4L), EMPTY = structure(c(2L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L), .Label = c("N", "Y"), class = "factor"),
DATE = structure(c(3L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("01/01/2017",
"02/01/2017", "03/01/2017"), class = "factor")), .Names = c("ID",
"EMPTY", "DATE"), class = "data.frame", row.names = c(NA, -9L))
One way with dplyr in R:
library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(n() == 1 | (cumsum(cumsum(EMPTY == "N")) < 2 & !duplicated(EMPTY)))
# A tibble: 6 x 3
# Groups: ID [4]
# ID EMPTY DATE
# <int> <chr> <chr>
#1 1 Y 03/01/2017
#2 1 N 01/01/2017
#3 2 Y 03/01/2017
#4 3 N 03/01/2017
#5 4 Y 03/01/2017
#6 4 N 03/01/2017
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 4L, 4L, 4L, 4L), EMPTY = c("Y",
"Y", "N", "Y", "N", "Y", "N", "Y", "Y"), DATE = c("03/01/2017",
"02/01/2017", "01/01/2017", "03/01/2017", "03/01/2017", "03/01/2017",
"03/01/2017", "03/01/2017", "03/01/2017")), .Names = c("ID",
"EMPTY", "DATE"), class = "data.frame", row.names = c(NA, -9L
))
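Since the question asks for python, here is a rough pandas rendering of the same grouped filter. It is only a sketch: the variable names (`single_row`, `n_seen`, `first_of_status`, `kept`) are made up for illustration, and the frame is rebuilt from the sample data in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3, 4, 4, 4, 4],
    "EMPTY": ["Y", "Y", "N", "Y", "N", "Y", "N", "Y", "Y"],
    "DATE": ["03/01/2017", "02/01/2017", "01/01/2017"] + ["03/01/2017"] * 6,
})

by_id = df["ID"]

# n() == 1: the ID has exactly one row.
single_row = df.groupby("ID")["EMPTY"].transform("size") == 1

# cumsum(cumsum(EMPTY == "N")) < 2: true up to and including the first 'N' of each ID.
n_seen = (df["EMPTY"] == "N").astype(int).groupby(by_id).cumsum()
n_seen_twice = n_seen.groupby(by_id).cumsum()

# !duplicated(EMPTY) within each ID: first occurrence of each status.
first_of_status = ~df.duplicated(subset=["ID", "EMPTY"])

kept = df[single_row | ((n_seen_twice < 2) & first_of_status)]
print(kept)
```

Like the dplyr version, this keeps the first 'Y' row of each ID (03/01/2017 for ID 1), not the 'Y' row directly before the 'N'.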
In my experience this is a much prettier task in R, but since you are looking for a python solution:
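import pandas as pd
# ids, empty and date are assumed to be lists holding the three columns
data = {'id': ids, 'empty': empty, 'date': date}  # named data to avoid shadowing the built-in dict
df1 = pd.DataFrame(data)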
After loading into a pd dataframe by method of your choice:
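# value of 'empty' in the next row; NaN for the last row, which has no successor
df1['empty_+1'] = df1['empty'].shift(-1)
# flag rows whose status differs from the next row; the last row is never flagged
df1['check'] = df1['empty_+1'].notna() & (df1['empty'] != df1['empty_+1'])
df1.loc[df1['check']]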
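The comparison above looks at the next row regardless of which ID it belongs to. A group-aware variant might look like the sketch below. It assumes the rule implied by the expected output: keep every 'N' row, keep a 'Y' row when the next row of the same ID is 'N', and keep IDs that have only a single row. The function name `keep_status_changes` and the upper-case column names are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3, 4, 4, 4, 4],
    "EMPTY": ["Y", "Y", "N", "Y", "N", "Y", "N", "Y", "Y"],
    "DATE": ["03/01/2017", "02/01/2017", "01/01/2017"] + ["03/01/2017"] * 6,
})

def keep_status_changes(frame):
    """Hypothetical helper: filter rows per the assumed rule described above."""
    # EMPTY value of the following row within the same ID (NaN at the end of each ID).
    next_empty = frame.groupby("ID")["EMPTY"].shift(-1)
    is_n = frame["EMPTY"] == "N"
    y_before_n = (frame["EMPTY"] == "Y") & (next_empty == "N")
    # Assumption: an ID with a single row is always kept (as ID 2 is in the expected output).
    only_row = frame.groupby("ID")["EMPTY"].transform("size") == 1
    return frame[is_n | y_before_n | only_row]

result = keep_status_changes(df)
print(result)
```

Because the shift happens inside `groupby("ID")`, a 'Y' row at the end of one ID is never compared with the first row of the next ID.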
In mysql, one approach is to:
1) add an auto-increment row id to the table
ALTER TABLE table1 ADD row_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY;
2) left join the table to itself, shifted by one row
3) add selection conditions: (i) the current row has EMPTY = 'N', or (ii) the current row has EMPTY = 'Y' and the next row has EMPTY = 'N'
SELECT a.ID, a.Empty, a.Day
FROM table1 a
LEFT JOIN table1 b ON a.row_id + 1 = b.row_id
WHERE a.Empty = 'N' or (a.Empty = 'Y' and b.Empty = 'N')
RESULT
ID Empty Day
1 Y 02/01/2017
1 N 01/01/2017
2 Y 03/01/2017
3 N 03/01/2017
4 Y 03/01/2017
4 N 03/01/2017
DATA
CREATE TABLE table1 (ID int, EMPTY varchar(255), DAY varchar(255));
INSERT table1 VALUES (1,'Y','03/01/2017'),(1,'Y','02/01/2017'),(1,'N','01/01/2017'),(2,'Y','03/01/2017'),(3,'N','03/01/2017'),(4,'Y','03/01/2017'),(4,'N','03/01/2017'),(4,'Y','03/01/2017'),(4,'Y','03/01/2017');
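For anyone who wants to try the self-join from python without a MySQL server, the three steps can be sketched with the standard-library sqlite3 module. This swaps MySQL for an in-memory SQLite database, so the auto-increment column is declared in CREATE TABLE (SQLite syntax) instead of being added with ALTER TABLE; the query itself is unchanged apart from an ORDER BY.

```python
import sqlite3

# Sample rows copied from the DATA section above.
rows = [
    (1, "Y", "03/01/2017"), (1, "Y", "02/01/2017"), (1, "N", "01/01/2017"),
    (2, "Y", "03/01/2017"), (3, "N", "03/01/2017"), (4, "Y", "03/01/2017"),
    (4, "N", "03/01/2017"), (4, "Y", "03/01/2017"), (4, "Y", "03/01/2017"),
]

conn = sqlite3.connect(":memory:")
# Step 1: a table with an auto-incrementing row id (SQLite syntax, an assumption).
conn.execute(
    "CREATE TABLE table1 ("
    "row_id INTEGER PRIMARY KEY AUTOINCREMENT, ID INTEGER, EMPTY TEXT, DAY TEXT)"
)
conn.executemany("INSERT INTO table1 (ID, EMPTY, DAY) VALUES (?, ?, ?)", rows)

# Steps 2 and 3: join each row to the one after it and apply the two conditions.
query = """
SELECT a.ID, a.EMPTY, a.DAY
FROM table1 a
LEFT JOIN table1 b ON a.row_id + 1 = b.row_id
WHERE a.EMPTY = 'N' OR (a.EMPTY = 'Y' AND b.EMPTY = 'N')
ORDER BY a.row_id
"""
kept = conn.execute(query).fetchall()
conn.close()

for row in kept:
    print(row)
```

Note that the join pairs consecutive rows without checking that they share an ID: the single 'Y' row of ID 2 is kept here only because the next row, which belongs to ID 3, happens to be 'N'.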