How to filter dataframe rows between two rows that contain specific strings in a column?
I am trying to understand how to select only those rows in my dataframe that are between two specific rows. These rows contain two specific strings in one of the columns. I will explain further with this example.
I have the following dataframe:
String Value
-------------------------
0 Blue 45
1 Red 35
2 Green 75
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
9 Yellow 22
10 Red 14
There is only one instance of "Start" and only one instance of "End" in the "String" column. I only want the rows of this dataframe from the row that contains "Start" through the row that contains "End" in the "String" column (both included), so I want to produce this output dataframe:
String Value
-------------------------
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
Also, I want to preserve the original order of the rows I keep: "Start", "Orange", "Purple", "Teal", "Indigo", "End".
I know I can get the index of these specific rows by doing:
index_start = df.index[df['String'] == 'Start']
index_end = df.index[df['String'] == 'End']
But I am not sure how to actually filter out all rows that are not between these two strings. How can I accomplish this in Python?
This should be enough. iloc[] selects rows by integer position and slices the same way lists do, so the end position needs + 1 to include the "End" row. It works here because the dataframe has the default integer index, where labels and positions coincide.
index_start = df.index[df['String'] == 'Start']
index_end = df.index[df['String'] == 'End']
df.iloc[index_start[0]:index_end[0]+1]
More information: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
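The snippet above passes index labels to iloc[], which only lines up when the labels equal the row positions. As a minimal sketch, assuming a dataframe whose index does not start at 0 (the 100..110 index below is made up for illustration), either slice by label with loc[] or compute positions explicitly:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "String": ["Blue", "Red", "Green", "Start", "Orange", "Purple",
                   "Teal", "Indigo", "End", "Yellow", "Red"],
        "Value": [45, 35, 75, 65, 33, 65, 34, 44, 32, 22, 14],
    },
    index=range(100, 111),  # illustrative non-default index (assumption)
)

# Labels of the marker rows (assumes exactly one match each).
start_label = df.index[df["String"] == "Start"][0]
end_label = df.index[df["String"] == "End"][0]

# .loc slices by label and includes both endpoints, so no "+ 1" is needed.
by_label = df.loc[start_label:end_label]

# Positional alternative that does not depend on the index labels:
# argmax on a boolean array gives the position of the first True.
start_pos = df["String"].eq("Start").to_numpy().argmax()
end_pos = df["String"].eq("End").to_numpy().argmax()
by_position = df.iloc[start_pos:end_pos + 1]

print(by_label)
```

Both results keep the original index labels and row order.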
If both values are present, you can temporarily set "String" as the index and slice by label:
df.set_index('String').loc['Start':'End'].reset_index()
Output:
String Value
0 Start 65
1 Orange 33
2 Purple 65
3 Teal 34
4 Indigo 44
5 End 32
Alternatively, using isin (then the order of Start/End doesn't matter):
m = df['String'].isin(['Start', 'End']).cumsum().eq(1)
df[m | m.shift(fill_value=False)]
Output:
String Value
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
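To see why the isin/cumsum mask works, it may help to look at the intermediate steps. The following is a sketch with illustrative variable names (is_marker, counter, inside are not from the answer), and it uses shift(fill_value=False) so the shifted mask stays boolean:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "String": ["Blue", "Red", "Green", "Start", "Orange", "Purple",
                   "Teal", "Indigo", "End", "Yellow", "Red"],
        "Value": [45, 35, 75, 65, 33, 65, 34, 44, 32, 22, 14],
    }
)

# True only on the two marker rows, whichever comes first.
is_marker = df["String"].isin(["Start", "End"])

# Running count of markers seen so far: 0 before the first marker,
# 1 from the first marker up to the row before the second, 2 afterwards.
counter = is_marker.cumsum()

# Rows from the first marker up to (not including) the second marker.
inside = counter.eq(1)

# Shifting down by one row and OR-ing adds the second marker row itself.
mask = inside | inside.shift(fill_value=False)

result = df[mask]
print(result)
```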
You can build a boolean mask using eq + cummax and filter:
out = df[df['String'].eq('Start').cummax() & df.loc[::-1, 'String'].eq('End').cummax()]
Output:
String Value
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
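The one-liner combines two running flags. Here is a sketch that splits it into named steps (after_start and before_end are illustrative names); unlike the one-liner, it reverses the second flag back into the original row order so the two masks can be compared side by side:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "String": ["Blue", "Red", "Green", "Start", "Orange", "Purple",
                   "Teal", "Indigo", "End", "Yellow", "Red"],
        "Value": [45, 35, 75, 65, 33, 65, 34, 44, 32, 22, 14],
    }
)

# cummax on booleans stays False until the first True, then True forever:
# True from the "Start" row downwards.
after_start = df["String"].eq("Start").cummax()

# Same idea on the reversed column: True from the "End" row upwards.
# The trailing [::-1] restores the original row order.
before_end = df["String"][::-1].eq("End").cummax()[::-1]

# A row is kept only when it is at or after "Start" and at or before "End".
mask = after_start & before_end

out = df[mask]
print(out)
```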
Since your code already returns the index values, you can pass them to iloc as scalars (add 1 to the end position so the "End" row is included):
df.iloc[index_start.item(): index_end.item() + 1]
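Note that .item() raises a ValueError unless the index holds exactly one element, so this approach relies on "Start" and "End" each appearing once. A minimal sketch of a helper that makes this check explicit and works on row positions; slice_between is a hypothetical name, not a pandas function:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "String": ["Blue", "Red", "Green", "Start", "Orange", "Purple",
                   "Teal", "Indigo", "End", "Yellow", "Red"],
        "Value": [45, 35, 75, 65, 33, 65, 34, 44, 32, 22, 14],
    }
)


def slice_between(frame, column, start, end):
    """Hypothetical helper: rows from the `start` marker through the `end` marker."""
    # Integer positions (not labels) of the rows matching each marker.
    start_pos = np.flatnonzero(frame[column].eq(start).to_numpy())
    end_pos = np.flatnonzero(frame[column].eq(end).to_numpy())
    if len(start_pos) != 1 or len(end_pos) != 1:
        raise ValueError("expected exactly one start marker and one end marker")
    # iloc excludes the stop position, hence the + 1.
    return frame.iloc[start_pos[0]:end_pos[0] + 1]


result = slice_between(df, "String", "Start", "End")
print(result)
```

If the end marker comes before the start marker, this sketch returns an empty dataframe rather than raising.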