
Dropping rows in dataframe based on row value

I have a few Word documents which I turned into strings before reading them into dataframes. Each dataframe is only one column wide but many rows long. They all look something like this:

0| this document is a survey
1| please fill in fully
2| Send back to address on the bottom of the sheet
etc....

The start of each dataframe is full of gibberish which I don't need, so I need to delete all the rows before the row which contains the value 'Questions'. However, that row doesn't sit at the same index in each dataframe, so I can't just delete the first 20 rows, because that would have a different effect on each dataframe.

How could I delete all the rows before 'Questions' in each dataframe?

Assuming you only need to keep the rows from the first occurrence of 'Questions' onward, this approach should do the trick:

Dummy Data and Setup

import pandas as pd

data = {
    'x': [
          'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'
    ]
}

df = pd.DataFrame(data)
df

Output:

    x
0   a
1   b
2   c
3   d
4   e
5   f
6   g
7   h
8   i
9   j
10  k

Solution

Here I'll keep all rows from the first occurrence of an entry that starts with the letter 'f' onward (that row included):

df[df.x.str.startswith('f').cumsum() > 0]

Output:

    x
5   f
6   g
7   h
8   i
9   j
10  k
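
As a rough sketch of how the same one-liner might look on data shaped like the question's, here is a made-up one-column dataframe. The column name `text` and the row contents are assumptions for illustration only, and the sketch assumes the target row starts with 'Questions':

```python
import pandas as pd

# Hypothetical one-column dataframe resembling the question's data;
# the column name "text" and the rows are invented for illustration.
survey = pd.DataFrame({
    "text": [
        "this document is a survey",
        "please fill in fully",
        "Questions",
        "1. How old are you?",
        "2. Where do you live?",
    ]
})

# True from the first row that starts with "Questions" onward
mask = survey["text"].str.startswith("Questions").cumsum() > 0
trimmed = survey[mask]
print(trimmed)
```

Note that the original index labels are kept; `trimmed.reset_index(drop=True)` would renumber the rows from zero if that is preferred.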

Explanation

The solution relies on two main pandas features:

  1. pd.Series.str.startswith (reached here through the column, df.x.str.startswith), to get a boolean Series with True for any cell that starts with a given string ('f' in this example, but 'Questions' will also work).
  2. cumsum(), which casts the boolean values to integers and accumulates them, so that every row from the first occurrence onward has a value greater than zero.

By using the resulting boolean mask to index the original dataframe, we obtain the solution.
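
To make the two steps easier to follow, the intermediate values can be laid out side by side. This is only an illustrative breakdown of the one-liner on the dummy data; the names `starts_with_f`, `running_total` and `steps` are not from the original answer:

```python
import pandas as pd

df = pd.DataFrame({"x": list("abcdefghijk")})

# Step 1: boolean Series, True only for the row that starts with 'f'
starts_with_f = df["x"].str.startswith("f")

# Step 2: running total of the booleans (0 before the match, 1 from it onward)
running_total = starts_with_f.cumsum()

# Final mask used to index the dataframe
mask = running_total > 0

# Put the intermediate columns next to each other for inspection
steps = pd.DataFrame({
    "x": df["x"],
    "starts_with_f": starts_with_f,
    "cumsum": running_total,
    "mask": mask,
})
print(steps)
```

If several rows matched, the running total would keep increasing past 1, but the `> 0` comparison would still select everything from the first match onward.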

Another alternative is to use str.contains(). Using a toy pandas Series:

import pandas as pd

# create a toy Series (it is a Series, even though it is named df)
d = ["nothing", "target is here", "help", "more_words"]
df = pd.Series(data=d)

If you wanted to keep all rows from the one containing a given word, say "here", onward (inclusive), you could do so by:

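# check rows to determine whether they contain "here"
# (na=False treats missing values as non-matches)
keyword_bool = df.str.contains("here", regex=False, na=False)
# position (not index label) of the first match, so it is safe to use with iloc
# note: if nothing matches, argmax() returns 0 and every row is kept
idx = keyword_bool.to_numpy().argmax()

# slice the Series from the first match onward
df = df.iloc[idx:]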
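
Since the question involves several dataframes, the slicing idea could be wrapped in a small function. The following is a minimal sketch, not part of the original answers: `drop_before_keyword` is a hypothetical helper name, and returning the input unchanged when nothing matches is an assumed policy that you may want to change.

```python
import pandas as pd


def drop_before_keyword(series: pd.Series, keyword: str) -> pd.Series:
    """Hypothetical helper: keep rows from the first one containing `keyword` onward."""
    # na=False treats missing values as non-matches
    hits = series.str.contains(keyword, regex=False, na=False).to_numpy()
    if not hits.any():
        # Assumption: with no match, return the Series unchanged
        return series
    first_pos = hits.argmax()  # position of the first True
    return series.iloc[first_pos:]


# Toy Series with a non-default index, to show that positions are used, not labels
s = pd.Series(
    ["nothing", "target is here", "help", "more_words"],
    index=[10, 20, 30, 40],
)
result = drop_before_keyword(s, "here")
print(result)
```

For a one-column dataframe, the same helper could be applied to the column, for example `drop_before_keyword(frame["text"], "Questions")`, where `frame` and `text` stand in for your own names.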
