
Filter pandas dataframe by row with regex

I'm sure there is a simple solution, but I'm quite new to Python. I have a Pandas DataFrame with strings and NaN values. In this DataFrame I want to search for particular parts of strings. This should be done row by row, and the matches should be written to a list with the same number of rows as the DataFrame (that is, if the partial string I am looking for cannot be matched in a row, the entry in the list should be 'none').

I tried result.loc[result[0].str.contains("hello", na=False)], but this only returns the rows where the first column contains the word hello.

I was thinking about a for loop that searches every row with regular expressions:

row = df.iloc[0:100]
for item in row:
    row_dict={}
    hello = re.search(r"hello.*", item)
    if hello is None:
       hello = "NaN"

Maybe there is also a simpler way? Thank you!

For test purposes, I defined the source DataFrame as:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[
    ['Halo Mike', 'How are you?', np.nan],
    ['Hello John', 'Good morning', 'What a nice day'],
    ['Ello Jack', 'Xyz hello abc', np.nan]])

As you can see, 2 elements contain hello (ignoring case) and 2 elements are NaN. Column names are not essential here, so I didn't define them.

The first step is to convert this DataFrame into a Series, with NaN values filtered out:

ser = pd.Series(data=df.values.flatten()).dropna()

df.values gets the underlying NumPy array, flatten reshapes it to a 1-D array, and dropna deletes the NaN values.
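To make the intermediate steps visible, here is a small sketch that rebuilds the test DataFrame from above and inspects the flattened array and the resulting Series. The variable name flat is introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Same test data as in the answer.
df = pd.DataFrame(data=[
    ['Halo Mike', 'How are you?', np.nan],
    ['Hello John', 'Good morning', 'What a nice day'],
    ['Ello Jack', 'Xyz hello abc', np.nan]])

# Flatten the 3x3 array row by row into 9 elements (NaN still included).
flat = df.values.flatten()

# Build a Series and drop the NaN entries; the surviving elements
# keep their original positions in the flattened array as index labels.
ser = pd.Series(data=flat).dropna()

print(flat.shape)          # (9,)
print(ser.index.tolist())  # [0, 1, 3, 4, 5, 6, 7]
```

Note that after flattening, the information about which row an element came from is only recoverable from the index (position divided by the number of columns).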

Then, to get the elements of this Series that contain hello (case-insensitive), run:

ser[ser.str.contains('hello', case=False)].tolist()

For our test data, the result is:

['Hello John', 'Xyz hello abc']
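str.contains returns the whole element whenever the pattern occurs somewhere in it. If only the matched fragment is wanted, as with re.search(r"hello.*", item) in the question, one possible variant is Series.str.extract with a capture group. This is a sketch of an alternative, not part of the original answer; the name matched is made up for the example.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[
    ['Halo Mike', 'How are you?', np.nan],
    ['Hello John', 'Good morning', 'What a nice day'],
    ['Ello Jack', 'Xyz hello abc', np.nan]])

ser = pd.Series(data=df.values.flatten()).dropna()

# The single capture group holds the text from "hello" to the end of the
# element; elements without a match become NaN and are dropped afterwards.
matched = (ser.str.extract(r'(hello.*)', flags=re.IGNORECASE, expand=False)
              .dropna()
              .tolist())

print(matched)  # ['Hello John', 'hello abc']
```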

I think this is just what you described in your comment.

For real input data (longer than my example), if you want to limit the search to just the first 100 rows, change df.values to df.head(100).values.
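The flattened approach returns only the matching elements. The question also asks for a list with one entry per row, holding a placeholder where nothing matched. A minimal sketch of that variant follows, assuming the first match in each row is the one wanted and that None is an acceptable placeholder; the helper first_match is hypothetical and not part of pandas.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[
    ['Halo Mike', 'How are you?', np.nan],
    ['Hello John', 'Good morning', 'What a nice day'],
    ['Ello Jack', 'Xyz hello abc', np.nan]])

pattern = re.compile(r'hello.*', flags=re.IGNORECASE)

def first_match(row):
    """Return the first regex match found in a row, or None if there is none."""
    for cell in row:
        # Skip NaN and any other non-string cells.
        if isinstance(cell, str):
            m = pattern.search(cell)
            if m:
                return m.group(0)
    return None

# One entry per row, limited to the first 100 rows as in the question.
result = [first_match(row)
          for row in df.head(100).itertuples(index=False, name=None)]

print(result)  # [None, 'Hello John', 'hello abc']
```

The list has the same length as the (truncated) DataFrame, so it can be assigned back as a new column if that is convenient.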
