
Usage of str.contains() applied to pandas data frame

I am new to Python and Jupyter Notebook and I am currently following this tutorial: https://www.dataquest.io/blog/jupyter-notebook-tutorial/. So far I've imported the pandas library and a couple of other things, and I've made a data frame 'df', which is just a CSV file of company profit and revenue data. I'm having trouble understanding the following line of the tutorial:

non_numberic_profits = df.profit.str.contains('[^0-9.-]')

I understand the point of what the tutorial is doing: identifying all the companies whose profit value contains a string instead of a number. But I don't understand the point of [^0-9.-] and how the above function actually works.

My full code is below. Thanks.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

df = pd.read_csv('fortune500.csv')
df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()

The expression [^0-9.-] is a so-called regular expression, which is a special text string for describing a search pattern. With regular expressions (or 'RegEx' for short) you can, among other things, extract specific parts of a string. For example, you can extract foo from the string 123foo456.
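As a minimal sketch of that extraction, here is one way to do it with Python's built-in re module. The pattern [a-z]+ (one or more lowercase letters) is chosen for this illustration and is not part of the tutorial.

```python
import re

# Search for the first run of lowercase letters in a mixed string.
match = re.search(r"[a-z]+", "123foo456")

# re.search returns None when nothing matches, so guard before reading the text.
extracted = match.group(0) if match else None
print(extracted)
```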

In RegEx, [] defines a character class: a set of characters, any one of which may match at that position. For example, [bac] matches each of the characters a, b and c in the string abcdefg. [bac] could also be written as the range [a-c].

Putting ^ directly after the opening bracket negates the class. Thus, the RegEx [^a-c] applied to the above example would match each of d, e, f and g.
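The behaviour of character classes and negated classes can be checked with a short sketch using Python's re module. The variable names are made up for this example; re.findall returns every non-overlapping match, so each matched character shows up as its own list item.

```python
import re

text = "abcdefg"

# A class matches one character at a time; the order inside [] does not matter.
in_class = re.findall(r"[bac]", text)

# The same class written as a range.
as_range = re.findall(r"[a-c]", text)

# A leading ^ negates the class: every character that is NOT a, b or c.
negated = re.findall(r"[^a-c]", text)

# Adding + groups consecutive non-matching characters into one match.
negated_run = re.findall(r"[^a-c]+", text)

print(in_class, as_range, negated, negated_run)
```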

Now here is a catch:
Since ^ and - have a special meaning inside [], they have to be put in specific positions in order to be matched literally. Specifically, if you want to match - literally rather than have it define a range, put it at the very end of [] (or at the very start), for example [abc-]. Likewise, ^ only negates the class when it is the first character after [; anywhere else it is matched literally.
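A small sketch of how the position of - and ^ changes what a class matches, again using Python's re module. The sample string a-b^c is invented for this illustration.

```python
import re

sample = "a-b^c"

# A dash at the end of the class is a literal dash.
dash_last = re.findall(r"[abc-]", sample)

# A dash between two characters defines a range (a through c), not a literal dash.
dash_range = re.findall(r"[a-c]", sample)

# A caret that is not first in the class is a literal caret.
caret_literal = re.findall(r"[a^]", sample)

# A caret in first position negates the class: everything except a.
caret_negates = re.findall(r"[^a]", sample)

print(dash_last, dash_range, caret_literal, caret_negates)
```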

Putting it all together
The RegEx '[^0-9.-]' means: 'Match any single character that is not a digit 0 through 9, a dot ( . ) or a dash ( - )'. Inside [] the dot is a literal dot, not the usual 'any character' wildcard.
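To see the pattern in action outside pandas, here is a rough sketch that applies it to a few strings with re.search. The sample values are hypothetical and were picked to resemble what a profit column might hold; they are not taken from the tutorial's CSV.

```python
import re

# Any single character that is not a digit, a dot or a dash.
pattern = re.compile(r"[^0-9.-]")

# Hypothetical sample values, not taken from fortune500.csv.
samples = ["1234.5", "-45.1", "N.A.", "12,345", ""]

# re.search looks for the pattern anywhere in the string.
flags = {s: bool(pattern.search(s)) for s in samples}

for value, has_other_chars in flags.items():
    print(repr(value), has_other_chars)
```

Strings made up only of digits, dots and dashes (including the empty string) are not flagged; a single letter or comma is enough to flag a value.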

The pandas call df.profit.str.contains('[^0-9.-]') checks, for each string in the profit column of your DataFrame, whether this RegEx matches anywhere in it, and returns True if it does and False if it doesn't. In other words, a value is flagged True as soon as it contains at least one character that is not a digit, a dot or a dash. The result is a pandas Series containing the resulting True / False values.
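The same idea on a tiny in-memory DataFrame, as a sketch rather than a reproduction of the tutorial's data. The company names and profit values are invented, and na=False is an addition made here so that missing values count as False instead of producing a missing value in the mask.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; the real file has more columns and rows.
df = pd.DataFrame({
    "company": ["Acme", "Globex", "Initech", "Umbrella", "Hooli"],
    "profit": ["1234.5", "N.A.", "-45.1", "12,345", None],
})

# True where the string contains any character other than a digit, dot or dash.
# na=False makes missing values evaluate to False so the mask is purely boolean.
non_numeric = df["profit"].str.contains(r"[^0-9.-]", regex=True, na=False)

# Use the boolean Series as a row filter, as the tutorial does with df.loc.
bad_rows = df.loc[non_numeric]

print(non_numeric.tolist())
print(bad_rows["company"].tolist())
```

Without na=False, str.contains leaves a missing value in the result for missing inputs, which cannot be used directly as a boolean mask.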


If you're ever stuck, the pandas docs are your friend. Stack Overflow's 'What Does this Regex Mean?' and Regex 101 are also good places to start.
