Select rows from pandas DataFrame that contain a string starting with an integer

I have created a pandas DataFrame containing one string column. I want to copy some of its rows into a second DataFrame: just the rows where the characters before the first space are an integer greater than or equal to 300, and where the characters after the first space are "Broadway". In the following example, only the first row should be copied.

I would prefer to solve this problem without simply writing a boolean expression in straight Python. Let's pretend I wanted to convince someone of the benefits of using pandas rather than Python without pandas. Thank you very much.


import pandas as pd

d = {
    "address": [
        "300 Broadway",      #Ok.
        "300 Wall Street",   #Sorry, not "Broadway".
        "100-10 Broadway",   #Sorry, "100-10" is not an integer.
        "299 Broadway",      #Sorry, 299 is less than 300.
        "Broadway"           #Sorry, no space at all.
    ]
}

df = pd.DataFrame(d)
df2 = df[what goes here?]   #Broadway addresses greater than or equal to 300
print(df2)

I think it is best to clean up your data a little first, for example:

# prepare data
df[['number', 'street']] = df.address.str.split(r'\s+', n=1, expand=True)
df['number'] = pd.to_numeric(df.number, errors='coerce')

The first line splits the address into number and street at the first run of whitespace. The second converts the number column to a numeric type; values that cannot be parsed as numbers (such as "100-10") are converted to NaN, which is also why the column ends up as float. Then you could do:

# create mask to filter
mask = df.number.ge(300) & df.street.str.contains("Broadway")
print(df[mask])

Basically, this creates a boolean mask that is True where number is greater than or equal to 300 and the street contains "Broadway". Putting it all together, you have:

# prepare data
df[['number', 'street']] = df.address.str.split(r'\s+', n=1, expand=True)
df['number'] = pd.to_numeric(df.number, errors='coerce')

# create mask to filter
mask = df.number.ge(300) & df.street.str.contains("Broadway")
print(df[mask])

Output

        address  number    street
0  300 Broadway   300.0  Broadway

Note that this solution assumes that your data has the pattern: Number Street.
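If the data might not follow that pattern, one option is to let an anchored regular expression enforce it. The following is only a sketch, not part of the answer above: the group names number and street and the variable parts are chosen here for illustration. It requires digits only before the first space and compares the rest of the string to "Broadway" exactly, rather than with a substring match.

```python
import pandas as pd

df = pd.DataFrame({
    "address": [
        "300 Broadway",
        "300 Wall Street",
        "100-10 Broadway",
        "299 Broadway",
        "Broadway",
    ]
})

# Capture the digits before the first space and everything after it.
# Rows that do not match the whole pattern get NaN in both columns.
parts = df["address"].str.extract(r"^(?P<number>\d+) (?P<street>.+)$")

# Convert the captured digits to numbers; NaN stays NaN.
number = pd.to_numeric(parts["number"], errors="coerce")

# NaN compares as False in both conditions, so unmatched rows drop out.
mask = number.ge(300) & parts["street"].eq("Broadway")

df2 = df[mask]
print(df2)
```

Because the original DataFrame is indexed with the mask directly, df2 keeps only the address column and no helper columns are added to df.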

You can use str.contains, str.extract and ge:

# rows which contain broadway
m1 = df['address'].str.contains('(?i)broadway')
# extract the first run of digits from the string and check if it is greater than or equal to 300
m2 = df['address'].str.extract(r'(\d+)')[0].astype(float).ge(300)

# get all the rows which have True for both conditions
df[m1&m2]

Output

        address
0  300 Broadway
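One caveat with this approach: the pattern (\d+) picks up the first run of digits anywhere in the string, so "100-10 Broadway" is rejected in the example only because 100 happens to be below 300. The sketch below uses hypothetical addresses that are not in the question to compare it with a pattern anchored to the start of the string and the first space. The names loose and strict are illustrative.

```python
import pandas as pd

# Hypothetical addresses, chosen to show where the two patterns differ.
addresses = pd.Series(
    ["300 Broadway", "500-10 Broadway", "Broadway 400"], name="address"
)

# Unanchored: first run of digits anywhere in the string.
loose = addresses.str.extract(r"(\d+)")[0].astype(float).ge(300)

# Anchored: digits only, from the start of the string up to the first space.
strict = addresses.str.extract(r"^(\d+) ")[0].astype(float).ge(300)

print(loose.tolist())   # [True, True, True]
print(strict.tolist())  # [True, False, False]
```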
