Identifying rows with punctuation in a pandas DataFrame

Question

I have a dataframe of parsed first names:

    **FIRST_NAME**
    Jon
    Colleen
    William
    Todd
    J-Crew
    &Re Inc
    123Trust

I create a column to flag whether a name is good or bad:

    df['BAD'] = pd.Series(np.zeros(len(df), dtype=int), index=df.index)

    **FIRST_NAME**        **BAD**
    Jon                     0
    Colleen                 0
    William                 0
    Todd                    0
    J-Crew                  0
    &Re Inc                 0
    123Trust                0

I want to set BAD = 1 if a FIRST_NAME contains punctuation, digits, or whitespace.

    **FIRST_NAME**        **BAD**
    Jon                     0
    Colleen                 0
    William                 0
    Todd                    0
    J-Crew                  1
    &Re Inc                 1
    123Trust                1

Here is my code:

    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ 1234567890'
    i = 0
    while i < len(df):
        for p in punctuation:
            if df.loc[i, 'BAD'] == 1:
                df.loc[i, 'BAD'] = 1
            elif p in df.loc[i, 'FIRST_NAME'] and df.loc[i, 'BAD'] == 0:
                df.loc[i, 'BAD'] = 1
            else:
                df.loc[i, 'BAD'] = 0
        i = i + 1

Is there a way to do this faster?

Answer 1

    df['BAD'] = df['FIRST_NAME'].map(lambda v: any(char in v for char in punctuation))

Another possibility: make your punctuation a set with punctuation = set(punctuation). Then you can do:

    df['BAD'] = df['FIRST_NAME'].map(lambda v: bool(set(v) & punctuation))

Also, if you really just want to know whether all the characters in the string are letters, you could use isalpha, negated because BAD marks the names that are not purely alphabetic:

    df['BAD'] = df['FIRST_NAME'].map(lambda v: not v.isalpha())
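
For reference, here is a minimal, self-contained sketch of the set-based approach, run on the sample names from the question. It assumes the columns are called FIRST_NAME and BAD as in the question's tables. Instead of typing the characters out, it builds them from the standard library's string.punctuation and string.digits plus a space; bad_chars is just an illustrative name.

```python
import string

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'FIRST_NAME': ['Jon', 'Colleen', 'William', 'Todd',
                                  'J-Crew', '&Re Inc', '123Trust']})

# Characters that mark a name as bad: punctuation, digits, and the space.
bad_chars = set(string.punctuation + string.digits + ' ')

# The intersection is non-empty when the name contains any bad character;
# astype(int) turns the resulting booleans into the 0/1 flags from the question.
df['BAD'] = df['FIRST_NAME'].map(lambda v: bool(set(v) & bad_chars)).astype(int)

print(df)
```

On this sample, the last three names are flagged with 1 and the first four stay at 0. Note that the isalpha variant is not strictly equivalent in general: it also accepts non-ASCII letters and flags any non-letter character, not only those in the list.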

Answer 2

Another solution, using the string methods of a pandas Series:

    In [130]: temp
    Out[130]:
           index                 time  complete
    row_0      2                 test         0
    row_1      3  2014-10-23 14:00:00         0
    row_2      4  2014-10-26 08:00:00         0
    row_3      5  2014-10-26 10:00:00         0
    row_4      6  2014-10-26 11:00:00         0

    In [131]: temp.time.str.contains("""[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ 1234567890]""")
    Out[131]:
    row_0    False
    row_1     True
    row_2     True
    row_3     True
    row_4     True
    Name: time, dtype: bool

    In [135]: temp['is_bad'] = temp.time.str.contains("""[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~1234567890]""").astype(int)

    In [136]: temp
    Out[136]:
           index                 time  complete  is_bad
    row_0      2                 test         0       0
    row_1      3  2014-10-23 14:00:00         0       1
    row_2      4  2014-10-26 08:00:00         0       1
    row_3      5  2014-10-26 10:00:00         0       1
    row_4      6  2014-10-26 11:00:00         0       1

pandas.Series.str.contains can accept a regex pattern to match against.
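
Applied to the names from the question, the same idea might look like the sketch below. It assumes the question's sample data and builds the character class with re.escape so that characters such as ] and \ stay literal; names, pattern, and bad are illustrative names only.

```python
import re
import string

import pandas as pd

# Sample names copied from the question.
names = pd.Series(['Jon', 'Colleen', 'William', 'Todd',
                   'J-Crew', '&Re Inc', '123Trust'], name='FIRST_NAME')

# Regex character class holding punctuation, digits, and the space.
# re.escape backslash-escapes the special characters so each one is literal.
pattern = '[' + re.escape(string.punctuation + string.digits + ' ') + ']'

# str.contains does a vectorised regex search; astype(int) gives 0/1 flags.
bad = names.str.contains(pattern, regex=True).astype(int)

print(bad)
```

This keeps the work inside pandas' string methods instead of a Python-level loop over rows and characters.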

Answer 1: score 2, accepted, 2014-10-27 19:08:24
Answer 2: score 0, 2014-10-30 04:23:05
