Pandas: How to remove numbers and special characters from a column

I am trying to remove every character except letters and spaces from a column, but the code I am using turns NaN (null) values into the string 'nan'.

Input data:

col1

ABC ad
YQW \2
AQ4 GH
@34
#45
NaN

Expected output:

col1

ABC ad
YQW
AQ GH
NaN
NaN
NaN

Code I have been using:

df['col1'] = df['col1'].astype(str).str.extract(r'([A-Za-z]+(?: [A-Za-z]+)*)')

Later I use this column to test for NaN, but the test fails because the script above has changed the NaN values to the string 'nan'.

Note: without casting to string with .astype(str), I get:

AttributeError: Can only use .str accessor with string values!
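
The symptom described above can be illustrated with a minimal sketch. It assumes the cast converts each element with Python's str(), which is how the question's output suggests .astype(str) behaved on the OP's pandas version; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

missing = np.nan

# Converting a float NaN to text yields the ordinary string 'nan'.
as_text = str(missing)
print(repr(as_text))

# A null check recognises the real NaN but not the string.
print(pd.isna(missing))
print(pd.isna(as_text))
```

Once the value is the text 'nan', any later isna() or notna() condition treats it as regular data.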

You can do it with the following steps:

  1. First, replace NaN values with an empty string (an empty string can also result from removing characters; both are converted back to NaN afterwards).
  2. Cast the column to string type with .astype(str), in case some elements in the column are not strings.
  3. Replace every character that is neither a letter nor a space with an empty string, using str.replace() with a regex.
  4. Finally, replace empty strings with NaN using .replace().

(Note: The first two steps specifically handle the OP's AttributeError: Can only use .str accessor with string values! In my own testing, adding real integer and float values (actual numeric values, not numbers stored as strings) caused no problem even without the first two steps; perhaps some other special data type is involved. Users who do not hit that error can use only the last two steps, starting from str.replace().)


import numpy as np
df['col1'] = df['col1'].fillna('').astype(str).str.replace(r'[^A-Za-z ]', '', regex=True).replace('', np.nan, regex=False)

Result:

print(df)


     col1
0  ABC ad
1    YQW 
2   AQ GH
3     NaN
4     NaN
5     NaN
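
For a column that already holds only strings and NaN, the last two steps alone can be sketched as below. This is a variation under stated assumptions, not the answer's exact code: the column is built with dtype=object for the example, and the added .str.strip() call is an extra step (not in the answer) that removes the trailing space left in 'YQW ' so the result matches the question's expected output.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the last row is a real missing value.
df = pd.DataFrame(
    {"col1": ["ABC ad", "YQW \\2", "AQ4 GH", "@34", "#45", np.nan]},
    dtype=object,
)

df["col1"] = (
    df["col1"]
    .str.replace(r"[^A-Za-z ]", "", regex=True)  # keep only letters and spaces
    .str.strip()                                  # assumption: trim leftover edge spaces
    .replace("", np.nan)                          # rows with nothing left become NaN
)
print(df)
```

The .str methods pass NaN through unchanged, so no fillna('') is needed in this case and the missing value never becomes the string 'nan'.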


Note that we cannot use .extract() here; we have to use .replace() to get rid of the unwanted characters. Consider a string like ' ab c1d2@ ef4': what regex pattern would extract only the letters and spaces while leaving behind the numbers and special characters? And don't forget that we have to handle the general case, not just the sample data here. Could we enumerate every regex pattern needed to handle the unlimited combinations of letters, spaces, numbers and special characters?
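
The difference between removing and extracting can be checked on that string with Python's re module, which is used here only to keep the sketch small. The extraction pattern is the one from the question; the variable names are illustrative.

```python
import re

text = " ab c1d2@ ef4"

# Removing: one character class covers every unwanted character, wherever it sits.
removed = re.sub(r"[^A-Za-z ]", "", text)
print(repr(removed))  # ' ab cd ef'

# Extracting: the search returns only the first contiguous run that fits the pattern.
match = re.search(r"([A-Za-z]+(?: [A-Za-z]+)*)", text)
print(repr(match.group(1)))  # 'ab c'
```

The extraction stops at the first digit, so everything after 'ab c' is lost, while the replacement keeps all letters and spaces regardless of how they are interleaved.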

Another way is to extract alphanumeric runs while excluding numerals. See the code below:

df['col1'] = df['col1'].str.extract(r'(\w+\s\w+[^0-9]|\w+[^0-9])')

    col1
0  ABC ad
1    YQW 
2  AQ4 GH
3     NaN
4     NaN
5     NaN
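
As the output above shows, row 2 still contains the digit 4. A small check of this pattern on the sample data, sketched under the assumption of an object-dtype column, makes the reason visible: \w matches digits as well as letters, so a digit inside a word is kept. The names col, pattern, extracted and has_digit are illustrative.

```python
import numpy as np
import pandas as pd

col = pd.Series(["ABC ad", "YQW \\2", "AQ4 GH", "@34", "#45", np.nan], dtype=object)

pattern = r"(\w+\s\w+[^0-9]|\w+[^0-9])"
extracted = col.str.extract(pattern, expand=False)  # expand=False returns a Series
print(extracted.tolist())

# Flag extracted values that still contain a digit.
has_digit = extracted.str.contains(r"\d", na=False)
print(extracted[has_digit].tolist())  # ['AQ4 GH']
```

This approach therefore keeps NaN intact without any cast, but it does not fully meet the question's expected output for values with digits embedded in a word.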
