
Removing non-English words (words containing letters and numbers) from text in a df column

How can I remove non-English words, that is, words containing both letters and numbers, from the text in a DataFrame column?

Example:

df['text']

'the interiors nrd studio | happy mothers day ”there is no influence so powerful as that of the mother.” —sara josepha hale... happy mother's day mom & to all the mothers around the world! lots of light natasha 0wet3bxtfl'

'but still missing you every day happy mothers day francis mcclafferty (mccool) 9wlhju7cxf'

From the two rows above, I need to remove the words '0wet3bxtfl' and '9wlhju7cxf'.

The example retains some strings that would not be found in a list of English words ("nrd", "mcclafferty", "mccool") while removing "0wet3bxtfl" and "9wlhju7cxf". The expected result is therefore probably best achieved by removing any non-whitespace sequence that contains either a letter followed by a digit or a digit followed by a letter (together with any spaces that follow), without regard to whether the words are "English" or not.

The following would do this:

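import re

...

# Match any whitespace-delimited token with a digit next to a letter, plus trailing spaces.
pattern = re.compile(r'\S*(?:\d[a-zA-Z]|[a-zA-Z]\d)\S* *')
# re.sub() expects a single string, so apply the pattern per row through the .str accessor.
filtered = df['text'].str.replace(pattern, '', regex=True)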
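To check the behaviour of the pattern without a DataFrame, here is a minimal sketch that applies it to plain strings. The helper name `strip_mixed_tokens`, the constant `MIXED_TOKEN`, and the shortened sample rows are assumptions made for illustration; they are not part of the original answer.

```python
import re

# A token is dropped when it contains a digit directly next to a letter;
# any spaces that follow the token are consumed as well.
MIXED_TOKEN = re.compile(r"\S*(?:\d[a-zA-Z]|[a-zA-Z]\d)\S* *")


def strip_mixed_tokens(text: str) -> str:
    """Remove letter/digit tokens from one string and trim leftover spaces."""
    return MIXED_TOKEN.sub("", text).strip()


# Shortened versions of the two rows from the question (hypothetical sample data).
rows = [
    "the interiors nrd studio lots of light natasha 0wet3bxtfl",
    "but still missing you every day happy mothers day francis mcclafferty (mccool) 9wlhju7cxf",
]

filtered = [strip_mixed_tokens(row) for row in rows]
for line in filtered:
    print(line)
```

With pandas, the same helper could be applied per row via `df['text'].apply(strip_mixed_tokens)`. Note that this rule is purely structural: a token such as "covid19" would also be removed, while a purely numeric token such as "2024" would be kept.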

