How to remove every word with non-alphabetic characters
I need to write a Python script that removes every word containing non-alphabetic characters from a text file, in order to test Zipf's law. For example:
asdf@gmail.com said: I've taken 2 reports to the boss
should become
taken reports to the boss
How should I proceed?
Using regular expressions to match only letters (and underscores), you can do this:
import re
s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()
tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
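One detail worth knowing about the pattern above: `[^\W\d]` means "a word character that is not a digit", so it also admits underscores and non-ASCII letters. A small sketch (my variants, not part of the original answer) showing how to tighten the class if that matters:

```python
import re

tokens = ["I've", "naïve", "under_score", "boss", "x2"]

# [^\W\d_]+ : any Unicode letter, but no digits or underscores
unicode_letters = [t for t in tokens if re.fullmatch(r'[^\W\d_]+', t)]

# [A-Za-z]+ : strictly ASCII letters
ascii_letters = [t for t in tokens if re.fullmatch(r'[A-Za-z]+', t)]

print(unicode_letters)  # ['naïve', 'boss']
print(ascii_letters)    # ['boss']
```

`re.fullmatch` avoids needing the trailing `$` that `re.match` requires to anchor the end of the token.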
Try this:
sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']
result = ' '.join(words)
# taken reports to the boss
You can use split() and isalpha() to get a list of words that contain only alphabetic characters and have at least one character.
>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']
You can then use join() to make the list into one string:
>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
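One caveat (an observation of mine, not part of the original answer): isalpha() is Unicode-aware, so accented and non-Latin words also count as alphabetic. A quick check:

```python
# isalpha() accepts any Unicode letter, not only a-z
print('résumé'.isalpha())  # True
print('る'.isalpha())       # True
print("I've".isalpha())    # False: the apostrophe is not a letter
print(''.isalpha())        # False: empty strings never qualify
```

The last case is why "at least one character" comes for free with this approach.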
The nltk package is specialised in handling text and has various functions you can use to 'tokenize' text into words.

You can either use the RegexpTokenizer, or word_tokenize with a slight adaptation.

The easiest is the RegexpTokenizer:
import nltk
text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."
result = nltk.RegexpTokenizer(r'\w+').tokenize(text)
Which returns:
['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']
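If you would rather not pull in the nltk dependency for this step, RegexpTokenizer(r'\w+') appears to behave the same as re.findall with the same pattern, a sketch:

```python
import re

text = ("asdf@gmail.com said: I've taken 2 reports to the boss. "
        "I didn't do the other things.")

# re.findall(r'\w+', ...) extracts the same runs of word characters
# that RegexpTokenizer(r'\w+') does
tokens = re.findall(r'\w+', text)
print(tokens)
```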
Or you can use the slightly smarter word_tokenize, which is able to split most contractions, like didn't into did and n't.
import re
import nltk
nltk.download('punkt') # You only have to do this once
def contains_letters(phrase):
return bool(re.search('[a-zA-Z]', phrase))
text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."
result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]
which returns:
['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']
array = string.split(' ')
result = []
for word in array:
    if word.isalpha():
        result.append(word)
string = ' '.join(result)
You can either use a regex or a built-in Python function such as isalpha().

Example using isalpha():
with open('file path') as f:
    line = f.readline()
a = line.split()
for i in a:
    if i.isalpha():
        print(i + ' ', end='')
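Note that readline() only processes the first line of the file. A sketch that filters every line (io.StringIO stands in for a real file here so the example is self-contained; the filename in the answer above is a placeholder):

```python
import io

# StringIO plays the role of open('file path') for this sketch
f = io.StringIO("asdf@gmail.com said: I've taken 2 reports to the boss\n"
                "numbers 123 and words\n")

cleaned_lines = []
for line in f:  # iterating a file yields one line at a time
    words = [w for w in line.split() if w.isalpha()]
    cleaned_lines.append(' '.join(words))

print('\n'.join(cleaned_lines))
# taken reports to the boss
# numbers and words
```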
str.join() + a comprehension will give you a one-line solution:
sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
' '.join([i for i in sentence.split() if i.isalpha()])
#'taken reports to the boss'
I ended up writing my own function for this because the regexes and isalpha() weren't working for the test cases I had.
letters = set('abcdefghijklmnopqrstuvwxyz')

def only_letters(word):
    for char in word.lower():
        if char not in letters:
            return False
    return True
# only 'asdf' is valid here
hard_words = ['ís', 'る', '<|endoftext|>', 'asdf']
print([x for x in hard_words if only_letters(x)])
# prints ['asdf']
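On Python 3.7+, the same ASCII-only filter can be written (an alternative of mine, not the answer's original code) by combining str.isascii() with str.isalpha():

```python
hard_words = ['ís', 'る', '<|endoftext|>', 'asdf']

# isascii() rejects accented and non-Latin words,
# isalpha() rejects symbols and empty strings
ascii_only = [w for w in hard_words if w.isascii() and w.isalpha()]
print(ascii_only)  # ['asdf']
```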