How to remove every word that contains non-alphabetic characters

Question

I need to write a Python script that removes every word containing non-alphabetic characters from a text file, in order to test Zipf's law. For example:

asdf@gmail.com said: I've taken 2 reports to the boss

should become

taken reports to the boss

How should I go about this?

Answer 1

Using a regular expression that matches only letters (and underscores), you can do this:

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()

tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
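Note that `[^\W\d]` also accepts underscores and, in Python 3, non-ASCII letters. If only the 26 English letters should count, a stricter pattern is one option. The following is a minimal sketch under that assumption; the name `ascii_word` and the extra tokens `foo_bar` and `ís` are introduced here purely for illustration:

```python
import re

# Hypothetical helper pattern: one or more ASCII letters and nothing else.
ascii_word = re.compile(r'[A-Za-z]+')

s = "asdf@gmail.com said: I've taken 2 reports to the boss foo_bar ís"
tokens = s.split()

# fullmatch requires the whole token to match, so no trailing '$' is needed.
clean_tokens = [t for t in tokens if ascii_word.fullmatch(t)]
clean_s = ' '.join(clean_tokens)
print(clean_s)  # taken reports to the boss
```

With the original pattern, `foo_bar` and `ís` would both be kept, which may or may not be what you want for a word count.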

Answer 2

Try this:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']

result = ' '.join(words)
# taken reports to the boss

Answer 3

You can use split() and isalpha() to get a list of the words that consist only of alphabetic characters and contain at least one character.

>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']

Then you can use join() to turn the list into a single string:

>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
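Since the question's goal is to test Zipf's law, the filtered words can be fed straight into a frequency count. This is only a rough sketch: the sample `text` is made up, and lower-casing the words is an assumption about how you want to count them.

```python
from collections import Counter

# Made-up sample text; in practice this would come from your file.
text = "the boss said the reports went to the boss and to the team"

# Keep purely alphabetic words, lower-cased so 'The' and 'the' count together.
words = [w.lower() for w in text.split() if w.isalpha()]

# most_common() sorts by descending frequency; enumerate supplies the rank.
ranked = Counter(words).most_common()
for rank, (word, freq) in enumerate(ranked, start=1):
    # Under Zipf's law, rank * freq stays roughly constant on a large corpus.
    print(rank, word, freq, rank * freq)
```

On a real corpus you would typically plot frequency against rank on log-log axes and look for an approximately straight line.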

Answer 4

The nltk package is specialised in handling text and has various functions you can use to "tokenize" text into words.

You can either use the RegexpTokenizer, or word_tokenize with a slight adaptation.

The simplest option is the RegexpTokenizer:

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = nltk.RegexpTokenizer(r'\w+').tokenize(text)

Which returns:

`['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']`
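For this particular pattern, a similar token list can be produced with the standard library alone. The sketch below assumes nltk is not available and uses `re.findall` as a stand-in; it is not the nltk implementation. It also adds an `isalpha()` filter, which the answer above does not apply, to drop numeric tokens:

```python
import re

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

# \w+ grabs every maximal run of letters, digits and underscores.
tokens = re.findall(r'\w+', text)

# Drop tokens that are not purely alphabetic, such as '2'.
alpha_tokens = [t for t in tokens if t.isalpha()]
print(alpha_tokens)
```

Be aware that this approach splits `asdf@gmail.com` into `asdf`, `gmail` and `com` and keeps all three, so it does not discard the whole word the way the question asks.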

Or you can use the slightly smarter word_tokenize, which is able to split most contractions, such as didn't into did and n't.

import re
import nltk
nltk.download('punkt')  # You only have to do this once

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]

This returns:

['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']

Answer 5

This may help:

# string is assumed to already hold the input text
array = string.split(' ')
result = []
for word in array:
    if word.isalpha():
        result.append(word)
string = ' '.join(result)

Answer 6

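You can use a regular expression, or one of Python's built-in methods such as isalpha().

Example using isalpha():

# 'file path' is a placeholder for the path to your text file
with open('file path') as f:
    line = f.readline()  # reads only the first line of the file
a = line.split()
for i in a:
    if i.isalpha():
        print(i + ' ', end='')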
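The snippet above handles a single line. To cover a whole file, the same check can be applied line by line. A minimal sketch follows; the generator name `iter_alpha_words` is hypothetical, and `io.StringIO` stands in for a real file so the example runs on its own:

```python
import io

def iter_alpha_words(lines):
    """Yield every purely alphabetic word from an iterable of text lines."""
    for line in lines:
        for word in line.split():
            if word.isalpha():
                yield word

# io.StringIO stands in for an open file; with a real file, pass the
# object returned by open(path) instead.
fake_file = io.StringIO("asdf@gmail.com said: I've taken\n2 reports to the boss\n")
print(' '.join(iter_alpha_words(fake_file)))  # taken reports to the boss
```

Iterating over the file object reads one line at a time, so large files do not have to fit in memory.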

Answer 7

str.join() plus a comprehension gives you a one-line solution:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
' '.join([i for i in sentence.split() if i.isalpha()])
#'taken reports to the boss'

Answer 8

I ended up writing my own function for this, because regular expressions and isalpha() did not work for the test cases I had.

letters = set('abcdefghijklmnopqrstuvwxyz')

def only_letters(word):
    for char in word.lower():
        if char not in letters:
            return False
    return True

# only 'asdf' is valid here
hard_words = ['ís', 'る', '<|endoftext|>', 'asdf']

print([x for x in hard_words if only_letters(x)])
# prints ['asdf']
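If you are on Python 3.7 or later, a shorter variant of the same idea may work by combining `str.isascii()` with `str.isalpha()`. This is a sketch rather than a drop-in replacement; the function name `only_ascii_letters` and the sample list are made up for illustration:

```python
def only_ascii_letters(word):
    # str.isascii() needs Python 3.7+; isalpha() alone would accept 'ís' and 'る'.
    return word.isascii() and word.isalpha()

samples = ['ís', 'る', 'x_y', 'asdf']
print([x for x in samples if only_ascii_letters(x)])  # ['asdf']
```

One difference to keep in mind: this version rejects the empty string, whereas `only_letters('')` above returns True.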

8 solutions

Solution 1  5  Accepted  2017-09-29 09:55:39
Solution 2  2  2017-09-29 09:59:21
Solution 3  2  2017-09-29 10:11:46
Solution 4  2  2017-09-29 10:58:43
Solution 5  0  2017-09-29 09:57:08
Solution 6  0  2017-09-29 09:59:53
Solution 7  0  2017-09-29 10:04:08
Solution 8  0  2021-06-18 02:26:46
