
How to remove every word with non-alphabetic characters

I need to write a Python script that removes every word containing non-alphabetic characters from a text file, in order to test Zipf's law. For example:

asdf@gmail.com said: I've taken 2 reports to the boss

to

taken reports to the boss

How should I proceed?

Using a regular expression that matches only letters (and underscores), you can do this:

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()

tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
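The pattern above also accepts underscores and, because `\w` is Unicode-aware in Python 3, accented letters. If only the 26 ASCII letters should count as alphabetic, a stricter variant might look like the sketch below. The helper name `is_ascii_word` is hypothetical, and the ASCII-only reading is an assumption rather than something the question states.

```python
import re

# Assumption: "alphabetic" means ASCII letters only (no underscores, no accents).
ASCII_WORD = re.compile(r'[A-Za-z]+')

def is_ascii_word(token):
    """Return True if the whole token consists of ASCII letters."""
    return ASCII_WORD.fullmatch(token) is not None

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
clean_tokens = [t for t in s.split() if is_ascii_word(t)]
print(' '.join(clean_tokens))
# taken reports to the boss
```

Using `fullmatch` avoids the need for an explicit `$` anchor, and `+` instead of `*` rejects empty tokens.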

Try this:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']

result = ' '.join(words)
# taken reports to the boss

You can use split() and isalpha() to get a list of the words that contain only alphabetic characters and have at least one character.

>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']

You can then use join() to turn the list into a single string:

>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
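Since the stated goal is to test Zipf's law, the filtered words usually end up in a frequency count. The following is a minimal sketch of that step using `collections.Counter`; the function name `word_frequencies`, the lower-casing, and the two sample lines are assumptions for illustration.

```python
from collections import Counter

def word_frequencies(lines):
    """Count purely alphabetic words, ignoring case."""
    counts = Counter()
    for line in lines:
        # Keep only whitespace-separated tokens made entirely of letters.
        counts.update(w.lower() for w in line.split() if w.isalpha())
    return counts

# Hypothetical sample input; any iterable of strings works.
sample = [
    "asdf@gmail.com said: I've taken 2 reports to the boss",
    "the boss has taken the reports",
]
freqs = word_frequencies(sample)

# Print rank, word and count, most frequent first.
for rank, (word, count) in enumerate(freqs.most_common(), start=1):
    print(rank, word, count)
```

Because a file object iterates over its lines, the same function can be called as `word_frequencies(f)` inside a `with open(...) as f:` block.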

The nltk package specialises in handling text and has various functions you can use to 'tokenize' text into words.

You can use either the RegexpTokenizer or word_tokenize with a slight adaptation.

The simplest is the RegexpTokenizer:

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = nltk.RegexpTokenizer(r'\w+').tokenize(text)

Which returns:

['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']
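Note that this result still contains the token '2', and that the e-mail address is split into pieces rather than dropped. If digit-only tokens should go as well, a follow-up filter with isalpha() is one option. The sketch below uses `re.findall` with the same pattern as a standard-library stand-in for the tokenizer (an assumption made so that it runs without nltk installed); for this input it yields the same tokens as shown above.

```python
import re

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

# Same pattern as the tokenizer above: runs of word characters.
tokens = re.findall(r'\w+', text)

# Drop tokens that contain anything other than letters, such as '2'.
alpha_tokens = [t for t in tokens if t.isalpha()]
print(alpha_tokens)
```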

Or you can use the slightly smarter word_tokenize, which is able to split most contractions, for example didn't into did and n't.

import re
import nltk
nltk.download('punkt')  # You only have to do this once

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]

Which returns:

['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']

Maybe this will help:

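# Example input; replace with the text you want to clean
string = "asdf@gmail.com said: I've taken 2 reports to the boss"
array = string.split(' ')
result = []
for word in array:
    # Keep only words made entirely of letters
    if word.isalpha():
        result.append(word)
string = ' '.join(result)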

You can use either a regex or a Python built-in method such as isalpha().

Example using isalpha():

with open('file path') as f:
    # Process every line of the file, not just the first one
    for line in f:
        for word in line.split():
            # Print only words made entirely of letters
            if word.isalpha():
                print(word + ' ', end='')

str.join() plus a comprehension will give you a one-line solution:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
' '.join([i for i in sentence.split() if i.isalpha()])
#'taken reports to the boss'

I ended up writing my own function for this because the regexes and isalpha() weren't working for the test cases I had.

letters = set('abcdefghijklmnopqrstuvwxyz')

def only_letters(word):
    for char in word.lower():
        if char not in letters:
            return False
    return True

# only 'asdf' is valid here
hard_words = ['ís', 'る', '<|endoftext|>', 'asdf']

print([x for x in hard_words if only_letters(x)])
# prints ['asdf']
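The reason isalpha() falls short for cases like these is that it is Unicode-aware: it returns True for any letter, including 'ís' and 'る'. A shorter alternative to the hand-written loop, assuming Python 3.7 or later for `str.isascii()`, could be sketched as follows. The name `only_ascii_letters` and the sample list are hypothetical.

```python
def only_ascii_letters(word):
    """True if word is non-empty and made only of ASCII letters."""
    return word.isascii() and word.isalpha()

# Hypothetical test words: accented, non-Latin, underscore, plain, empty.
candidates = ['ís', 'る', 'a_b', 'asdf', '']

print([w for w in candidates if only_ascii_letters(w)])
# prints ['asdf']
```

One difference from the loop above: an empty string is rejected here, whereas a loop that never finds an invalid character returns True for it.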
