
How to remove every word with non-alphabetic characters

I need to write a Python script that removes every word containing non-alphabetic characters from a text file, in order to test Zipf's law. For example:

asdf@gmail.com said: I've taken 2 reports to the boss

to

taken reports to the boss

How should I proceed?

Using regular expressions to match only letters (and underscores), you can do this:

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()

tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
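One caveat with the pattern above: [^\W\d] also lets underscores through, since _ counts as a word character. If you want strictly letters, a minimal variant of the same idea using re.fullmatch and an extra _ in the character class (my tweak, not part of the answer above):

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"

# [^\W\d_] = word characters that are neither digits nor underscores,
# i.e. letters only (Unicode-aware in Python 3)
clean_tokens = [t for t in s.split() if re.fullmatch(r'[^\W\d_]+', t)]
# ['taken', 'reports', 'to', 'the', 'boss']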

Try this:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']

result = ' '.join(words)
# taken reports to the boss

You can use split() and isalpha() to get a list of words that contain only alphabetic characters AND have at least one character.

>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']

You can then use join() to make the list into one string:

>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
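Since the original goal is to test Zipf's law, the cleaned list feeds straight into a frequency count; a minimal sketch using collections.Counter, going one step beyond the answer:

from collections import Counter

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
alpha_words = [word for word in sentence.split() if word.isalpha()]

# Zipf's law compares a word's frequency with its frequency rank,
# so rank the cleaned words by count, most common first
for rank, (word, count) in enumerate(Counter(alpha_words).most_common(), 1):
    print(rank, word, count)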

The nltk package specialises in handling text and has various functions you can use to 'tokenize' text into words.

You can either use the RegexpTokenizer, or word_tokenize with a slight adaptation.

The easiest and simplest is the RegexpTokenizer:

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = nltk.RegexpTokenizer(r'\w+').tokenize(text)

Which returns:

['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']
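Note that \w+ also keeps digits (the '2' above) and splits words at the apostrophe. If you want alphabetic runs only, a hedged variant of the same call with a letters-only pattern:

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

# keep only maximal runs of ASCII letters; digits and punctuation are dropped
result = nltk.RegexpTokenizer(r'[A-Za-z]+').tokenize(text)
# ['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', 'reports', 'to',
#  'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']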

Or you can use the slightly smarter word_tokenize, which is able to split most contractions like didn't into did and n't.

import re
import nltk
nltk.download('punkt')  # You only have to do this once

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]

which returns:

['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']
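If you want only strictly alphabetic tokens out of word_tokenize, you can also filter with str.isalpha() instead of the regex; a minimal sketch (note that 'asdf' still survives, because the tokenizer splits it off the email address):

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

# str.isalpha() rejects tokens containing '.', "'", '@' or digits
alpha_tokens = [w for w in nltk.word_tokenize(text) if w.isalpha()]
# ['asdf', 'said', 'I', 'taken', 'reports', 'to', 'the', 'boss',
#  'I', 'did', 'do', 'the', 'other', 'things']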

Maybe this will help:

string = "asdf@gmail.com said: I've taken 2 reports to the boss"
array = string.split(' ')
result = []
for word in array:
    if word.isalpha():  # keep only purely alphabetic words
        result.append(word)
string = ' '.join(result)
# 'taken reports to the boss'

You can either use a regex or use Python's built-in function isalpha().

Example using isalpha():

result = ''
with open('file path') as f:  # replace 'file path' with your file's path
    line = f.readline()  # note: this reads only the first line
    a = line.split()
    for i in a:
        if i.isalpha():
            result += i + ' '
print(result)

str.join() plus a comprehension will give you a one-line solution:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
' '.join([i for i in sentence.split() if i.isalpha()])
#'taken reports to the boss'
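The question mentions a text file, so for completeness: the same comprehension extends across every line of a file; a minimal sketch, assuming the input is named text.txt:

with open('text.txt') as f:
    cleaned = ' '.join(
        word
        for line in f
        for word in line.split()
        if word.isalpha()
    )
# cleaned now holds only the purely alphabetic words of the whole file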

I ended up writing my own function for this because the regexes and isalpha() weren't working for the test cases I had.

letters = set('abcdefghijklmnopqrstuvwxyz')

def only_letters(word):
    for char in word.lower():
        if char not in letters:
            return False
    return True

# only 'asdf' is valid here
hard_words = ['ís', 'る', '<|endoftext|>', 'asdf']

print([x for x in hard_words if only_letters(x)])
# prints ['asdf']
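If you prefer, the same ASCII-only check can be written as a set comparison; an equivalent sketch of the function above:

letters = set('abcdefghijklmnopqrstuvwxyz')

def only_letters(word):
    # True when every lowercased character is an ASCII letter
    return set(word.lower()) <= letters

print([x for x in ['ís', 'る', '<|endoftext|>', 'asdf'] if only_letters(x)])
# ['asdf']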
