Regular expression to get only words starting with a letter from a file, and remove words containing only numbers and punctuation, in Python

Question

I have a text file which I am reading in Python through nltk functions. I need to get only the words from the file that start with a letter, and remove the words that consist only of numbers and punctuation. For example:

['Osteama pranay@123  123 !']

The desired output is:

Osteama pranay@123

Please suggest a regular expression for this.

Answer 1

import re
' '.join(re.findall(r'\b[a-z][^\s]*\b', 'Osteama pranay@123  123 !', re.I))
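
To show how this pattern treats a few edge cases, here is a small sketch. The helper name words_starting_with_letter is made up for illustration and is not part of re or nltk. Tokens that begin with a digit are skipped entirely, and trailing punctuation is trimmed because the match must end at a word boundary.

```python
import re

# Same pattern as above; the helper name is hypothetical, for illustration only.
WORD_RE = re.compile(r'\b[a-z][^\s]*\b', re.I)

def words_starting_with_letter(text):
    """Return the tokens in text that start with an ASCII letter."""
    return WORD_RE.findall(text)

print(words_starting_with_letter('Osteama pranay@123  123 !'))
# ['Osteama', 'pranay@123']
print(words_starting_with_letter('9lives hello! a1'))
# ['hello', 'a1']
```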

The same regexp used with nltk's RegexpTokenizer:

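from nltk.tokenize import RegexpTokenizer

# Match a letter at the start of a word, then non-whitespace up to a word boundary
tokenizer = RegexpTokenizer(r'\b[a-zA-Z][^\s]*\b')
' '.join(tokenizer.tokenize('Osteama pranay@123  123 !'))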

Answer 2

To use regular expressions, you need to import re first.

import re
import sys

import nltk

def openbook(book):
    # Read the whole file into a single string
    with open(book) as f:
        raw = f.read()
    # Split the text into alphanumeric and punctuation tokens
    tokens = nltk.wordpunct_tokenize(raw)
    text = nltk.Text(tokens)
    # Lowercase every token and return the sorted set of distinct tokens
    words = [w.lower() for w in text]
    vocab = sorted(set(words))
    return vocab

if __name__ == "__main__":
    # Pass the path of the text file as the first command-line argument
    print(openbook(sys.argv[1]))

It might help you.
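
Note that openbook returns every token, numbers and punctuation included, and that nltk.wordpunct_tokenize splits on punctuation, so 'pranay@123' would come back as 'pranay', '@' and '123'. A minimal sketch of one way to keep whole whitespace-separated tokens that start with a letter is shown below. The helper letter_words is hypothetical, and restricting the check to ASCII letters is an assumption about what "letter" means in the question.

```python
import re

# Hypothetical helper, not part of nltk: a token qualifies when its
# first character is an ASCII letter (an assumption for this sketch).
STARTS_WITH_LETTER = re.compile(r'[A-Za-z]')

def letter_words(lines):
    """Yield whitespace-separated tokens that start with a letter."""
    for line in lines:
        for token in line.split():
            # re.match only tests the beginning of the token
            if STARTS_WITH_LETTER.match(token):
                yield token

sample = ['Osteama pranay@123  123 !\n', '42 answer, ok\n']
print(' '.join(letter_words(sample)))
# Osteama pranay@123 answer, ok
```

Since an open file iterates over its lines, the same function can be called as list(letter_words(f)) inside a with open(path) as f: block. Unlike the \b-based pattern in Answer 1, this version keeps trailing punctuation such as the comma in 'answer,'.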

Answer 1
0 2016-09-14 15:17:58

Answer 2
0 2016-09-14 16:42:45
