
Regular expression to get only words starting with a letter from a file, removing words made up only of numbers and punctuation, in Python

I have a text file that I am reading in Python through NLTK functions. I need to keep only the words that start with a letter and remove the words that consist only of numbers and punctuation. For example:

['Osteama pranay@123  123 !']

The desired output is:

Osteama pranay@123

Please suggest a regular expression for this.

import re
' '.join(re.findall(r'\b[a-z][^\s]*\b', 'Osteama pranay@123  123 !', re.I))

The same regular expression can be used with nltk.RegexpTokenizer:

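from nltk.tokenize import RegexpTokenizer

# Same pattern as above, with the character class spelled out instead of re.I.
tokenizer = RegexpTokenizer(r'\b[a-zA-Z][^\s]*\b')
tokenizer.tokenize('Osteama pranay@123  123 !')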
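One caveat with the pattern above is that `\b` only asks for a word boundary, so it can also match the tail of a token that starts with punctuation or digits. The sketch below compares it with a stricter variant that requires the letter to be the first character of a whitespace-delimited token. The names `BOUNDARY_PATTERN`, `TOKEN_START_PATTERN`, and the `edge_cases` string are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer above: it starts at a word boundary, so it can
# also pick up the tail of a token such as '@home' or '12-abc'.
BOUNDARY_PATTERN = re.compile(r'\b[a-z][^\s]*\b', re.I)

# Assumed stricter variant: the letter must be the first character of a
# whitespace-delimited token (nothing but whitespace or start of string before it).
TOKEN_START_PATTERN = re.compile(r'(?<!\S)[a-z]\S*', re.I)

sample = 'Osteama pranay@123  123 !'
edge_cases = '@home 12-abc x1'  # made-up input to show the difference

boundary_sample = BOUNDARY_PATTERN.findall(sample)
strict_sample = TOKEN_START_PATTERN.findall(sample)
boundary_edges = BOUNDARY_PATTERN.findall(edge_cases)
strict_edges = TOKEN_START_PATTERN.findall(edge_cases)

print(' '.join(strict_sample))  # Osteama pranay@123
print(boundary_edges)           # ['home', 'abc', 'x1']
print(strict_edges)             # ['x1']
```

Both patterns give the desired output for the sample in the question. Which one fits depends on whether a token like `@home` should contribute `home` or be dropped entirely.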

To use regular expressions, you first need to import the re module (import re).

import sys

import nltk


def openbook(book):
    # Read the whole file and split it into word and punctuation tokens.
    with open(book) as f:
        raw = f.read()
    tokens = nltk.wordpunct_tokenize(raw)
    text = nltk.Text(tokens)
    # Lowercase every token and return the sorted set of distinct ones.
    words = [w.lower() for w in text]
    vocab = sorted(set(words))
    return vocab


if __name__ == "__main__":
    # Pass the path of the text file as the first command-line argument.
    print(openbook(sys.argv[1]))

This might help you.
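The vocabulary code above tokenizes the file but does not yet filter anything, and nltk.wordpunct_tokenize would split 'pranay@123' into 'pranay', '@', and '123'. A minimal sketch of the filtering step, assuming whitespace-delimited tokens are what is wanted, could look like the following. The helper name `letter_vocab` and the use of `str.split()` are assumptions made for illustration.

```python
import re

# Matches a token whose first character is an ASCII letter.
STARTS_WITH_LETTER = re.compile(r'[A-Za-z]')


def letter_vocab(raw):
    """Return the sorted, lowercased set of tokens that begin with a letter."""
    tokens = raw.split()  # whitespace split keeps 'pranay@123' in one piece
    kept = [t.lower() for t in tokens if STARTS_WITH_LETTER.match(t)]
    return sorted(set(kept))


vocab = letter_vocab('Osteama pranay@123  123 ! osteama')
print(vocab)  # ['osteama', 'pranay@123']
```

To apply it to a file, the string passed to `letter_vocab` would be the result of reading that file, as `openbook` does with `f.read()`.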
