
Text clean up in Python

I'm new to Python and can't find a way to remove useless text. The main purpose is to keep the words I want and remove all the rest. At this stage, I can check my in_data and find the words I want: if sentence.find(wordToCheck) is non-negative, I keep the word. Each row of in_data is a sentence, but the current output is one word per line. What I want is to keep the format: find the words in each row and remove the rest.

import Orange
import orange

word = ['roaming','overseas','samsung']
out_data = []

for i in range(len(in_data)):
    for j in range(len(word)):
        sentence = str(in_data[i][0])
        wordToCheck = word[j]
        if(sentence.find(wordToCheck) >= 0):
            print(wordToCheck)

Output:

roaming
overseas
roaming
overseas
roaming
overseas
samsung
samsung

Each row of in_data is a sentence like this:

contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas.

I expect the output to look like this:

overseas roaming overseas

You can use regex for this:

>>> import re
>>> word = ['roaming','overseas','samsung']
>>> s =  "Contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
>>> pattern = r'|'.join(map(re.escape, word))
>>> re.findall(pattern, s)
['overseas', 'roaming', 'overseas']
>>> ' '.join(_)
'overseas roaming overseas'

A non-regex approach would be to use str.join with str.strip and a generator expression. The strip() call is required to get rid of punctuation such as '.' and ','.

>>> from string import punctuation
>>> ' '.join(y for y in (x.strip(punctuation) for x in s.split()) if y in word)
'overseas roaming overseas'

You can do it much more simply, like this:

for w in in_data.split():
    if w in word:
        print(w)

Here we first split in_data on whitespace, which returns a list of words. We then loop through each word in in_data and check whether it equals one of the words you are looking for. If it does, we print it.

For even faster lookup, make the word list a set instead: a membership test on a set takes constant time on average, whereas a list is scanned item by item.
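
A minimal sketch of the set-based lookup, assuming in_data is a single string as in the loop above (the name `wanted` and the sample sentence are illustrative only):

```python
in_data = "contacted vodafone about going overseas and asked about roaming charges"

# A set gives average constant-time membership tests.
wanted = {'roaming', 'overseas', 'samsung'}

kept = [w for w in in_data.split() if w in wanted]
print(' '.join(kept))
```

With only three words to check the difference is negligible; it starts to matter when the list of wanted words grows large.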

In addition, if you want to handle punctuation and symbols, you need to either use a regex or check that every character in the string is a letter. So, to get the output you want:

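import string
in_words = ('roaming','overseas','samsung')
out_words = []

for w in in_data.split():
    # Keep only ASCII letters, dropping punctuation and digits.
    # string.ascii_letters works in Python 2 and 3 (string.letters is Python 2 only).
    w = "".join([c for c in w if c in string.ascii_letters])
    if w in in_words:
        out_words.append(w)
print(" ".join(out_words))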

Here is a simpler way:

>>> import re
>>> i
"Contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
>>> words
['roaming', 'overseas', 'samsung']
>>> [w for w in re.findall(r"[\w']+", i) if w in words]
['overseas', 'roaming', 'overseas']

An answer using split will fall over on punctuation. You need to break up the words with a regular expression.

import re

in_data = "contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."

word = ['roaming','overseas','samsung']
out_data = []

# Split on runs of characters that are neither word characters nor apostrophes.
word_re = re.compile(r'[^\w\']+')
for check_word in word_re.split(in_data):
    if check_word in word:
        print(check_word)
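
The question asks for one output line per input row rather than one word per line. A sketch of that, assuming each row of in_data holds its sentence in the first column (as `in_data[i][0]` in the question suggests), could look like this. The `rows` list is a hypothetical stand-in for the real table.

```python
import re

words = {'roaming', 'overseas', 'samsung'}
token_re = re.compile(r"[^\w']+")

# Hypothetical stand-in for in_data: each row holds one sentence in column 0.
rows = [
    ["asked about going overseas and roaming charges."],
    ["no charge while checking my usage overseas."],
    ["bought a new samsung phone, a samsung!"],
]

out_data = []
for row in rows:
    sentence = str(row[0])
    # Keep only the wanted words, in their original order, for this row.
    matches = [t for t in token_re.split(sentence) if t in words]
    out_data.append(' '.join(matches))

for line in out_data:
    print(line)
```

Collecting the matches per row before joining them is what keeps the row structure: a row with no matches simply becomes an empty line.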
