How to clean a list of strings

I'm trying to clean the following data:

from sklearn import datasets

data = datasets.fetch_20newsgroups(categories=['rec.autos', 'rec.sport.baseball', 'soc.religion.christian'])
texts, targets = data['data'], data['target']

Here texts is a list of articles and targets is a vector containing the index of the category to which each article belongs.

I need to clean all the articles. The cleaning task consists of:

  • Remove headers
  • Remove punctuation
  • Remove parentheses
  • Collapse consecutive blank spaces
  • Remove email addresses and tokens of length 1
  • Remove line breaks

I'm quite new to Python, but I've tried to remove all the punctuation and everything else using replace(). However, I think an easier way to do this must exist.

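def clean_articles(article):
    # Skip the headers (text before the first blank line) and strip some punctuation.
    # The posted snippet was cut off; the trailing .split() is an assumed completion.
    return ' '.join([x for x in article[article.find('\n\n'):].replace('.', '').replace('[', '').split()])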

clean_articles(data['data'][1])

For the following article:

print(data['data'][1])

Uncleaned article:

'From: aas7@po.CWRU.Edu (Andrew A. Spencer)\nSubject: Re: Too fast\nOrganization: Case Western Reserve University, Cleveland, OH (USA)\nLines: 25\nReply-To: aas7@po.CWRU.Edu (Andrew A. Spencer)\nNNTP-Posting-Host: slc5.ins.cwru.edu\n\n\nIn a previous article, wrat@unisql.UUCP (wharfie) says:\n\n>In article <1qkon8$3re@armory.centerline.com> jimf@centerline.com (Jim Frost) writes:\n>>larger engine. That\'s what the SHO is -- a slightly modified family\n>>sedan with a powerful engine. They didn\'t even bother improving the\n>> brakes.\n>\n>\tThat shows how much you know about anything. The brakes on the\n>SHO are very different - 9 inch (or 9.5? I forget) discs all around,\n>vented in front. The normal Taurus setup is (smaller) discs front, \n>drums rear.\n\none i saw had vented rears too...it was on a lot.\nof course, the sales man was a fool..."titanium wheels"..yeah, right..\nthen later told me they were "magnesium"..more believable, but still\ncrap, since Al is so m uch cheaper, and just as good....\n\n\ni tend to agree, tho that this still doesn\'t take the SHO up to "standard"\nfor running 130 on a regular basis. The brakes should be bigger, like\n11" or so...take a look at the ones on the Corrados.(where they have\nbraking regulations).\n\nDREW\n'

Cleaned article:

In previous article UUCP wharfie says In article centerline com com Jim Frost writes larger engine That's what the SHO is slightly modified family sedan with powerful engine They didn't even bother improving the *brakes That shows how much you know about anything The brakes on the SHO are very different inch or forget discs all around vented in front The normal Taurus setup is smaller discs front drums rear one saw had vented rears too it was on lot of course the sales man was fool titanium wheels yeah right then later told me they were magnesium more believable but still crap since Al is so uch cheaper and just as good tend to agree tho that this still doesn't take the SHO up to standard for running 130 on regular basis The brakes should be bigger like 11 or so take look at the ones on the Corrados where they have braking regulations DREW

Note: this is not a complete answer, but the following will at least get you halfway:

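  • remove punctuation
  • remove line breaks
  • remove consecutive white space
  • remove parentheses

import re

s = ';\n(a    b.,'
print('before:', s)
# Delete punctuation, brackets, and line breaks (raw strings keep the regex escapes intact).
s = re.sub(r'[.,;\n(){}\[\]]', '', s)
# Collapse any run of whitespace into a single space.
s = re.sub(r'\s+', ' ', s)
print('after:', s)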

This will print:

before: ;
(a    b.,
after: a b
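
Building on the same regex idea, the remaining items from the question (headers, email addresses, tokens of length 1) could be handled roughly as follows. This is only a sketch: the name clean_article and both patterns are assumptions made for illustration, and the email pattern in particular is a crude "anything around an @" match rather than a real address validator. It also keeps apostrophes so that words like "didn't" survive, which may or may not be what you want.

```python
import re

# Hypothetical patterns, chosen for illustration only.
EMAIL_RE = re.compile(r'\S+@\S+')       # crude match for e-mail-like tokens
NON_WORD_RE = re.compile(r"[^\w\s']")   # punctuation, brackets, quote markers

def clean_article(article):
    # Headers end at the first blank line; keep only the text after it.
    _, sep, body = article.partition('\n\n')
    if not sep:
        body = article  # no header block found, so clean the whole text
    body = EMAIL_RE.sub(' ', body)      # drop e-mail-like tokens
    body = NON_WORD_RE.sub(' ', body)   # drop punctuation and brackets
    # split() with no arguments also collapses line breaks and repeated spaces.
    tokens = [t for t in body.split() if len(t) > 1]
    return ' '.join(tokens)

sample = ('From: a@b.edu (Some One)\nSubject: Re: test\n\n'
          'In a previous article, x@y.com (foo) says:\n'
          '>The  brakes   are [very] different.\n')
print(clean_article(sample))
# In previous article foo says The brakes are very different
```

To clean every article, something like [clean_article(t) for t in texts] should do. For the headers specifically, fetch_20newsgroups also accepts a remove argument, e.g. remove=('headers', 'footers', 'quotes'), which strips them at load time.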
