How to remove a similar footer from many documents with regex in Python
I am trying to clean about 18,000 documents to train a word2vec classifier. A sample document looks like this:
From: shou@logos.asd.sgi.com (Tom Shou)
Subject: Ford Explorer 4WD - do I need performance axle?
We're considering getting a Ford Explorer XLT with 4WD and we have the
following questions (All we would do is go skiing -- no off-roading):
1. With 4WD, do we need the "performance axle" - (limited slip axle).
Its purpose is to allow the tires to act independently when the tires
are on different terrain.
2. Do we need the all-terrain tires (P235/75X15) or will the
all-season (P225/70X15) be good enough for us at Lake Tahoe?
Thanks,
Tom
-- *(this may also be ---)*
===========================================================================*(in some cases only one of the two ===== boundaries is present, e.g. only the top one, and their length may vary)*
Tom Shou Silicon Graphics
shou@asd.sgi.com 2011 N. Shoreline Blvd.
415-390-5362 MS 8U-815
415-962-0494 (fax) Mountain View, CA 94043
===========================================================================
So I need to remove the footer part. I am able to remove the From and Subject lines (the first two lines of the document) with regex, but I am not able to remove this part:
--
===========================================================================
Tom Shou Silicon Graphics
shou@asd.sgi.com 2011 N. Shoreline Blvd.
415-390-5362 MS 8U-815
415-962-0494 (fax) Mountain View, CA 94043
===========================================================================
Some footers have just two or three dashes, like:
--
Tom Shou Silicon Graphics
shou@asd.sgi.com 2011 N. Shoreline Blvd.
415-390-5362 MS 8U-815
415-962-0494 (fax) Mountain View, CA 94043
or
---
Tom Shou Silicon Graphics
shou@asd.sgi.com 2011 N. Shoreline Blvd.
415-390-5362 MS 8U-815
415-962-0494 (fax) Mountain View, CA 94043
or sometimes it can have _ or + instead of ====, like:
--(this may be ---, or may not exist at all, but then the ______ lines below will be there)
________________________________________________________________________(this can be + also)
Tom Shou Silicon Graphics
shou@asd.sgi.com 2011 N. Shoreline Blvd.
415-390-5362 MS 8U-815
415-962-0494 (fax) Mountain View, CA 94043
_________________________________________________________________________
I am not very good at regex, but I tried to remove the footer with a wrong regex like ((_|-|=|\+){2,})(.|\n)*
but I didn't consider that -- also occurs within the content, so it removed content that I want to keep. For example, the content contains a line (the 4th line) with two dashes:
All we would do is go skiing -- no off-roading):
So it removed everything from that -- onward, which I don't want. I only want to remove the footer.
So I want to know what my regex should look like, or what method I should use to clean the footer, even when -- or --- is not always present but there is a box made with ______, ++++++ or =========, or vice versa.
Please help, thanks in advance.
You may use
re.sub(r'(?ms)^[ \t]*([-_=+])\1+.*', '', text)
See the regex demo
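As a rough illustration, the pattern can be wrapped in a small helper and run on a shortened version of the sample document. The names FOOTER_RE and strip_footer are hypothetical, and the trailing rstrip() is an added assumption to drop the newline left before the footer. This is a sketch, not a tested cleaner for the whole corpus.

```python
import re

# Pattern from the answer: a line that starts (after optional indentation)
# with two or more identical -, _, = or + characters, plus everything after it.
FOOTER_RE = re.compile(r'(?ms)^[ \t]*([-_=+])\1+.*')

def strip_footer(text):
    """Drop the first delimiter line and everything that follows it."""
    # rstrip() is an extra step (not in the answer) that removes the
    # newline left at the end of the body.
    return FOOTER_RE.sub('', text).rstrip()

sample = (
    "following questions (All we would do is go skiing -- no off-roading):\n"
    "Thanks,\n"
    "Tom\n"
    "--\n"
    "===========\n"
    "Tom Shou Silicon Graphics\n"
    "===========\n"
)

print(strip_footer(sample))
```

The -- inside the first line survives because ^ only lets the match begin at the start of a line, so the match starts at the standalone -- line instead.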
Details
(?ms) - re.M (^ will match the start of a line) and re.DOTALL (. will match any chars) are enabled
^ - start of a line
[ \t]* - zero or more horizontal whitespaces (you may also use [^\S\r\n]* for that)
([-_=+]) - Group 1: a -, _, =, or +
\1+ - the same char as captured into Group 1, one or more times
.* - the rest of the string.
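One caveat follows from this breakdown: the match starts at the first line that begins with a doubled delimiter character, even if more text follows on that line, so a body line such as "-- see my earlier post" would also trigger truncation. A possible stricter variant, which is an assumption of this sketch and not part of the answer above, requires the delimiter line to contain nothing but the repeated character. STRICT_FOOTER_RE and strip_footer_strict are hypothetical names.

```python
import re

# Hypothetical stricter variant: the delimiter line must consist only of the
# repeated character (plus optional surrounding spaces or tabs).
STRICT_FOOTER_RE = re.compile(r'(?ms)^[ \t]*([-_=+])\1+[ \t]*$.*')

def strip_footer_strict(text):
    """Drop the first pure delimiter line and everything after it."""
    return STRICT_FOOTER_RE.sub('', text).rstrip()

doc = (
    "Body line\n"
    "-- not a footer, just a dashed remark\n"
    "More body\n"
    "--\n"
    "Tom Shou Silicon Graphics\n"
)

print(strip_footer_strict(doc))
```

This still cuts at the first line made up purely of delimiter characters, so a document that uses such a line inside its body (for example as a heading underline) would still be truncated there. Checking a sample of the cleaned documents by hand seems prudent either way.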