How to remove similar footers from many documents using regular expressions in Python

Question

I am trying to clean about 18,000 documents to train a word2vec classifier. A sample document looks like this:

From: shou@logos.asd.sgi.com (Tom Shou)
Subject: Ford Explorer 4WD - do I need performance axle?

We're considering getting a Ford Explorer XLT with 4WD and we have the
following questions (All we would do is go skiing -- no off-roading):

1. With 4WD, do we need the "performance axle" - (limited slip axle).
Its purpose is to allow the tires to act independently when the tires
are on different terrain. 

2. Do we need the all-terrain tires (P235/75X15) or will the
all-season (P225/70X15) be good enough for us at Lake Tahoe?

Thanks,

Tom

-- *(this may also be ---)*

===========================================================================*(in some cases only one of the two ===== boundaries is present, e.g. only the top one, and its length may differ)*

        Tom Shou            Silicon Graphics
    shou@asd.sgi.com        2011 N. Shoreline Blvd. 
    415-390-5362            MS 8U-815 
    415-962-0494 (fax)      Mountain View, CA 94043

===========================================================================

So I need to remove the footer part. I am able to remove the From and Subject lines (the first two lines of the document) with a regex, but I am not able to remove this part:

-- 


    ===========================================================================

            Tom Shou            Silicon Graphics
        shou@asd.sgi.com        2011 N. Shoreline Blvd. 
        415-390-5362            MS 8U-815 
        415-962-0494 (fax)      Mountain View, CA 94043

    ===========================================================================

Some footers have just two or three dashes, like:

-- 


            Tom Shou            Silicon Graphics
        shou@asd.sgi.com        2011 N. Shoreline Blvd. 
        415-390-5362            MS 8U-815 
        415-962-0494 (fax)      Mountain View, CA 94043

or

  --- 


                Tom Shou            Silicon Graphics
            shou@asd.sgi.com        2011 N. Shoreline Blvd. 
            415-390-5362            MS 8U-815 
            415-962-0494 (fax)      Mountain View, CA 94043

or sometimes it can have _ or + instead of ====, like:

   --(this may be ---, or may not exist at all, but then the ______ lines below will be there)


 ________________________________________________________________________(this can be + also)

                Tom Shou            Silicon Graphics
            shou@asd.sgi.com        2011 N. Shoreline Blvd. 
            415-390-5362            MS 8U-815 
            415-962-0494 (fax)      Mountain View, CA 94043

 _________________________________________________________________________

I am not very good at regex, but I tried a wrong regex like ((_|-|=|\+){2,})(.|\n)* without considering that -- also occurs within the content, so it removed content as well, which I don't want. For example, the content has a line (the 4th line) with two dashes: "All we would do is go skiing -- no off-roading):". So it removed everything from that -- onward. I only want to remove the footer.
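To illustrate the over-matching described above, here is a minimal sketch. It assumes the attempted pattern was meant to be written with single backslashes, and the shortened `body` string is a made-up stand-in for the sample document. Because the pattern is not anchored to the start of a line, the first `--` anywhere in the text starts the match:

```python
import re

# The attempted pattern from the question (single backslashes assumed).
attempt = re.compile(r'((_|-|=|\+){2,})(.|\n)*')

# Shortened, hypothetical stand-in for the sample document.
body = (
    "All we would do is go skiing -- no off-roading):\n"
    "\n"
    "Thanks,\n"
    "\n"
    "Tom\n"
    "\n"
    "-- \n"
    "Tom Shou            Silicon Graphics\n"
)

# The match begins at the "--" inside the sentence, not at the footer,
# and (.|\n)* then swallows everything up to the end of the string.
result = attempt.sub('', body)
print(repr(result))
```

Everything after "go skiing " is lost, including the rest of the message body.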

So I want to know what my regex should look like, or what method I should use to remove the footer even when -- or --- is not always present but there is a box made with ______, ++++++ or =========, or vice versa.

Please help. Thanks in advance.

Answer 1

You may use

import re
text = re.sub(r'(?ms)^[ \t]*([-_=+])\1+.*', '', text)

See the regex demo.

Details

(?ms) - re.M (^ will match the start of a line) and re.DOTALL (. will match any char, including line breaks) are enabled
^ - start of a line
[ \t]* - zero or more horizontal whitespace characters (you may also use [^\S\r\n]* for that)
([-_=+]) - Group 1: a -, _, =, or +
\1+ - the same char as captured into Group 1, one or more times
.* - the rest of the string.
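As a rough check of the pattern described above, here is a small sketch applied to a shortened version of the sample document. The `strip_footer` helper, the `FOOTER_RE` name and the trailing `rstrip()` are illustrative additions, not part of the original answer. One caveat worth keeping in mind: any body line that itself starts with two or more identical -, _, = or + characters (for example an underlined heading) would also be treated as the start of the footer.

```python
import re

# Pattern from the answer: a line that starts (after optional spaces/tabs)
# with two or more identical -, _, = or + characters, plus everything after it.
FOOTER_RE = re.compile(r'(?ms)^[ \t]*([-_=+])\1+.*')

def strip_footer(text: str) -> str:
    """Hypothetical helper: drop the footer, then trim trailing whitespace."""
    return FOOTER_RE.sub('', text).rstrip()

# Shortened stand-in for the sample document from the question.
sample = (
    "We're considering a Ford Explorer\n"
    "(All we would do is go skiing -- no off-roading):\n"
    "\n"
    "Thanks,\n"
    "\n"
    "Tom\n"
    "\n"
    "-- \n"
    "\n"
    "    ======================\n"
    "        Tom Shou            Silicon Graphics\n"
    "    ======================\n"
)

cleaned = strip_footer(sample)
print(cleaned)
```

The `--` inside the sentence is left alone because it is not at the start of a line, while the `-- ` line that opens the footer is. For the 18,000 documents, the same helper could be applied in a loop or list comprehension over the loaded texts.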

Answer 1: 2, accepted, 2018-11-05 12:50:29
