How to remove similar footers from many documents with regex in Python

I am trying to clean about 18,000 documents to train a word2vec classifier. A sample document looks like this:

From: shou@logos.asd.sgi.com (Tom Shou)
Subject: Ford Explorer 4WD - do I need performance axle?

We're considering getting a Ford Explorer XLT with 4WD and we have the
following questions (All we would do is go skiing -- no off-roading):

1. With 4WD, do we need the "performance axle" - (limited slip axle).
Its purpose is to allow the tires to act independently when the tires
are on different terrain. 

2. Do we need the all-terrain tires (P235/75X15) or will the
all-season (P225/70X15) be good enough for us at Lake Tahoe?


Thanks,


Tom

-- *(this may also be ---)*


=========================================================================== *(in some cases only one of the two ===== boundaries is present, e.g. only the top one, and their lengths may differ)*

        Tom Shou            Silicon Graphics
    shou@asd.sgi.com        2011 N. Shoreline Blvd. 
    415-390-5362            MS 8U-815 
    415-962-0494 (fax)      Mountain View, CA 94043

===========================================================================

So I need to remove the footer. I am able to remove the From and Subject lines (the first two lines of the document) with a regex, but I am not able to remove this part:

-- 


    ===========================================================================

            Tom Shou            Silicon Graphics
        shou@asd.sgi.com        2011 N. Shoreline Blvd. 
        415-390-5362            MS 8U-815 
        415-962-0494 (fax)      Mountain View, CA 94043

    ===========================================================================

Some footers have only two or three dashes and no box, like:

-- 


            Tom Shou            Silicon Graphics
        shou@asd.sgi.com        2011 N. Shoreline Blvd. 
        415-390-5362            MS 8U-815 
        415-962-0494 (fax)      Mountain View, CA 94043

or

  --- 


                Tom Shou            Silicon Graphics
            shou@asd.sgi.com        2011 N. Shoreline Blvd. 
            415-390-5362            MS 8U-815 
            415-962-0494 (fax)      Mountain View, CA 94043

or sometimes the box is drawn with _ or + instead of ====, like:

   -- (this may be ---, or may not exist at all, but then the ______ lines below will be there)


 ________________________________________________________________________(this can be + also)

                Tom Shou            Silicon Graphics
            shou@asd.sgi.com        2011 N. Shoreline Blvd. 
            415-390-5362            MS 8U-815 
            415-962-0494 (fax)      Mountain View, CA 94043

 _________________________________________________________________________

I am not very good at regex. I tried a regex like ((_|-|=|\+){2,})(.|\n)* but I did not consider that -- also occurs within the content, so it removed content that I want to keep. For example, the 4th line of the content contains two dashes: All we would do is go skiing -- no off-roading): so the regex removed that -- and everything after it. I only want to remove the footer.

So I would like to know what my regex should look like, or what other method I should use, to remove the footer even when -- or --- is not present but a box made with ______, ++++++ or ========= is, or vice versa.

Please help. Thanks in advance.
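The over-matching described above can be reproduced on a shortened version of the sample. This is only an illustrative sketch: the variable names and the abbreviated text are assumptions, not part of the original post.

```python
import re

# The attempted pattern: two or more of _ - = + ANYWHERE in the text,
# followed by everything up to the end of the string.
attempt = re.compile(r'((_|-|=|\+){2,})(.|\n)*')

# Shortened stand-in for the sample document (hypothetical excerpt).
text = (
    "All we would do is go skiing -- no off-roading):\n"
    "\n"
    "Thanks,\n"
    "Tom\n"
    "-- \n"
    "Tom Shou            Silicon Graphics\n"
)

result = attempt.sub('', text)

# The first "--" sits in the middle of a content line, so the match
# starts there and swallows the rest of the document.
print(repr(result))
```

Because the pattern is not tied to the start of a line, the inline `--` is treated as the footer delimiter and the body after it is lost.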

You may use

re.sub(r'(?ms)^[ \t]*([-_=+])\1+.*', '', text)

See the regex demo.

Details

  • (?ms) - re.M (^ will match the start of a line) and re.DOTALL (. will match any char, including line breaks) are enabled
  • ^ - start of a line
  • [ \t]* - zero or more horizontal whitespaces (you may also use [^\S\r\n]* for that)
  • ([-_=+]) - Group 1: a -, _, =, or +
  • \1+ - the same char as captured into Group 1, one or more times
  • .* - the rest of the string.
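As a minimal sketch of how the pattern above might be applied, the snippet below precompiles it and runs it on a shortened sample. The helper name `strip_footer` and the abbreviated sample text are assumptions made for illustration.

```python
import re

# Pattern from the answer: a line that starts (after optional spaces/tabs)
# with two or more identical -, _, = or + characters, plus everything after it.
FOOTER_RE = re.compile(r'(?ms)^[ \t]*([-_=+])\1+.*')


def strip_footer(text):
    """Return text with the footer (if any) removed. Hypothetical helper."""
    return FOOTER_RE.sub('', text)


# Shortened stand-in for the sample document (hypothetical excerpt).
sample = (
    "We're considering getting a Ford Explorer XLT with 4WD and we have the\n"
    "following questions (All we would do is go skiing -- no off-roading):\n"
    "\n"
    "Thanks,\n"
    "\n"
    "Tom\n"
    "\n"
    "-- \n"
    "\n"
    "===========================\n"
    "\n"
    "        Tom Shou            Silicon Graphics\n"
    "\n"
    "===========================\n"
)

cleaned = strip_footer(sample)
print(cleaned)
```

The inline `--` in the second line survives because it does not start a line. Note that the same rule cuts both ways: any body line that itself begins with `--`, `__`, `==` or `++` would be treated as the start of the footer, so it may be worth spot-checking a few cleaned documents before processing all of them.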
