Using regex to try and extract paragraphs from emails

I'm trying to use regex to extract paragraphs from text which is of a form like the following:

<0.30.1.92.13.39.38.marian+@MARIAN.ADM.CS.CMU.EDU (Marian D'Amico).0>
Type:     cmu.cs.scs
Topic:    LOGIC COLLOQUIUM
Dates:    6-Feb-92
Time:     3:30
Host:     Stephen D. Brookes
PostedBy: marian+ on 30-Jan-92 at 13:39 from MARIAN.ADM.CS.CMU.EDU 
(Marian D'Amico)
Abstract: 



***********************************************************************
          Logic Colloquium
            Thursday February 6
           3:30 Wean 5409
 **********************************************************************
       On The Mathematics of Non-monotonic Reasoning
          Menachem Magidor
       Hebrew University of Jerusalem
          (Joint work with Daniel Lehman)

Non-monotonic reasoning is an attempt to develop reasoning systems
where an inference means that the conclusion holds in the "normal 
case",
in "most cases", but it does not necessarily hold in all cases. It 
seems 
that this type of reasoning is needed if one wants to model everyday
common-sense reasoning. There have been many models suggested for
non-monotonic reasoning (like circumscription, default logic, 
autoepistemic logic, etc). We study all these approaches in a more 
abstract fashion by considering the inference relation of the 
reasoning system, and clarify the role of different inference rules 
and the impact they have on the model theory of the logic. We are 
especially interested in a particular rule called "Rational Monotony" 
and the connection between it and probabilistic models.

 NOTE: Prof. Magidor will also give a Math Department Colloquium on 
Friday
 February 7.

-------------------------
 Host:  Stephen D. Brookes

Appointments can be made through Marian D'Amico, marian@cs, x7665.

I'm currently trying:

paragraphRegex = r'(?<=\\n\\n)(?:(?:\\s*\\b.+\\b:(?:.|\\s)+?)|(\\s{0,4}A-Za-z0-9+? \\s*))(?=\\n\\n)'

However, this regex only works in some cases; in others it either fails to capture a paragraph or hangs.

Any help would be appreciated.

I would try a different approach.

You can split the text into lines on the newline character:

texts = text.split('\n')

From there, develop a test to determine if the text is part of the body of the email or something else. Perhaps look for text chunks where the leading/following lines are blank. Something like this might work:

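paragraphs = []

for i, line in enumerate(texts):
    # Skip the first and last lines so texts[i-1] and texts[i+1] both exist.
    if 0 < i < len(texts) - 1:
        # Keep a non-empty line that follows a blank line and is itself
        # followed by a non-empty line, i.e. the first line of a multi-line block.
        if line != '' and texts[i-1] == '' and texts[i+1] != '':
            paragraphs.append(line)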
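
A variation on the same idea, as a minimal sketch: rather than testing single lines, group consecutive non-blank lines into blocks so that a paragraph wrapped over several lines stays together. The function name split_paragraphs and the sample string are hypothetical, made up for illustration only.

```python
import re

def split_paragraphs(text):
    """Split text into blocks separated by one or more blank lines."""
    # A "blank" line may still contain spaces or tabs, hence \s* between the newlines.
    blocks = re.split(r'\n\s*\n', text.strip())
    # Collapse the line breaks inside each block into single spaces.
    return [' '.join(block.split()) for block in blocks if block.strip()]

# Hypothetical sample, loosely modelled on the email in the question.
sample = (
    "Topic: LOGIC COLLOQUIUM\n"
    "\n"
    "Non-monotonic reasoning is an attempt\n"
    "to develop reasoning systems.\n"
    "\n"
    " \n"
    "Host: Stephen D. Brookes\n"
)

for paragraph in split_paragraphs(sample):
    print(paragraph)
```

Header lines such as "Topic: ..." still come back as blocks here, so a further test is needed to keep only the body paragraphs.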

By the way, you can only get so far using regular expressions. Most text data sources vary a lot in format, and a regular expression will never capture every edge case. I had to do this once, and it was more robust (and easier) to build a classification model to identify the paragraphs.

That is its own research project, but if you go that way, take a look at pairing term frequency-inverse document frequency (TF-IDF) with Support Vector Classifiers (SVC), and don't let anyone talk you into using neural nets unless you have a lot of good training data :).
