

Separating non-structured sentences from a text corpus

I'm working on a project where I have to separate proper sentences from the rest of a text corpus. I have tried the NLTK sentence tokenizer, but it seems to split sentences only on periods (".").

So I was wondering: is there any way to separate tabular data and phrases from the sentences in the text file?

Here is a sample text file. I'm referring to the content under the TEXT tag.

<?xml version='1.0' encoding='UTF-8'?>

Record date: 2078-09-07


Name: Goldberg, Joel

MR #: 0370149 

Date of admission: 9/6/2078

Resident: Lange/Bailey

Attending: Schmidt MD

PCP: Odom, Kacie MD

CC:    L foot pain   

HPI: The patient is a 48 yo gentleman with a hx of DM2, peripheral neuropathy and PVD with multiple admissions for LE cellulitis in the setting of gangrenous toes in the past 5 years, last one in July. He now presents with acute on chronic LLE sweeling that began this morning after he got up walked around his home for about 2-3 hrs and then suddenly felt an acute pain shooting up his leg, with a severity of 10/10, he knew right away this was similar to the pain he had felt before on prior admissions for cellulitis so he called 911. On arrival to the ED his temp was 98.1, 112, 145/79, 20, 99%RA and was started on antibiotic treatment with Unasyn for cellulitis.      

ROS:  Per HPI. No F/C/NS.  No CP/Palps.  No Orthopnea. No SOB/cough/hemoptysis/wheezing/sore throat/.  No hematochezia/melena. No delta MS/LOC. No slurring of speech, unilateral weakness. No dysuria.  No chills or fevers, no lightheadedness.


1.  DM2 diagnosed in 2075, says peripheral neuropathy was diagnosed around the same time, denies any retinopathy or nephropathy.

2.  Peripheral vascular disease with the following surgeries performed:

  Right 5th toe amputation 2/2 osteomyelitis 12/14/76

Right 4th toe amputation 2/2 wet gangrene 9/03/76

Angioplasty and stenting of the distal LEFT superficial femoral artery 11/6/76

Angioplasty and stenting of the distal RIGHT superficial femoral artery 7/20/76

I&D of right thigh abscess 4/75

Medications on admission (confirmed with patient):

1.  Glyburide 2.5mg BID

2.  Glucopahge 500mg QD

3.  Zestril 2.5mg QD

4.  Percocet PRN

ALL:  Codeine upsets his stomach

SH: Lives in Arroyo Grande apartment with friend, works occasionally as a copy editor but unemployed right now, has smoked 1/2ppd for 35 years, no ETOH, no drugs. Adequate diet.

FH: Many family members with DM.    

Physical Exam:  

V:  98.5, 149/84, 98, 18, 99%RA

Gen:  NAD, conversant


Neck:  Supple, no thyromegaly, no carotid bruits, JVP 

Nodes: No cervical or supraclavicular LAN

Cor: RRR S1, S2 nl.  No m/r/g.  No S3, S4

Chest: CTAB  

Abdomen: +BS Soft, NT, ND.  No HSM, No CVA tenderness. 

Ext: LLE with dorsal and medial erythema, extending from L 5th toe that has eschar on its side and is mildly tender, no secretions. L toe also tender. Pulse on LLE + and RLE ++.

Skin: No other rashes

Neuro: AO X 3. CN II-XII intact. Decreased sensation from LT up to knee on R and 4cm above ankle on Left..

Labs and Studies:




NA          137                                                       

K           4.5(T)                                                    

CL          106                                                       

CO2         28.4                                                      

BUN         25                                                        

CRE         0.9                                                       

GLU         266(H)                                                    

CA          9.5                                                       

PHOS        3.1                                                       

MG          1.7                                                       


WBC         12.7(H)                                                   

RBC         4.13(L)                                                   

HGB         13.0(L)                                                   

HCT         37.1(L)                                                   

MCV         90                                                        

MCH         31.5                                                      

MCHC        35.0                                                      

PLT         165                                                       

RDW         13.3                                                      

DIFFR       Received                                                  

METHOD      Auto                                                      

%NEUT       79(H)                                                     

%LYMPH      17(L)                                                     

%MONO       3(L)                                                      

%EOS        1                                                         

%BASO       0                                                         

ANEUT       10.02(H)                                                  

ALYMP       2.13                                                      

AMONS       0.44(H)                                                   

AEOSN       0.11                                                      

ABASOP      0.03                                                      

ANISO       None                                                      

HYPO        None                                                      

MACRO       None                                                      

MICRO       None                                                      

PT          11.9                                                      

PTT         25.0                                                      

LENIS: Negative for DVT, did not assess arteries.

FOOT ANKLE XR: There is a lytic lesion in the distal lateral aspect of the proximal phalanx of the fifth toe. This can be consitent with an area of infection/osteomyelitis.


21-Jul-2076 09:41  

  Specimen Type:     WOUND

  Specimen Comment: ULCER  4TH 5TH TOE

  Wound Culture - Final    Reported: 24-Jul-76 15:05



      Antibiotic                      Interpretation


      Amikacin                        Susceptible   

      Ampicillin                      Resistant     

      Aztreonam                       Susceptible   

      Cefazolin                       Resistant     

      Cefepime                        Susceptible   

      Cefpodoxime                     Susceptible   

      Ceftriaxone                     Susceptible   

      Gentamicin                      Susceptible   

      Levofloxacin                    Susceptible   

      Piperacillin                    Susceptible   

      Trimethoprim/Sulfamethoxazole   Susceptible   

A/P: 48M with a hx of DM2, PVD and multiple admissions in the past for LE cellulitis in the setting of gangrene. 

1.  ID: Patient is now presenting with appears to be another episode of cellulitis but now probably coming from his L 5th Toe lesion. Surgery has debrided the wound, sending wound cultures as well as blood cultures. Acute OM would not be visible on XR changes and clinical picture is more consistent with acute than Chronic OM. Will consider further work up for OM if symptoms do not respond to treatment. Levo and flagyl were added to unasyn in accord to previous culture data.

2.  PVD: Will need arterial LENIS to assess for vascular patency and flow. Continuing ACEI, and adding ASA and lipitor, will order lipid profile and smoking cessation consult.

3.  DM2: Very poor control last admission, eventhough patient now says he takes medications and checks it up to QID. Will order HgbA1C and glucose monitoring.


Name Ian Jurado MD                               

Pager # 14558


    <MEDICATION id="DOC0" time="during DCT" type1="ACE inhibitor" type2="">
      <MEDICATION id="M0" start="1782" end="1789" text="Zestril" time="during DCT" type1="ACE inhibitor" type2="" comment=""/>
      <MEDICATION id="M1" start="1782" end="1789" text="Zestril" time="during DCT" type1="ACE inhibitor" type2="" comment=""/>
      <MEDICATION id="M2" start="1782" end="1789" text="Zestril" time="during DCT" type1="ACE inhibitor" type2="" comment=""/>
    <MEDICATION id="DOC1" time="after DCT" type1="statin" type2="">
      <MEDICATION id="M3" start="7111" end="7133" text="adding ASA and lipitor" time="after DCT" type1="statin" type2="" comment=""/>
      <MEDICATION id="M4" start="7126" end="7133" text="lipitor" time="after DCT" type1="statin" type2="" comment=""/>
    <DIABETES id="DOC2" time="before DCT" indicator="mention">
      <DIABETES id="D0" start="296" end="299" text="DM2" time="before DCT" indicator="mention" comment=""/>
      <DIABETES id="D1" start="296" end="299" text="DM2" time="before DCT" indicator="mention" comment=""/>
      <DIABETES id="D2" start="1180" end="1183" text="DM2" time="before DCT" indicator="mention" comment=""/>
      <DIABETES id="D3" start="6444" end="6447" text="DM2" time="before DCT" indicator="mention" comment=""/>
      <DIABETES id="D4" start="7195" end="7198" text="DM2" time="before DCT" indicator="mention" comment=""/>
      <DIABETES id="D5" start="296" end="299" text="DM2" time="before DCT" indicator="mention" comment=""/>
    <MEDICATION id="DOC3" time="after DCT" type1="sulfonylureas" type2="">
      <MEDICATION id="M5" start="1734" end="1743" text="Glyburide" time="after DCT" type1="sulfonylureas" type2="" comment=""/>
      <MEDICATION id="M6" start="1734" end="1743" text="Glyburide" time="after DCT" type1="sulfonylureas" type2="" comment=""/>
      <MEDICATION id="M7" start="1734" end="1743" text="Glyburide" time="after DCT" type1="sulfonylureas" type2="" comment=""/>
    <MEDICATION id="DOC4" time="after DCT" type1="metformin" type2="">
      <MEDICATION id="M8" start="1758" end="1768" text="Glucopahge" time="after DCT" type1="metformin" type2="" comment=""/>
      <MEDICATION id="M9" start="1758" end="1768" text="Glucopahge" time="after DCT" type1="metformin" type2="" comment=""/>
      <MEDICATION id="M10" start="1758" end="1768" text="Glucopahge" time="after DCT" type1="metformin" type2="" comment=""/>
    <MEDICATION id="DOC5" time="during DCT" type1="metformin" type2="">
      <MEDICATION id="M11" start="1758" end="1768" text="Glucopahge" time="during DCT" type1="metformin" type2="" comment=""/>
      <MEDICATION id="M12" start="1758" end="1768" text="Glucopahge" time="during DCT" type1="metformin" type2="" comment=""/>
      <MEDICATION id="M13" start="1758" end="1768" text="Glucopahge" time="during DCT" type1="metformin" type2="" comment=""/>
    <HYPERTENSION id="DOC6" time="during DCT" indicator="high bp">
      <HYPERTENSION id="H0" start="2100" end="2106" text="149/84" time="during DCT" indicator="high bp" comment=""/>
      <HYPERTENSION id="H1" start="828" end="834" text="145/79" time="during DCT" indicator="high bp" comment=""/>
    <MEDICATION id="DOC7" time="before DCT" type1="ACE inhibitor" type2="">
      <MEDICATION id="M14" start="1782" end="1789" text="Zestril" time="before DCT" type1="ACE inhibitor" type2="" comment=""/>
      <MEDICATION id="M15" start="1782" end="1789" text="Zestril" time="before DCT" type1="ACE inhibitor" type2="" comment=""/>
      <MEDICATION id="M16" start="1782" end="1789" text="Zestril" time="before DCT" type1="ACE inhibitor" type2="" comment=""/>
    <SMOKER id="DOC8" status="current">
      <SMOKER id="S0" start="7163" end="7191" text=" smoking cessation consult. " status="current" comment=""/>
      <SMOKER id="S1" start="1965" end="1995" text="has smoked 1/2ppd for 35 years" status="current" comment=""/>
      <SMOKER id="S2" start="1969" end="1995" text="smoked 1/2ppd for 35 years" status="current" comment=""/>
    <MEDICATION id="DOC9" time="before DCT" type1="metformin" type2="">
      <MEDICATION id="M17" start="1758" end="1768" text="Glucopahge" time="before DCT" type1="metformin" type2="" comment=""/>
      <MEDICATION id="M18" start="1758" end="1768" text="Glucopahge" time="before DCT" type1="metformin" type2="" comment=""/>
      <MEDICATION id="M19" start="1758" end="1768" text="Glucopahge" time="before DCT" type1="metformin" type2="" comment=""/>
    <MEDICATION id="DOC10" time="after DCT" type1="ACE inhibitor" type2="">
      <MEDICATION id="M20" start="1782" end="1789" text="Zestril" time="after DCT" type1="ACE inhibitor" type2="" comment=""/>
      <MEDICATION id="M21" start="1782" end="1789" text="Zestril" time="after DCT" type1="ACE inhibitor" type2="" comment=""/>
      <MEDICATION id="M22" start="1782" end="1789" text="Zestril" time="after DCT" type1="ACE inhibitor" type2="" comment=""/>
    <MEDICATION id="DOC11" time="during DCT" type1="sulfonylureas" type2="">
      <MEDICATION id="M23" start="1734" end="1743" text="Glyburide" time="during DCT" type1="sulfonylureas" type2="" comment=""/>
      <MEDICATION id="M24" start="1734" end="1743" text="Glyburide" time="during DCT" type1="sulfonylureas" type2="" comment=""/>
      <MEDICATION id="M25" start="1734" end="1743" text="Glyburide" time="during DCT" type1="sulfonylureas" type2="" comment=""/>
    <DIABETES id="DOC12" time="during DCT" indicator="mention">
      <DIABETES id="D6" start="296" end="299" text="DM2" time="during DCT" indicator="mention" comment=""/>
      <DIABETES id="D7" start="296" end="299" text="DM2" time="during DCT" indicator="mention" comment=""/>
      <DIABETES id="D8" start="1180" end="1183" text="DM2" time="during DCT" indicator="mention" comment=""/>
      <DIABETES id="D9" start="6444" end="6447" text="DM2" time="during DCT" indicator="mention" comment=""/>
      <DIABETES id="D10" start="7195" end="7198" text="DM2" time="during DCT" indicator="mention" comment=""/>
      <DIABETES id="D11" start="296" end="299" text="DM2" time="during DCT" indicator="mention" comment=""/>
    <MEDICATION id="DOC13" time="after DCT" type1="aspirin" type2="">
      <MEDICATION id="M26" start="7111" end="7133" text="adding ASA and lipitor" time="after DCT" type1="aspirin" type2="" comment=""/>
      <MEDICATION id="M27" start="7118" end="7121" text="ASA" time="after DCT" type1="aspirin" type2="" comment=""/>
    <FAMILY_HIST id="DOC14" indicator="not present">
      <FAMILY_HIST id="F0" indicator="not present"/>
      <FAMILY_HIST id="F1" indicator="not present"/>
      <FAMILY_HIST id="F2" indicator="not present"/>
    <DIABETES id="DOC15" time="after DCT" indicator="mention">
      <DIABETES id="D12" start="296" end="299" text="DM2" time="after DCT" indicator="mention" comment=""/>
      <DIABETES id="D13" start="296" end="299" text="DM2" time="after DCT" indicator="mention" comment=""/>
      <DIABETES id="D14" start="1180" end="1183" text="DM2" time="after DCT" indicator="mention" comment=""/>
      <DIABETES id="D15" start="6444" end="6447" text="DM2" time="after DCT" indicator="mention" comment=""/>
      <DIABETES id="D16" start="7195" end="7198" text="DM2" time="after DCT" indicator="mention" comment=""/>
      <DIABETES id="D17" start="296" end="299" text="DM2" time="after DCT" indicator="mention" comment=""/>
    <MEDICATION id="DOC16" time="before DCT" type1="sulfonylureas" type2="">
      <MEDICATION id="M28" start="1734" end="1743" text="Glyburide" time="before DCT" type1="sulfonylureas" type2="" comment=""/>
      <MEDICATION id="M29" start="1734" end="1743" text="Glyburide" time="before DCT" type1="sulfonylureas" type2="" comment=""/>
      <MEDICATION id="M30" start="1734" end="1743" text="Glyburide" time="before DCT" type1="sulfonylureas" type2="" comment=""/>
    <PHI id="P0" start="16" end="26" text="2078-09-07" TYPE="DATE"/>
    <PHI id="P1" start="39" end="54" text="RYBURY HOSPITAL" TYPE="HOSPITAL"/>
    <PHI id="P2" start="88" end="102" text="Goldberg, Joel" TYPE="PATIENT"/>
    <PHI id="P3" start="110" end="117" text="0370149" TYPE="MEDICALRECORD"/>
    <PHI id="P4" start="139" end="147" text="9/6/2078" TYPE="DATE"/>
    <PHI id="P5" start="159" end="164" text="Lange" TYPE="DOCTOR"/>
    <PHI id="P6" start="165" end="171" text="Bailey" TYPE="DOCTOR"/>
    <PHI id="P7" start="184" end="191" text="Schmidt" TYPE="DOCTOR"/>
    <PHI id="P8" start="201" end="212" text="Odom, Kacie" TYPE="DOCTOR"/>
    <PHI id="P9" start="267" end="269" text="48" TYPE="AGE"/>
    <PHI id="P10" start="441" end="445" text="July" TYPE="DATE"/>
    <PHI id="P11" start="1197" end="1201" text="2075" TYPE="DATE"/>
    <PHI id="P12" start="1422" end="1430" text="12/14/76" TYPE="DATE"/>
    <PHI id="P13" start="1474" end="1481" text="9/03/76" TYPE="DATE"/>
    <PHI id="P14" start="1554" end="1561" text="11/6/76" TYPE="DATE"/>
    <PHI id="P15" start="1635" end="1642" text="7/20/76" TYPE="DATE"/>
    <PHI id="P16" start="1671" end="1675" text="4/75" TYPE="DATE"/>
    <PHI id="P17" start="1866" end="1879" text="Arroyo Grande" TYPE="CITY"/>
    <PHI id="P18" start="1927" end="1938" text="copy editor" TYPE="PROFESSION"/>
    <PHI id="P19" start="2717" end="2720" text="RSC" TYPE="HOSPITAL"/>
    <PHI id="P20" start="2740" end="2748" text="09/06/78" TYPE="DATE"/>
    <PHI id="P21" start="5510" end="5521" text="21-Jul-2076" TYPE="DATE"/>
    <PHI id="P22" start="5638" end="5647" text="24-Jul-76" TYPE="DATE"/>
    <PHI id="P23" start="6427" end="6429" text="48" TYPE="AGE"/>
    <PHI id="P24" start="7431" end="7441" text="Ian Jurado" TYPE="DOCTOR"/>
    <PHI id="P25" start="7485" end="7490" text="14558" TYPE="PHONE"/>

Whenever I try to tokenize the text above into sentences, NLTK messes up and lumps together all the phrases it can find before a period (".") as one sentence.

Some lines (really paragraphs) in this file contain multiple sentences. Break up the file into lines, then apply the sentence tokenizer to each line separately. This will prevent merging text from different lines, and will give you much better results than rolling your own regexp-based sentence splitter. For example:

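import nltk  # sent_tokenize needs NLTK's Punkt tokenizer data to be installed

text = file.read()  # `file` is an already-open file object holding the record
lines = text.splitlines()
# Tokenize each line on its own so sentences never span two lines
sentences = [s for line in lines for s in nltk.sent_tokenize(line)]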
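If you also want to set the table rows and short labels aside, one option is to classify each line before tokenizing it. The following is only a rough sketch, not part of NLTK: the names `looks_like_prose` and `split_corpus`, the `MIN_WORDS` threshold, and the three heuristics (column-style runs of spaces, a minimum word count, sentence-final punctuation) are all assumptions chosen for illustration and would need tuning against the real records. The tokenizer is passed in as a callable, so `nltk.sent_tokenize` can be plugged in unchanged.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical threshold: lines with fewer words are treated as labels or phrases.
MIN_WORDS = 5


def looks_like_prose(line: str) -> bool:
    """Guess whether a line is running text rather than a table row or label."""
    stripped = line.strip()
    if not stripped:
        return False
    # A run of 3+ spaces between tokens usually means column alignment.
    if re.search(r"\S {3,}\S", stripped):
        return False
    if len(stripped.split()) < MIN_WORDS:
        return False
    # Assume prose ends with sentence-final punctuation.
    return stripped[-1] in ".?!"


def split_corpus(
    text: str, sent_tokenize: Callable[[str], List[str]]
) -> Tuple[List[str], List[str]]:
    """Return (sentences, fragments); only prose-like lines are tokenized."""
    sentences: List[str] = []
    fragments: List[str] = []
    for line in text.splitlines():
        if looks_like_prose(line):
            sentences.extend(sent_tokenize(line))
        elif line.strip():
            fragments.append(line.strip())
    return sentences, fragments
```

With NLTK installed, this could be called as `sentences, fragments = split_corpus(text, nltk.sent_tokenize)`. Lab rows such as `NA          137` and short entries such as `Skin: No other rashes` would land in `fragments`, while longer lines ending in a period go through the tokenizer. Expect some misclassification at the edges, for example list items that happen to be full sentences.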
