
Split a text file into words using regular expressions in Python

I have read a file named abc.txt.

Now I want to split the text of the file into words using regular expressions, handling these four categories:

  1. "...n't"=>"...not"
  2. Abbrevs like Mme.?
  3. Merge stutters like kk-kick
  4. Split words at hyphens.
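
One possible reading of these four rules, sketched as ordered substitutions followed by a tokenizing pattern. The names `ABBREVIATIONS`, `SUBSTITUTIONS`, `TOKEN` and `tokenize` are made up for this illustration, the abbreviation list is a guess to be extended for the actual corpus, and the sketch inserts a space before "not" so that the result is two tokens (drop the space for the literal "...not" form).

```python
import re

# Hypothetical abbreviation list; extend it for the corpus at hand.
ABBREVIATIONS = ("Mme", "Mrs", "Mr", "Dr")

# Ordered (pattern, replacement) pairs applied before tokenizing.
SUBSTITUTIONS = (
    # 1. "...n't" => "... not" (straight or typographic apostrophe)
    (re.compile(r"(\w)n['’]t\b", re.IGNORECASE), r"\1 not"),
    # 3. Merge stutters: drop a repeated-letter prefix such as "kk-" in "kk-kick"
    (re.compile(r"\b([a-z])\1*-(?=\1)", re.IGNORECASE), ""),
    # 4. Split at the remaining hyphens by turning them into spaces
    (re.compile(r"-+"), " "),
)

# 2. A known abbreviation keeps its trailing period; anything else is a plain word.
TOKEN = re.compile(
    r"\b(?:%s)\.|\w+" % "|".join(map(re.escape, ABBREVIATIONS)),
    re.IGNORECASE,
)


def tokenize(text):
    """Apply the substitutions in order, then extract the tokens."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return TOKEN.findall(text)
```

The order matters in this sketch: stutters are merged before hyphens are turned into spaces, otherwise "kk-kick" would already have been split into "kk" and "kick".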

The text of the file abc.txt is:

 **THE WIND IN THE WILLOWS BY KENNETH GRAHAME CONTENTS CHAPTER I. THE RIVER BANK II. THE OPEN ROAD III. THE WILD WOOD IV. MR. BADGER V. DULCE DOMUM VI. MR. TOAD VII. THE PIPER AT THE GATES OF DAWN VIII. TOAD'S ADVENTURES IX. WAYFARERS ALL X. THE FURTHER ADVENTURES OF TOAD XI. "LIKE SUMMER TEMPESTS CAME HIS TEARS" XII. THE RETURN OF ULYSSES

I. THE RIVER BANK

The Mole had been working very hard all the morning, spring-cleaning his little home. First with brooms, then with dusters; then on ladders and steps and chairs, with a brush and a pail of whitewash; till he had dust in his throat and eyes, and splashes of whitewash all over his black fur, and an aching back and weary arms. Spring was moving in the air above and in the earth below and around him, penetrating even his dark and lowly little house with its spirit of divine discontent and longing. It was small wonder, then, that he suddenly flung down his brush on the floor, said 'Bother!' and 'O blow!' and also 'Hang spring-cleaning!' and bolted out of the house without even waiting to put on his coat.**

What I have tried:

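import re
# Raw strings keep \b and \1 as regex syntax rather than string escapes
RE = ((r"([a-z])n['’]t\b", r"\1not"), (r"\bma['’]a?m\b", "madam"),
      (r"W([a-z])-([a-z])", r"\1\2"), (r"-+", " "))
# Read the whole file into a single string
with open("abc.txt", "r") as f:
    W = f.read()
W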
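
One thing worth checking in a substitution table like this is whether the strings are raw. In an ordinary Python string literal, "\b" is a backspace character and "\1" is the character with code 1, so neither reaches the regex engine as a word boundary or a group reference. A small demonstration, using a made-up sample sentence:

```python
import re

sample = "he didn't go"

# Non-raw literals: "\b" becomes a backspace and "\1" becomes chr(1),
# so the pattern never matches and the text is returned unchanged.
broken = re.sub("([a-z])n't\b", "\1not", sample)

# Raw literals: \b is a word boundary and \1 refers to the first group.
fixed = re.sub(r"([a-z])n't\b", r"\1not", sample)

print(repr(broken))
print(repr(fixed))
```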

This is the output I am currently getting:

[screenshot of the current output; image not included]

What I am expecting:

[screenshot of the expected output; image not included]

Try using the re.split function:

# Import regular expression operations
import re

# Text from the file
text = """** THE WIND IN THE WILLOWS
    BY KENNETH GRAHAME
    CONTENTS

    CHAPTER
    I.THE RIVER BANK
    II.THE OPEN ROAD
    III.THE WILD WOOD
    IV.MR.BADGER
    V.DULCE DOMUM
    VI.MR.TOAD
    VII.THE PIPER AT THE GATES OF DAWN
    VIII.TOAD'S ADVENTURES
    IX.WAYFARERS ALL
    X.THE FURTHER ADVENTURES OF TOAD
    XI."LIKE SUMMER TEMPESTS CAME HIS TEARS"
    XII.THE RETURN OF ULYSSES

    I.THE RIVER BANK"""

# Split text wherever one-or-more non-word characters occur
words = re.split(r'\W+', text)

which gives the following result:

In [1]: words
Out[1]: ['',  'THE',  'WIND',  'IN',  'THE',  'WILLOWS',  'BY',  'KENNETH',  'GRAHAME',  'CONTENTS',  'CHAPTER',  'I',  'THE',  'RIVER',  'BANK',  'II',  'THE',  'OPEN',  'ROAD',  'III',  'THE',  'WILD',  'WOOD',  'IV',  'MR',  'BADGER',  'V',  'DULCE',  'DOMUM',  'VI',  'MR',  'TOAD',  'VII',  'THE',  'PIPER',  'AT',  'THE',  'GATES',  'OF',  'DAWN',  'VIII',  'TOAD',  'S',  'ADVENTURES',  'IX',  'WAYFARERS',  'ALL',  'X',  'THE',  'FURTHER',  'ADVENTURES',  'OF',  'TOAD',  'XI',  'LIKE',  'SUMMER',  'TEMPESTS',  'CAME',  'HIS',  'TEARS',  'XII',  'THE',  'RETURN',  'OF',  'ULYSSES',  'I',  'THE',  'RIVER',  'BANK']
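
The empty string at the start of that list appears because the text begins with non-word characters (`**`), so re.split finds a separator before the first word. Note also that `TOAD'S` is split into `TOAD` and `S`. If either is unwanted, one option is to filter out empty strings, or to match words with re.findall instead of splitting on separators. A minimal sketch on a shortened sample; the apostrophe-aware pattern is only one possible choice:

```python
import re

# Shortened sample with the same leading "**" and an apostrophe
text = "** THE WIND IN THE WILLOWS\nVIII.TOAD'S ADVENTURES"

# Splitting on separators leaves an empty first element
split_words = re.split(r"\W+", text)

# Option 1: drop empty strings after splitting
nonempty = [w for w in split_words if w]

# Option 2: match the words directly, keeping an inner apostrophe
found = re.findall(r"\w+(?:'\w+)?", text)
```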
