I have read a file named abc.txt.
Now I want to split the text of the file into words of these four categories using regular expressions.
The text of abc.txt is:
**THE WIND IN THE WILLOWS BY KENNETH GRAHAME CONTENTS CHAPTER I. THE RIVER BANK II. THE OPEN ROAD III. THE WILD WOOD IV. MR. BADGER V. DULCE DOMUM VI. MR. TOAD VII. THE PIPER AT THE GATES OF DAWN VIII. TOAD'S ADVENTURES IX. WAYFARERS ALL X. THE FURTHER ADVENTURES OF TOAD XI. "LIKE SUMMER TEMPESTS CAME HIS TEARS" XII. THE RETURN OF ULYSSES
I. THE RIVER BANK
The Mole had been working very hard all the morning, spring-cleaning his little home. First with brooms, then with dusters; then on ladders and steps and chairs, with a brush and a pail of whitewash; till he had dust in his throat and eyes, and splashes of whitewash all over his black fur, and an aching back and weary arms. Spring was moving in the air above and in the earth below and around him, penetrating even his dark and lowly little house with its spirit of divine discontent and longing. It was small wonder, then, that he suddenly flung down his brush on the floor, said 'Bother!' and 'O blow!' and also 'Hang spring-cleaning!' and bolted out of the house without even waiting to put on his coat.**
What I have tried is:
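import re
# Raw strings keep \b and \1 as regex syntax instead of Python escape characters
RE = ((r"([a-z])n’t\b", r"\1not"), (r"\bma’a?m\b", "madam"), (r"W([a-z])-([a-z])", r"\1\2"), (r"-+", " "))
# Read the whole file; the with block closes it afterwards
with open("abc.txt", "r") as f:
    W = f.read()
W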
This is the output I am getting:
This is what I am expecting:
Try using the re.split method:
# Import regular expression operations
import re
# Text from the file
text = """** THE WIND IN THE WILLOWS
BY KENNETH GRAHAME
CONTENTS
CHAPTER
I.THE RIVER BANK
II.THE OPEN ROAD
III.THE WILD WOOD
IV.MR.BADGER
V.DULCE DOMUM
VI.MR.TOAD
VII.THE PIPER AT THE GATES OF DAWN
VIII.TOAD'S ADVENTURES
IX.WAYFARERS ALL
X.THE FURTHER ADVENTURES OF TOAD
XI."LIKE SUMMER TEMPESTS CAME HIS TEARS"
XII.THE RETURN OF ULYSSES
I.THE RIVER BANK"""
# Split text wherever one-or-more non-word characters occur
words = re.split(r'\W+', text)
which gives the following result:
In [1]: words
Out[1]: ['', 'THE', 'WIND', 'IN', 'THE', 'WILLOWS', 'BY', 'KENNETH', 'GRAHAME', 'CONTENTS', 'CHAPTER', 'I', 'THE', 'RIVER', 'BANK', 'II', 'THE', 'OPEN', 'ROAD', 'III', 'THE', 'WILD', 'WOOD', 'IV', 'MR', 'BADGER', 'V', 'DULCE', 'DOMUM', 'VI', 'MR', 'TOAD', 'VII', 'THE', 'PIPER', 'AT', 'THE', 'GATES', 'OF', 'DAWN', 'VIII', 'TOAD', 'S', 'ADVENTURES', 'IX', 'WAYFARERS', 'ALL', 'X', 'THE', 'FURTHER', 'ADVENTURES', 'OF', 'TOAD', 'XI', 'LIKE', 'SUMMER', 'TEMPESTS', 'CAME', 'HIS', 'TEARS', 'XII', 'THE', 'RETURN', 'OF', 'ULYSSES', 'I', 'THE', 'RIVER', 'BANK']
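Note that the result above starts with an empty string (the text begins with non-word characters) and that TOAD'S is split into 'TOAD' and 'S'. If that is not what you want, one possible alternative is to match the words with re.findall instead of splitting on the separators. The following is only a sketch: the sample string, the WORD pattern and the tokenize helper are my own illustrative names, and the pattern assumes a word is a run of ASCII letters that may contain internal apostrophes or hyphens.

```python
import re

# Short sample built from phrases in the question's text (illustrative only).
sample = "** VIII. TOAD'S ADVENTURES\nThe Mole had been spring-cleaning his little home."

# Assumption: a word is a run of letters, optionally joined by single
# internal apostrophes or hyphens (TOAD'S, spring-cleaning).
WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def tokenize(text):
    """Return every match of WORD in text, in order of appearance."""
    return WORD.findall(text)

words = tokenize(sample)
print(words)
```

Because findall returns only what matches, there is no leading empty string, and quotation marks around a word (as in 'Bother!') are left out since the pattern has to start and end on a letter. To run it on the file, pass the result of reading abc.txt to tokenize instead of sample.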