
Split a text file into words using regular expressions in Python

I have read a file named abc.txt.

Now I want to split the text of the file into words using regular expressions, handling these four categories:

  1. "...n't" => "...not"
  2. Abbreviations like "Mme."
  3. Merge stutters like "kk-kick"
  4. Split words at hyphens.

The text of the file abc.txt is:

 **THE WIND IN THE WILLOWS BY KENNETH GRAHAME CONTENTS CHAPTER I. THE RIVER BANK II. THE OPEN ROAD III. THE WILD WOOD IV. MR. BADGER V. DULCE DOMUM VI. MR. TOAD VII. THE PIPER AT THE GATES OF DAWN VIII. TOAD'S ADVENTURES IX. WAYFARERS ALL X. THE FURTHER ADVENTURES OF TOAD XI. "LIKE SUMMER TEMPESTS CAME HIS TEARS" XII. THE RETURN OF ULYSSES

I. THE RIVER BANK

The Mole had been working very hard all the morning, spring-cleaning his little home. First with brooms, then with dusters; then on ladders and steps and chairs, with a brush and a pail of whitewash; till he had dust in his throat and eyes, and splashes of whitewash all over his black fur, and an aching back and weary arms. Spring was moving in the air above and in the earth below and around him, penetrating even his dark and lowly little house with its spirit of divine discontent and longing. It was small wonder, then, that he suddenly flung down his brush on the floor, said 'Bother!' and 'O blow!' and also 'Hang spring-cleaning!' and bolted out of the house without even waiting to put on his coat.**

What I have tried is:

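import re
# (pattern, replacement) pairs; raw strings so that \b and \1 reach the regex engine intact
RE = ((r"([a-z])n’t\b", r"\1not"), (r"\bma’a?m\b", "madam"), (r"W([a-z])-([a-z])", r"\1\2"), (r"-+", " "))
# Read the whole file into a single string
with open("abc.txt", "r") as f:
    W = f.read()
W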
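Note that the snippet above only defines the rule table and reads the file; nothing applies the rules to the text yet. A minimal sketch of how such a table could be applied with `re.sub` follows. The name `normalize` and the exact patterns are assumptions for illustration, not the original rules: in particular, the stutter rule guesses that "kk-kick" should become "kick", and the "n't" rule is naive (it would turn "can't" into "ca not").

```python
import re

# Hypothetical rule table: (compiled pattern, replacement) pairs, applied in order.
RULES = (
    # "didn't" -> "did not"; accepts a straight or a curly apostrophe
    (re.compile(r"(?<=[a-z])n['’]t\b", re.IGNORECASE), " not"),
    # Stutters: "k-kick" or "kk-kick" -> "kick"
    # (a repeated first letter, a hyphen, then a word starting with that letter)
    (re.compile(r"\b([a-z])\1*-(?=\1)", re.IGNORECASE), ""),
    # Any remaining hyphens separate words: "spring-cleaning" -> "spring cleaning"
    (re.compile(r"-+"), " "),
)

def normalize(text):
    """Apply every rule in RULES to text, in order, and return the result."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Order matters here: the stutter rule has to run before the generic hyphen rule, otherwise "kk-kick" would already have been split into "kk kick".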

Now I am getting this output:

[image: enter image description here]

What I am expecting is:

[image: enter image description here]

Try using the re.split method:

# Import regular expression operations
import re

# Text from the file
text = """** THE WIND IN THE WILLOWS
    BY KENNETH GRAHAME
    CONTENTS

    CHAPTER
    I.THE RIVER BANK
    II.THE OPEN ROAD
    III.THE WILD WOOD
    IV.MR.BADGER
    V.DULCE DOMUM
    VI.MR.TOAD
    VII.THE PIPER AT THE GATES OF DAWN
    VIII.TOAD'S ADVENTURES
    IX.WAYFARERS ALL
    X.THE FURTHER ADVENTURES OF TOAD
    XI."LIKE SUMMER TEMPESTS CAME HIS TEARS"
    XII.THE RETURN OF ULYSSES

    I.THE RIVER BANK"""

# Split text wherever one-or-more non-word characters occur
words = re.split(r'\W+', text)

which gives the result:

In [1]: words
Out[1]: ['',  'THE',  'WIND',  'IN',  'THE',  'WILLOWS',  'BY',  'KENNETH',  'GRAHAME',  'CONTENTS',  'CHAPTER',  'I',  'THE',  'RIVER',  'BANK',  'II',  'THE',  'OPEN',  'ROAD',  'III',  'THE',  'WILD',  'WOOD',  'IV',  'MR',  'BADGER',  'V',  'DULCE',  'DOMUM',  'VI',  'MR',  'TOAD',  'VII',  'THE',  'PIPER',  'AT',  'THE',  'GATES',  'OF',  'DAWN',  'VIII',  'TOAD',  'S',  'ADVENTURES',  'IX',  'WAYFARERS',  'ALL',  'X',  'THE',  'FURTHER',  'ADVENTURES',  'OF',  'TOAD',  'XI',  'LIKE',  'SUMMER',  'TEMPESTS',  'CAME',  'HIS',  'TEARS',  'XII',  'THE',  'RETURN',  'OF',  'ULYSSES',  'I',  'THE',  'RIVER',  'BANK']
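A few things are worth noticing in this result: `\W+` also splits on apostrophes and periods, so TOAD'S becomes 'TOAD', 'S' and MR. loses its period, and the leading `**` produces an empty first element. If that matters, `re.findall` with a pattern that describes the tokens themselves may be a closer fit. The sketch below assumes a small hand-picked abbreviation list; `ABBREVS`, `TOKEN` and `tokenize` are hypothetical names, not part of the question.

```python
import re

# Assumed abbreviation list; extend it for the text at hand.
ABBREVS = ("Mme", "Mr", "Mrs", "Dr")

TOKEN = re.compile(
    r"\b(?:" + "|".join(ABBREVS) + r")\."  # abbreviations keep their period
    r"|\w+(?:'\w+)*",                       # words, with inner apostrophes kept
    re.IGNORECASE,
)

def tokenize(text):
    """Return the list of tokens found in text; separators are simply skipped."""
    return TOKEN.findall(text)
```

Because hyphens are not matched by `\w`, a word such as "spring-cleaning" still comes out as two tokens, and no empty strings appear since only actual matches are returned.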
