
Cleaning text using regex library not working properly

I have a text that I need to clean for further processing.

Here is the sample text:

Nigel Reuben Rook Williams (15 July 1944 – 21 April 1992) was an English conservator and expert on the restoration of ceramics and glass. From 1961 until his death he worked at the British Museum, where he became the Chief Conservator of Ceramics and Glass in 1983. There his work included the successful restorations of the Sutton Hoo helmet and the Portland Vase.

Joining as an assistant at age 16, Williams spent his entire career, and most of his life, at the British Museum. He was one of the first people to study conservation, not yet recognised as a profession, and from an early age was given responsibility over high-profile objects. In the 1960s he assisted with the re-excavation of the Sutton Hoo ship-burial, and in his early- to mid-twenties he conserved many of the objects found therein: most notably the Sutton Hoo helmet, which occupied a year of his time. He likewise reconstructed other objects from the find, including the shield, drinking horns, and maplewood bottles.

The "abiding passion of his life" was ceramics,[4] and the 1970s and 1980s gave Williams ample opportunities in that field. After nearly 31,000 fragments of shattered Greek vases were found in 1974 amidst the wreck of HMS Colossus, Williams set to work piecing them together. The process was televised, and turned him into a television personality. A decade later, in 1988 and 1989, Williams's crowning achievement came when he took to pieces the Portland Vase, one of the most famous glass objects in the world, and put it back together. The reconstruction was again televised for a BBC programme, and as with the Sutton Hoo helmet, took nearly a year to complete.

I need to:

  • split the text into sentences (at the full stop symbol '.'), removing the full stop itself

  • split the sentences into words (Latin alphabet letters only); every other symbol should be replaced by a space, and the words should be separated by single spaces

  • convert all text to lowercase

I'm on a Mac, and this is the code I'm running:

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import re

fread = open('source.txt')
fwrite = open('result.txt','w+')

for line in fread:
    new_line = line    
    # split the text into sentences
    new_line = re.sub(r"\."  , "\r", new_line)

    # change all uppercase letters to lowercase
    new_line = new_line.lower()

    # only latin letters 
    new_line = re.sub("[^a-z\s]", " ", new_line)

    # The words should be separated by single spaces.
    new_line = re.sub(r" +"," ", new_line)

    # Getting rid of space in the beginning of the sentence 
    new_line = re.sub(r"ˆ\s+", "", new_line)
    fwrite.write(new_line)

fread.close()
fwrite.close()

The result was not quite as expected. The spaces at the beginning of each line were not deleted. I ran the same code on a Windows machine and noticed that the full stop was sometimes replaced by one character and sometimes by a different one, so I'm not sure what is happening.

Here is a sample of the result. Since leading spaces were not shown on Stack Overflow, I had to format the text as code:

nigel reuben rook williams july april was an english conservator and expert on the restoration of ceramics and glass
 from until his death he worked at the british museum where he became the chief conservator of ceramics and glass in 
 there his work included the successful restorations of the sutton hoo helmet and the portland vase

joining as an assistant at age williams spent his entire career and most of his life at the british museum
 he was one of the first people to study conservation not yet recognised as a profession and from an early age was given responsibility over high profile objects
 in the s he assisted with the re excavation of the sutton hoo ship burial and in his early to mid twenties he conserved many of the objects found therein most notably the sutton hoo helmet which occupied a year of his time
 he likewise reconstructed other objects from the find including the shield drinking horns and maplewood bottles

the abiding passion of his life was ceramics and the s and s gave williams ample opportunities in that field
 after nearly fragments of shattered greek vases were found in amidst the wreck of hms colossus williams set to work piecing them together
 the process was televised and turned him into a television personality
 a decade later in and williams s crowning achievement came when he took to pieces the portland vase one of the most famous glass objects in the world and put it back together
 the reconstruction was again televised for a bbc programme and as with the sutton hoo helmet took nearly a year to complete

Some characters may not appear here the way I see them; for instance, before "joining" I see two ?? in TextWrangler.

Using the lstrip() function works to delete the spaces in the beginning of each sentence, by the way.

Why doesn't new_line = re.sub(r"ˆ\s+", "", new_line) work?

I suspect that the '\n' that marks the end of each line is causing problems.

# split the sentences into words 
new_line = re.sub("[^a-z\s]", " ", new_line)

This isn't doing what the comment says. It's actually replacing all non-letter, non-space characters with a space, which is why your output is missing numbers and punctuation.

# Getting rid of space in the beginning of the sentence 
new_line = re.sub(r"ˆ\s+", "", new_line)

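The character at the front of that regex is not the beginning-of-line anchor ^ (the ASCII caret, U+005E). It looks like ˆ (U+02C6, modifier letter circumflex accent), which the regex engine treats as an ordinary literal character, so the pattern never matches.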
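One way to check which character a pattern really contains is to ask the standard-library unicodedata module for its name. The sketch below assumes the pasted character is U+02C6; the sample strings are made up for illustration.

```python
import re
import unicodedata

lookalike = "\u02c6"  # the character at the front of the question's pattern
caret = "^"           # the ASCII start-of-string anchor

print(unicodedata.name(lookalike))  # MODIFIER LETTER CIRCUMFLEX ACCENT
print(unicodedata.name(caret))      # CIRCUMFLEX ACCENT

sample = " from until his death"  # made-up line with one leading space

# The lookalike is matched literally, so nothing in the sample matches.
unchanged = re.sub(lookalike + r"\s+", "", sample)

# The real anchor strips the leading whitespace.
stripped = re.sub(r"^\s+", "", sample)

# Without re.MULTILINE, ^ matches only at the very start of the string;
# with it, ^ also matches right after each "\n".
two_lines = "first sentence\n second sentence"
per_line = re.sub(r"^ +", "", two_lines, flags=re.MULTILINE)

print(repr(unchanged))
print(repr(stripped))
print(repr(per_line))
```

Note that the question's loop replaces each full stop with "\r", and re.MULTILINE only treats "\n" as a line boundary. So even with the correct caret, the spaces that follow each "\r" would stay unless the sentences are separated with "\n" or stripped one by one.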

A few remarks:

  1. Use a context manager for the input and output files; it closes them automatically when the block ends.

  2. Your regex contains the wrong character, as John Gordon says.

  3. I recommend using a regex visualization tool (e.g. https://jex.im/regulex/ ).

  4. To replace a whole run of unwanted characters with a single space, add the plus quantifier: [^a-z]+ means "one or more non-letter characters".
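As a small illustration of point 4, the sketch below compares the pattern with and without the plus quantifier. The sample string is made up and assumed to be lowercase already.

```python
import re

sample = "high-profile objects, 1960s"  # hypothetical, already lowercase

# Without "+", every unwanted character becomes its own space,
# so runs such as ", 1960" leave several spaces in a row.
one_by_one = re.sub(r"[^a-z]", " ", sample)

# With "+", each run of unwanted characters becomes exactly one space.
collapsed = re.sub(r"[^a-z]+", " ", sample)

print(repr(one_by_one))
print(repr(collapsed))
```

With the quantifier in place, the separate "collapse multiple spaces" step from the question is no longer needed.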

So here is the final code snippet I came up with:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# It's better to use a context manager to read files.
# You don't have to close the file explicitly after reading.
with open('./source.txt', 'r') as source:
    # Read the whole file at once and lowercase it right away.
    text = source.read().lower()

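# Keep only Latin letters and full stops; every other run of characters
# (including line breaks) collapses into a single space.
text = re.sub(r"[^a-z.]+", " ", text)

# Replace each full stop, together with the spaces around it, with a newline,
# so that no sentence starts with a space.
text = re.sub(r" *\. *", "\n", text)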

with open('./result.txt', 'w+') as output:
    output.write(text)
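If the sentences are needed as separate strings rather than as one block of text, the same three requirements can be sketched without any substitution at all. The function name clean_text is made up for this example, and the sketch assumes that every '.' ends a sentence, so abbreviations such as "e.g." would be split as well.

```python
import re


def clean_text(text):
    """Return lowercase sentences made of a-z words joined by single spaces."""
    sentences = []
    for raw in text.lower().split("."):
        # findall keeps only runs of Latin letters; everything else is dropped.
        words = re.findall(r"[a-z]+", raw)
        if words:  # skip empty pieces, e.g. after the final full stop
            sentences.append(" ".join(words))
    return sentences


sample = ('The "abiding passion of his life" was ceramics,[4] and the 1970s '
          'gave him ample opportunities. The process was televised.')

print("\n".join(clean_text(sample)))
```

Because the words are joined with " ".join, no sentence can start or end with a space, so no stripping step is required.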
