
Python3 and regex: how to remove lines of numbers?

I have a long text file converted from a PDF, and I want to remove certain things from it, such as page numbers that appear on a line by themselves, possibly surrounded by spaces. I made a regex that works on short strings, e.g.:

import re

news1 = 'Hello done.\n4\nNext paragraph.'
m = re.sub('\n *[0-9] *\n', ' ', news1)
print(m)
Hello done. Next paragraph.

But when I try this on more complex strings, it fails, e.g.:

news = '1   \n  Hello done. \n 4 \n  44 \n  Next paragraph.'
m = re.sub('\n *[0-9] *\n', ' ', news)
print(m)
1   
  Hello done.    44 
  Next paragraph.

How do I make this work across the entire file? Should I read the file line by line and handle each line separately, rather than trying to edit the whole string?

I've also tried using '.' in place of the newlines to match any character, but that doesn't catch the initial '1' in the more complex string. So I guess I could use two regexes.

m = re.sub('. *[0-9] *.', '', news)
print(m)
1   
  Hello done. 


  Next paragraph.

Thoughts?

I would recommend doing it line by line unless you have a specific reason to slurp the whole file in as a single string. Then a few regexes applied to each line will clean it all up, like:

import re

# text holds one line of the file
# not sure how the pages are numbered, but perhaps...
text = re.sub(r"^\s*\d+\s*$", "", text)

# strip out anything in all caps of at least 3 letters
text = re.sub(r"[A-Z]{3,}", "", text)

# collapse runs of whitespace to 1 space, handy to clean up the data
text = re.sub(r"\s+", " ", text)

# trim the start and end of the line
text = text.strip()
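
Put together, the steps above might look like the following sketch, run against the sample string from the question. The function name clean_lines and the decision to drop lines that end up empty are assumptions made for illustration; they are not part of the original answer.

```python
import re

def clean_lines(lines):
    """Yield cleaned lines, skipping any that end up empty."""
    for line in lines:
        # Blank out lines that hold nothing but digits and whitespace.
        line = re.sub(r"^\s*\d+\s*$", "", line)
        # Remove runs of three or more capital letters.
        line = re.sub(r"[A-Z]{3,}", "", line)
        # Collapse whitespace runs to one space and trim both ends.
        line = re.sub(r"\s+", " ", line).strip()
        if line:  # assumption: lines that are now empty are dropped
            yield line

news = '1   \n  Hello done. \n 4 \n  44 \n  Next paragraph.'
cleaned = " ".join(clean_lines(news.splitlines()))
print(cleaned)  # Hello done. Next paragraph.
```

For a real file, the same generator can be fed an open file object instead of news.splitlines(), since iterating over a file yields one line at a time.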

It's just one strategy, but that's the route I would go with: it's easy to maintain down the road when your business side comes up with "OH OH! Can you also replace any mention of 'Cat' with 'Dog'?" I think it's also easier to troubleshoot and log your changes. Maybe even try using re.subn to track changes?
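
As a rough illustration of the re.subn idea, the sketch below counts how many page-number lines were removed. It works on the whole string with re.MULTILINE, which departs from the line-by-line advice and is shown only for comparison. The question's pattern '\n *[0-9] *\n' misses '44' for two reasons: [0-9] matches a single digit, and each match consumes the '\n' that the next number line would need as its own leading '\n'. Anchoring with ^ and $ avoids both. The variable names here are assumptions for illustration.

```python
import re

news = '1   \n  Hello done. \n 4 \n  44 \n  Next paragraph.'

# Under re.MULTILINE, ^ and $ match at the start and end of every line.
# [ \t]* is used instead of \s* so a match cannot spill onto other lines;
# the optional \n removes the line break together with the number.
page_number = re.compile(r"^[ \t]*\d+[ \t]*$\n?", re.MULTILINE)

# subn returns the new string and the number of replacements made.
text, removed = page_number.subn("", news)
print(removed)     # 3
print(repr(text))  # '  Hello done. \n  Next paragraph.'
```

Logging the count for each rule makes it easier to spot a pattern that matches far more, or far fewer, lines than expected.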

