
Cleaning a column using regex: remove characters based on conditions

I am trying to figure out how I would deal with the following situation:

I have raw data that was entered manually and contains several unnecessary characters, and I need to clean the column.

Anything after a symbol such as (-,/,,.#) should be removed if it has fewer than 5 letters.

Raw data

NYC USA - LND UK

GBKTG-U

DUB AE- EUUSA

USA -TY

SG !S

CNZOS !C SEA

GAGAX"T

AEU DGR# UK,GBR

Desired Output

LND UK

GBKTG

EUUSA

USA

SG

CNZOS

GAGAX

UK GBR

Split each line into origin and destination using the regex groups, adjusting the separator ( [^\w\s] ) as needed. Then count the letters on the right side of the separator symbol and check whether they reach the stated number of letters.

Details:

  • (.*?) : capture group - zero or more characters (except line endings), non-greedy
  • [^\w\s] : followed by any character that is not a letter, digit, underscore ([a-zA-Z0-9_]) or whitespace
  • (.*) : capture group - zero or more characters (except line endings)
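
To make the three parts concrete, here is a small sketch that applies the same pattern to one line of the sample data and shows what each group captures. The variable names (pattern, left, right, letters) are only for illustration.

```python
import re

# Same pattern as described above: lazy left part, one separator symbol, greedy right part.
pattern = re.compile(r"(.*?)[^\w\s](.*)")

m = pattern.match("NYC USA - LND UK")
left, right = m.group(1), m.group(2)
print(repr(left))   # 'NYC USA '
print(repr(right))  # ' LND UK'

# Count only the letters on the right side; spaces and symbols are ignored.
letters = sum(ch.isalpha() for ch in right)
print(letters)  # 5
```

Because the first group is non-greedy, the split happens at the first separator symbol on the line; the right side here has 5 letters, so it is the part that would be kept.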

File sample.txt used as input

NYC USA - LND UK
GBKTG-U
DUB AE- EUUSA
USA -TY
SG !S
CNZOS !C SEA
GAGAX"T
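AEU DGR# UK,GBR

import re

# Read the whole file; "." does not match newlines, so each match stays on one line.
with open("sample.txt", "r") as f:
    txt = f.read()

dest = []
# Group 1 is the text before the first separator symbol, group 2 is the rest of the line.
for left, right in re.findall(r"(.*?)[^\w\s](.*)", txt):
    # Keep the right side only if it contains at least 5 letters.
    if sum(ch.isalpha() for ch in right) >= 5:
        dest.append(right.strip())
    else:
        dest.append(left.strip())

print(dest)

Output:

['LND UK', 'GBKTG', 'EUUSA', 'USA', 'SG', 'CNZOS', 'GAGAX', 'UK,GBR']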
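
Note that the last item is still 'UK,GBR', while the desired output in the question shows 'UK GBR'. A possible variation, sketched below, wraps the same rule in a per-value function and replaces any leftover symbol with a space. The function name clean_value, the min_letters parameter, and the choice to turn leftover symbols into spaces are assumptions made for this example, not part of the original answer.

```python
import re

def clean_value(value, min_letters=5):
    """Hypothetical helper: apply the rule above to a single string."""
    m = re.match(r"(.*?)[^\w\s](.*)", value)
    if m is None:
        # No separator symbol found, so there is nothing to remove.
        return value.strip()
    left, right = m.groups()
    # Keep the right side only if it has at least min_letters letters.
    kept = right if sum(ch.isalpha() for ch in right) >= min_letters else left
    # Assumption: any symbol still present in the kept part becomes a space,
    # then runs of whitespace are collapsed.
    return " ".join(re.sub(r"[^\w\s]+", " ", kept).split())

rows = ["NYC USA - LND UK", "GBKTG-U", "AEU DGR# UK,GBR", "USA"]
cleaned = [clean_value(r) for r in rows]
print(cleaned)  # ['LND UK', 'GBKTG', 'UK GBR', 'USA']
```

Working on one value at a time also handles rows with no separator symbol, which the findall approach silently skips. If the data lives in a pandas column, something like df["col"].map(clean_value) should apply it row by row (the column name is hypothetical).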

