Cleaning a column using regex: remove characters based on conditions

Question

I am trying to figure out how I would deal with the following situation:

I have raw data that was entered manually and contains several unnecessary characters, and I need to clean the column.

Anything after a symbol (such as - / , . #) should be removed if it has fewer than 5 letters.

Raw data

NYC USA - LND UK

GBKTG-U

DUB AE- EUUSA

USA -TY

SG !S

CNZOS !C SEA

GAGAX"T

AEU DGR# UK,GBR

Desired Output

LND UK

GBKTG

EUUSA

USA

SG

CNZOS

GAGAX

UK GBR

Answer 1

Split each line into origin and destination using regex groups, adjusting the separator ( [^\w\s] ) as needed. Then count the letters on the right side of the separator symbol and check them against the stated minimum of 5 letters: if the right side has at least 5 letters, keep it; otherwise keep the left side.

Details:

(.*?) : capture group - zero or more characters (except line endings), non-greedy
[^\w\s] : followed by any single character that is not a letter, digit, underscore ([a-zA-Z0-9_]) or whitespace
(.*) : capture group - zero or more characters (except line endings)
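To see what the two capture groups hold, here is a minimal sketch that applies the same pattern to single strings. The helper name split_once is only for illustration and is not part of the answer's script.

```python
import re

# Same pattern as described above: lazy left group, one separator, greedy right group.
PATTERN = re.compile(r"(.*?)[^\w\s](.*)")

def split_once(line):
    """Return (left, right) split at the first separator, or None if there is none."""
    m = PATTERN.match(line)
    return m.groups() if m else None

for line in ["NYC USA - LND UK", "GBKTG-U", "AEU DGR# UK,GBR"]:
    print(split_once(line))

# ('NYC USA ', ' LND UK')
# ('GBKTG', 'U')
# ('AEU DGR', ' UK,GBR')
```

Because the left group is non-greedy, the split happens at the first separator on the line; any later separator (the comma in the last row) stays inside the right group.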

File sample.txt used as input

NYC USA - LND UK
GBKTG-U
DUB AE- EUUSA
USA -TY
SG !S
CNZOS !C SEA
GAGAX"T
AEU DGR# UK,GBR

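import re

# Read the whole file; "." never matches a newline, so each match stays on one line
with open("sample.txt", "r") as fh:
    txt = fh.read()

dest = []
# Each match is a (left, right) pair split at the first separator on the line
for left, right in re.findall(r"(.*?)[^\w\s](.*)", txt):
    # Keep the right side only if it contains at least 5 letters
    if sum(ch.isalpha() for ch in right) >= 5:
        dest.append(right.strip())
    else:
        dest.append(left.strip())

print(dest)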

['LND UK', 'GBKTG', 'EUUSA', 'USA', 'SG', 'CNZOS', 'GAGAX', 'UK,GBR']
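Note that the last value comes out as 'UK,GBR' while the desired output shows 'UK GBR'. One possible way to close that gap, as a sketch only: wrap the same logic in a function that works on one value at a time and replaces any separator left in the kept part with a space. The names clean_value and MIN_LETTERS are assumptions made for this example.

```python
import re

SPLIT = re.compile(r"(.*?)[^\w\s](.*)")
MIN_LETTERS = 5  # threshold taken from the question

def clean_value(value):
    """Keep the right side of the first separator if it has enough letters, else the left."""
    m = SPLIT.match(value)
    if m is None:
        # No separator at all: nothing to remove
        return value.strip()
    left, right = m.groups()
    letters = sum(ch.isalpha() for ch in right)
    kept = right if letters >= MIN_LETTERS else left
    # Replace any remaining separators with a space, then collapse whitespace
    kept = re.sub(r"[^\w\s]", " ", kept)
    return " ".join(kept.split())

raw = ["NYC USA - LND UK", "GBKTG-U", "DUB AE- EUUSA", "USA -TY",
       "SG !S", "CNZOS !C SEA", 'GAGAX"T', "AEU DGR# UK,GBR"]
print([clean_value(v) for v in raw])
# ['LND UK', 'GBKTG', 'EUUSA', 'USA', 'SG', 'CNZOS', 'GAGAX', 'UK GBR']
```

If the data lives in a pandas DataFrame, the same function could be applied to the column with something like df["col"].apply(clean_value), where "col" stands in for the real column name.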

solution1
1 ACCEPTED 2021-01-02 21:52:55
