I am trying to figure out how I would deal with the following situation:
I have raw data that was entered manually and contains several unnecessary characters, and I need to clean the column.
Anything after a symbol such as (- / , . #) should be removed if it contains fewer than 5 letters.
Raw data
NYC USA - LND UK
GBKTG-U
DUB AE- EUUSA
USA -TY
SG !S
CNZOS !C SEA
GAGAX"T
AEU DGR# UK,GBR
Desired Output
LND UK
GBKTG
EUUSA
USA
SG
CNZOS
GAGAX
UK GBR
Split each line into origin and destination using regex capture groups, adjusting the separator ([^\w\s]) as needed. Next, count the letters on the right side of the separator symbol and check that count against the stated minimum number of letters.
Details:
(.*?) : capture group, zero or more characters (except line endings), non-greedy
[^\w\s] : followed by any single character that is not a letter, digit, underscore ([a-zA-Z0-9_]) or whitespace
(.*) : capture group, zero or more characters (except line endings)
File sample.txt used as input:
NYC USA - LND UK
GBKTG-U
DUB AE- EUUSA
USA -TY
SG !S
CNZOS !C SEA
GAGAX"T
AEU DGR# UK,GBR
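import re

# Read the whole file; each line holds one raw value.
with open("sample.txt", "r") as fh:
    txt = fh.read()

dest = []
# Split each line at its first symbol (any character that is not a word character or whitespace).
for left, right in re.findall(r"(.*?)[^\w\s](.*)", txt):
    # Keep the right-hand side only if it contains at least 5 letters.
    if sum(c.isalpha() for c in right) >= 5:
        dest.append(right.strip())
    else:
        dest.append(left.strip())
print(dest)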
['LND UK', 'GBKTG', 'EUUSA', 'USA', 'SG', 'CNZOS', 'GAGAX', 'UK,GBR']
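The same rule can also be wrapped in a function that cleans one value at a time, which may be easier to apply to a column. This is only a sketch: the names clean_value, SEPARATOR and LEFTOVER_SYMBOLS are made up for illustration, and two behaviours are assumptions rather than part of the answer above: a value with no symbol at all is returned unchanged, and any symbol left inside the kept part is replaced by a space (so that 'UK,GBR' becomes 'UK GBR', as in the desired output).

```python
import re

# Same pattern as in the answer; the names here are hypothetical.
SEPARATOR = re.compile(r"(.*?)[^\w\s](.*)")
LEFTOVER_SYMBOLS = re.compile(r"[^\w\s]+")

def clean_value(raw, min_letters=5):
    """Return the side of the first symbol that should be kept."""
    match = SEPARATOR.match(raw)
    if match is None:
        # Assumption: a value without any symbol is kept as-is.
        return raw.strip()
    left, right = match.groups()
    letters = sum(c.isalpha() for c in right)
    kept = right if letters >= min_letters else left
    # Assumption: symbols remaining in the kept part become single spaces.
    return " ".join(LEFTOVER_SYMBOLS.sub(" ", kept).split())

rows = ["NYC USA - LND UK", "GBKTG-U", "AEU DGR# UK,GBR", "PLAIN"]
cleaned = [clean_value(r) for r in rows]
print(cleaned)
```

For these four values the sketch prints ['LND UK', 'GBKTG', 'UK GBR', 'PLAIN']. If the data lives in a pandas column, the function could be passed to that column's apply method.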