
Python: split a string on a pattern

I have a long string to split.

 str1 = ' BATON ROUGE, LA -- Ascension, Assumption, East Baton Rouge, East Feliciana, Iberville, Livingston, Pointe Coupee, St. Helena, St. Mary, West Baton Rouge, West Feliciana Parishes, LA; Amite and Wilkinson Counties, MS. BEAUMONT-PORT ARTHUR, TX -- Hardin, Jasper, Jefferson, Newton, Orange,Tyler Counties, TX. '

expected outputs are:

sub1 = 'BATON ROUGE, LA -- Ascension, Assumption, East Baton Rouge, East Feliciana, Iberville, Livingston, Pointe Coupee, St. Helena, St. Mary, West Baton Rouge, West Feliciana Parishes, LA; Amite and Wilkinson Counties, MS.'
sub2 = 'BEAUMONT-PORT ARTHUR, TX -- Hardin, Jasper, Jefferson, Newton, Orange,Tyler Counties, TX.'

sub1 and sub2 each contain a region name and state name, along with the associated county list.

If I split only on '.', I run into trouble because some county names also contain '.'. How can I split on a pattern so that each of sub1 and sub2 ends with a state abbreviation followed by '.', such as 'MS.' or 'TX.' here? Thank you for your help.

You can try this:

import re
str1 = ' BATON ROUGE, LA -- Ascension, Assumption, East Baton Rouge, East Feliciana, Iberville, Livingston, Pointe Coupee, St. Helena, St. Mary, West Baton Rouge, West Feliciana Parishes, LA; Amite and Wilkinson Counties, MS. BEAUMONT-PORT ARTHUR, TX -- Hardin, Jasper, Jefferson, Newton, Orange,Tyler Counties, TX. '
# Raw string so that \s and \. reach the regex engine unchanged.
# The '.' itself is the separator, so it is removed from each piece.
new_data = re.split(r"(?<=\s[A-Z]{2})\.", str1)
print(new_data[0])
print(new_data[1])

Output:

BATON ROUGE, LA -- Ascension, Assumption, East Baton Rouge, East Feliciana, Iberville, Livingston, Pointe Coupee, St. Helena, St. Mary, West Baton Rouge, West Feliciana Parishes, LA; Amite and Wilkinson Counties, MS

BEAUMONT-PORT ARTHUR, TX -- Hardin, Jasper, Jefferson, Newton, Orange,Tyler Counties, TX

Regex explanation:

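\s[A-Z]{2} : matches a two-letter uppercase abbreviation (i.e. the state abbreviation) preceded by a whitespace character.

(?<=\s[A-Z]{2})\. : a positive look-behind followed by a literal dot. The dot matches only when it is immediately preceded by the pattern above, so the '.' in 'St. Helena' is not treated as a split point.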
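Note that the split above consumes the '.' and leaves a leading space on each piece, while the question asks for pieces ending in 'MS.' and 'TX.'. One possible variation is to move the dot inside the look-behind and split on the whitespace that follows it. This is only a sketch: it assumes every record ends with a state abbreviation, a period, and then whitespace or the end of the string, and the helper name split_regions is made up for illustration.

```python
import re

str1 = (
    " BATON ROUGE, LA -- Ascension, Assumption, East Baton Rouge, East Feliciana, "
    "Iberville, Livingston, Pointe Coupee, St. Helena, St. Mary, West Baton Rouge, "
    "West Feliciana Parishes, LA; Amite and Wilkinson Counties, MS. "
    "BEAUMONT-PORT ARTHUR, TX -- Hardin, Jasper, Jefferson, Newton, Orange,Tyler Counties, TX. "
)


def split_regions(text):
    """Split on whitespace that follows '<space><two capitals>.'.

    Hypothetical helper: the dot sits inside the look-behind, so it is
    kept at the end of each piece instead of being consumed.
    """
    # strip() removes the outer spaces so no empty or blank piece is produced.
    return re.split(r"(?<=\s[A-Z]{2}\.)\s+", text.strip())


sub1, sub2 = split_regions(str1)
print(sub1)
print(sub2)
```

With the sample string, sub1 ends in 'MS.' and sub2 ends in 'TX.'. Input where a two-letter uppercase word other than a state abbreviation is followed by '. ' would also be split there, so the pattern may need tightening for other data.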


 