How do I extract location names from a string with mixed commas and quotation marks? (using Regex or any other methods)

Question

I have a string of locations:

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

The location names are separated by commas, but any name that itself contains commas is enclosed in double quotation marks. Leading and trailing whitespace also needs to be stripped from each name.

After extracting the names into a list, the result should be:

['Los Angeles California', 'Heliopolis, Central, Cairo, Egypt', 'Berlin Germany', 'Paris France', 'Cairo, Egypt', 'Dokki, Giza, Egypt', 'Singapore']

I tried the following, and it does get the results, but it looks so cumbersome that I'm laughing at my own work:

import re

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
lis1 = [e.strip() for e in re.findall('"(.*?)"', locations)]
temp = []
for strg in lis1:
    temp.extend([x.strip() for x in strg.split(',')])
lis2 = [e.strip() for e in locations.split(',')]
for strg in lis2:
    if strg.strip('"').strip() not in temp:
        lis1.append(strg)
print(lis1)

So I'm reaching out to the community... Is there a better solution using Regex or any other methods?

Answer 1

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
# Splitting on '"' alternates unquoted (even index) and quoted (odd index) segments.
segments = locations.split('"')

result = []
for index, segment in enumerate(segments):
    if index % 2:
        # Quoted segment: keep it whole, inner commas included.
        parts = [segment]
    else:
        # Unquoted segment: commas are separators here.
        parts = segment.split(',')
    # Strip whitespace and skip the empty pieces left around separators.
    result.extend(p.strip() for p in parts if p.strip())

print(result)

Output

['Los Angeles California',
 'Heliopolis, Central, Cairo, Egypt',
 'Berlin Germany',
 'Paris France',
 'Cairo, Egypt',
 'Dokki, Giza, Egypt',
 'Singapore']

Answer 2

[l.strip() for l in locations.split(",")]
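A plain split on commas also cuts the quoted names apart, so some quote handling is needed. As one possible alternative, here is a minimal sketch that treats the string as a single CSV record and uses Python's standard csv module. It assumes the default, non-strict dialect; the names reader and names are illustrative only.

```python
import csv

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

# csv.reader expects an iterable of lines, so wrap the single string in a list.
# skipinitialspace=True lets a quoted field start after ", " and not only after ",".
reader = csv.reader([locations], skipinitialspace=True)

# Strip what is left: spaces inside the quotes and spaces before a comma.
names = [field.strip() for field in next(reader)]
print(names)
```

The final strip() is still needed because the reader keeps whitespace that sits inside the quotes or between a closing quote and the next comma.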

Answer 3

Try this (this doesn't use regex)

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, " Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

in_string = False
out = ['']

for char in locations:
    if char == '"':
        in_string = not in_string
        continue
    if char == ',':
        if not in_string:
            out.append('')
            continue
    out[-1] += char

print([x.strip() for x in out])

Output:

['Los Angeles California', 'Heliopolis, Central, Cairo, Egypt', 'Berlin Germany', 'Cairo, Egypt', 'Dokki, Giza, Egypt', 'Singapore']

Answer 4

I tried to solve this in JavaScript. Here is another possible solution:

JavaScript:

locations = 'Los Angeles California ,"Heliopolis, Cairo, Egypt",Berlin Germany, " Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'

locations.match(/\"?([\w, ]+\"?)/gi).map(x => x = x.replace(/\"/gi,'').trim().replace(/(^\,|\,$)/g, '').replace(/\s+/g, ' ').trim()).filter(x => x)

Output:

[
  'Los Angeles California',
  'Heliopolis, Cairo, Egypt',
  'Berlin Germany',
  'Cairo, Egypt',
  'Dokki, Giza, Egypt',
  'Singapore'
]

In Python:

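import re

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
# Match either a quoted name (group 1) or a run without commas or quotes (group 2).
x = re.findall(r'"([^"]*)"|([^",]+)', locations)

# At most one group is filled per match; drop the whitespace-only runs around separators.
print([(quoted or plain).strip() for quoted, plain in x if (quoted or plain).strip()])

Output:

['Los Angeles California', 'Heliopolis, Central, Cairo, Egypt', 'Berlin Germany', 'Paris France', 'Cairo, Egypt', 'Dokki, Giza, Egypt', 'Singapore']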

Answer 5

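Here's another way to solve it. Note that the quoted names come first in the result, so the original order is not preserved.

import re

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore'
# Quoted names: capture the text between each pair of double quotes.
lis1 = [e.strip() for e in re.findall('"(.*?)"', locations)]
# Remove the quoted names, then split what remains on commas.
templis = ''.join(re.split('".*?"', locations))
lis2 = [e.strip() for e in templis.split(',') if len(e.strip()) > 0]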

print(lis1 + lis2)

['Heliopolis, Central, Cairo, Egypt',
 'Cairo, Egypt',
 'Dokki, Giza, Egypt',
 'Los Angeles California',
 'Berlin Germany',
 'Paris France',
 'Singapore']

Answer 6

I tried again today and finally got an answer in a single line.

In JavaScript:

locations = `Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore, "Kolkata, India", Nepal, Bhutan`;

locations.replace(/\"[\w\s, ]+\"/gi, x => x.replace(/,/g, '\\').replace(/\"/g, '').trim()).split(',').map(x => x.replace(/\\/g, ',').trim())

Output:

[
  "Los Angeles California", 
   "Heliopolis, Central, Cairo, Egypt", 
   "Berlin Germany", 
   "Paris France", 
   "Cairo, Egypt", 
   "Dokki, Giza, Egypt", 
   "Singapore", 
   "Kolkata, India", 
   "Nepal", 
   "Bhutan"
]

Explanation:

- Find each run of text enclosed in double quotation marks (").
- Inside that run, replace every comma (,) with a backslash (\). I am using a backslash because it generally does not appear in location names.
- Remove the double quotation marks (").
- Now split the string on commas (,) and replace each backslash (\) with a comma (,).

I am not able to write this in Python. In JavaScript the pattern is:

str.replace(find_st, x => x.replace(find_st1, rep_st))

I don't know how to express the above in Python, specifically the inner callback function.

Can anyone help me write the above expression in Python in a single line?
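For what it's worth, Python's re.sub accepts a function as the replacement argument, which plays the same role as the inner arrow function in JavaScript. The following is only a sketch of the same steps: the pattern "([^"]*)" is a simplification I substituted for the original character class, and the names protected and names are illustrative.

```python
import re

locations = 'Los Angeles California ,"Heliopolis, Central, Cairo, Egypt",Berlin Germany, Paris France," Cairo, Egypt " , "Dokki, Giza, Egypt " , Singapore, "Kolkata, India", Nepal, Bhutan'

# re.sub calls the function once per match and uses its return value as the replacement.
# Inside each quoted run, commas become backslashes and the quotes are dropped.
protected = re.sub(r'"([^"]*)"', lambda m: m.group(1).replace(',', '\\'), locations)

# Split on the remaining separator commas, then restore the protected ones.
names = [part.replace('\\', ',').strip() for part in protected.split(',')]
print(names)
```

The two statements can be merged into one line by inlining the re.sub call where protected is used, at some cost to readability. Like the JavaScript version, this assumes no location name contains a backslash.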

6 answers

solution1
0 2022-08-12 10:39:03

solution2
0 2022-08-12 10:49:34

solution3
0 2022-08-12 10:51:46

solution4
0 2022-08-12 10:56:22

solution5
0 2022-08-12 11:28:40

solution6
0 2022-08-13 20:28:51
