Splitting a string into multiple lists

I am reading in a large text document and trying to split it into a list of lists. I'm having a hard time with the logic behind actually splitting up each string.

Example of the text:

Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410

This data contains 4 pieces of information in this format:

City[coordinates]Population Distances_to_previous

My aim is to split this data up into a List:

Data = [[City] , [Coordinates] , [Population] , [Distances]]

As far as I know I need to use .split statements but I've gotten lost trying to implement them.

I'd be very grateful for some ideas to get started!

I would do this in stages.

  • Your first split is at the '[' of the coordinates.
  • Your second split is at the ']' of the coordinates.
  • Your third split is at the end of the line.
  • The next line (if it starts with a number) holds your distances.
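To see what each stage returns, here is a minimal sketch of the first two splits on a single sample line. The variable names are illustrative only. str.partition returns a 3-tuple (before, separator, after), and the separator slot is an empty string when the separator is not found, which is a convenient way to tell city lines from distance lines.

```python
line = "Yankton, SD[4288,9739]12011"

# Stage 1: split at the first '['
city, sep, rest = line.partition('[')
# Stage 2: split the remainder at the first ']'
coords, _, population = rest.partition(']')

print(city)        # Yankton, SD
print(coords)      # 4288,9739
print(population)  # 12011

# A distance line has no '[', so the separator slot comes back empty
no_bracket = "966".partition('[')
print(no_bracket)  # ('966', '', '')
```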

I'd start with something like:

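# Sample input; for a real file, read its contents and call .splitlines() the same way
lines = """Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410""".splitlines()

Data = []

i = 0
while i < len(lines):
    city, bracket, rest = lines[i].partition('[')
    i += 1
    if not bracket:  # no '[' on this line, so it is not a city record
        continue
    # If you want coords as a list, split it again with coords.split(',')
    coords, _, population = rest.partition(']')

    # The next line holds the distances only if it starts with a digit
    distances = []
    if i < len(lines) and lines[i][:1].isdigit():
        distances = lines[i].split()
        i += 1

    # strip() drops a trailing newline if the lines came from readlines()
    Data.append([city, coords, population.strip(), distances])

for data in Data:
    print(data)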

This will print

['Youngstown, OH', '4110,8065', '115436', []]
['Yankton, SD', '4288,9739', '12011', ['966']]
['Yakima, WA', '4660,12051', '49826', ['1513', '2410']]
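If numbers are more convenient than strings, each record can be converted afterwards. This is a minimal sketch, assuming records shaped like the output above; the helper name to_typed is made up for illustration and is not part of the answer.

```python
def to_typed(record):
    """Convert one [city, coords, population, distances] record of strings to numbers."""
    city, coords, population, distances = record
    return [
        city,
        [int(c) for c in coords.split(',')],  # '4660,12051' -> [4660, 12051]
        int(population),
        [int(d) for d in distances],          # an empty list stays empty
    ]

sample = ['Yakima, WA', '4660,12051', '49826', ['1513', '2410']]
typed = to_typed(sample)
print(typed)
# ['Yakima, WA', [4660, 12051], 49826, [1513, 2410]]
```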

The easiest way would be with a regex.

lines = """Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410"""

import re

pat = re.compile(r"""
    (?P<City>.+?)                  # all characters up to the first [
    \[(?P<Coordinates>\d+,\d+)\]   # grabs [(digits,here)]
    (?P<Population>\d+)            # population digits here
    \s                             # one whitespace character after the population (a newline in the sample)
    (?P<Distances>[\d ]+)?         # Everything else is distances""", re.M | re.X)

groups = pat.finditer(lines)
results = [[[g.group("City")],
            [g.group("Coordinates")],
            [g.group("Population")],
            g.group("Distances").split() if 
                    g.group("Distances") else [None]]
            for g in groups]

DEMO:

In[50]: results
Out[50]: 
[[['Youngstown, OH'], ['4110,8065'], ['115436'], [None]],
 [['Yankton, SD'], ['4288,9739'], ['12011'], ['966']],
 [['Yakima, WA'], ['4660,12051'], ['49826'], ['1513', '2410']]]

Though if I may, it's probably BEST to do this as a list of dictionaries.

groups = pat.finditer(lines)
# one dictionary per match, holding all four named groups
results = [{key: g.group(key) for key in
                  ["City", "Coordinates", "Population", "Distances"]}
           for g in groups]
# then modify later
for d in results:
    try:
        d['Distances'] = d['Distances'].split()
    except AttributeError:
        # distances is None -- that's okay
        pass
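A shorter variant of the same idea, offered as a sketch rather than the answerer's code: match objects have a groupdict() method that returns every named group in one dictionary. The pattern below is a compact restatement of the one above (an assumption: the distances, when present, sit on the line directly after the city), and the sample text is the one from the question.

```python
import re

text = """Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410"""

# Compact version of the pattern, using the same group names
pat = re.compile(
    r"(?P<City>.+?)\[(?P<Coordinates>\d+,\d+)\](?P<Population>\d+)"
    r"(?:\n(?P<Distances>[\d ]+))?"
)

records = []
for m in pat.finditer(text):
    d = m.groupdict()                                # all named groups at once
    d["Distances"] = (d["Distances"] or "").split()  # None becomes []
    records.append(d)

for d in records:
    print(d)
```

Using an empty list for a missing distance line keeps every record the same shape, so later code can loop over d["Distances"] without a None check.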
