Split python string by space starting at certain character

I am trying to run some basic analyses on a .txt file filled with car data. I have read in the file to Python and am trying to split it into the appropriate columns, but what will be the "first column," the car name, sometimes has multiple words. For example, below are two lines with some of the information that my file has:

  1. Car Date Color Quantity (header row)
  2. Chevy Nova 7/1/2000 Blue 28,000
  3. Cadillac 7/1/2001 Silver 30,000

Therefore, when I split each line by spaces alone, I end up with lists of different sizes--in the example above, the "Chevy" and "Nova" would be separated from one another.

I have figured out a way to identify the portion of each line that represents the car name:

for line in cardata:
    if line == line[0]: #for header line
        continue
    else:
        line = line.rstrip()
        carnamebreakpoint = line.find('7/')
        print carnamebreakpoint
        carname = line[:carnamebreakpoint]
        print carname

What I'd like to do now is tell python to split by space after the carname (with the end goal of a list that looks like [carname, date, color, number sold]), but I've tried playing around with the .split() function to do this with no luck thus far. I'd love some guidance on how to proceed, as I'm fairly new to programming.

Thanks in advance for any help!

s = "Chevy Nova 7/1/2000 Blue 28,000"  
s.rsplit(None,3)

It will only split 3 times from the end of the string:

In [4]: s = "Chevy Nova 7/1/2000 Blue 28,000"    
In [5]: s.rsplit(None,3)
Out[5]: ['Chevy Nova', '7/1/2000', 'Blue', '28,000']
In [8]: s ="Car Date Color Quantity "
In [9]: s.rsplit(None,3)
Out[9]: ['Car', 'Date', 'Color', 'Quantity']

This presumes that the last three items will always be single-word strings, as in your example. That should hold, or else your indexing approach will fail as well.
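To make that assumption concrete, here is a minimal sketch of what happens when one of the trailing fields has two words. The helper name split_car_line and the two-word color are made up for illustration; they are not from the question's file.

```python
def split_car_line(line):
    """Split a line into [name, date, color, quantity], working from the right."""
    # maxsplit=3 keeps everything left of the last three fields together.
    return line.rstrip().rsplit(None, 3)

# Single-word trailing fields: the car name stays in one piece.
ok = split_car_line("Chevy Nova 7/1/2000 Blue 28,000")
print(ok)   # ['Chevy Nova', '7/1/2000', 'Blue', '28,000']

# Hypothetical two-word color: the date is pushed into the name field.
bad = split_car_line("Chevy Nova 10/6/2002 Light Blue 28,000")
print(bad)  # ['Chevy Nova 10/6/2002', 'Light', 'Blue', '28,000']
```

The second result still has four items, so unpacking would not raise an error. If multi-word colors can occur, the fields need some other check.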

Also, to ignore the header, you can call next() on the file object.

with open("your_file.txt") as f:
    header = next(f)
    for line in f:
        car_name,date,col,mile = line.rstrip().rsplit(None,3)
        print(car_name,date,col,mile)

Output (under Python 2, where print with a parenthesized list of names prints a tuple):

('Chevy Nova', '7/1/2000', 'Blue', '28,000')
('Cadillac', '7/1/2001', 'Silver', '30,000')

First slice the string at the breakpoint, then call split() on the result:

date, color, quantity = line[carnamebreakpoint:].split()
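Put together with the question's find('7/') lookup, the slice-then-split idea might look like the sketch below. The function name split_at_breakpoint and the marker parameter are hypothetical, and the code still relies on the question's assumption that every date starts with "7/".

```python
def split_at_breakpoint(line, marker='7/'):
    """Return [carname, date, color, quantity] by slicing at the date marker."""
    line = line.rstrip()
    carnamebreakpoint = line.find(marker)
    if carnamebreakpoint == -1:
        # find() returns -1 when the marker is absent (e.g. the header row).
        raise ValueError("no %r found in line: %r" % (marker, line))
    # Everything before the marker is the car name (minus the trailing space).
    carname = line[:carnamebreakpoint].rstrip()
    # Everything from the marker onward is split on whitespace.
    date, color, quantity = line[carnamebreakpoint:].split()
    return [carname, date, color, quantity]

row = split_at_breakpoint("Chevy Nova 7/1/2000 Blue 28,000")
print(row)  # ['Chevy Nova', '7/1/2000', 'Blue', '28,000']
```

The explicit check for -1 matters because slicing with -1 would silently drop the last character of the name instead of failing.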

Depending on how confident you are in the format of your data, your solution might not be the best one.

What would happen if you get a car whose date does not begin with "7/"? And what about a color like "Light Blue"?

This kind of task is a good fit for a regex.

For instance, a regex like this one lets you easily isolate the four components:

^(.*) (\d{1,2}/\d{1,2}/\d{4}) (.*) ([\d,]+)$

In python you can use it like this:

import re
s = "Chevy Nova 7/1/2000 Blue 28,000"
m = re.match(r"^(.*) (\d{1,2}/\d{1,2}/\d{4}) (.*) ([\d,]+)$", s)
m.group(1) # => Chevy Nova
m.group(2) # => 7/1/2000
m.group(3) # => Blue
m.group(4) # => 28,000
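One thing to watch for: re.match returns None when a line does not fit the pattern, so calling m.group() on a header row would raise an AttributeError. A minimal sketch of a guard, using the same pattern; the names CAR_LINE and parse_car_line are made up for illustration:

```python
import re

# Same pattern as above, compiled once so it can be reused per line.
CAR_LINE = re.compile(r"^(.*) (\d{1,2}/\d{1,2}/\d{4}) (.*) ([\d,]+)$")

def parse_car_line(line):
    """Return (name, date, color, quantity), or None if the line does not match."""
    m = CAR_LINE.match(line.rstrip())
    if m is None:
        return None
    return m.groups()

print(parse_car_line("Chevy Nova 7/1/2000 Blue 28,000"))
# ('Chevy Nova', '7/1/2000', 'Blue', '28,000')
print(parse_car_line("Car Date Color Quantity"))
# None
```

With this shape, the header row is skipped naturally: it contains no date, so it never matches.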

And if you have a string with multiple lines you could batch process them like this:

s = """Chevy Nova 7/1/2000 Blue 28,000
Chevy Nova 10/6/2002 Light Blue 28,000
Cadillac 7/1/2001 Silver 30,000"""

re.findall(r"^(.*) (\d{1,2}/\d{1,2}/\d{4}) (.*) ([\d,]+)$", s, flags=re.MULTILINE)
# => [('Chevy Nova', '7/1/2000', 'Blue', '28,000'),
# =>  ('Chevy Nova', '10/6/2002', 'Light Blue', '28,000'),
# =>  ('Cadillac', '7/1/2001', 'Silver', '30,000')]
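Since the end goal is analysis, the matched strings could be converted to typed values in the same pass. This is only a sketch: parse_cars and the dictionary keys are hypothetical names, and the month/day/year order is an assumption, since the question does not say which date convention the file uses.

```python
import re
from datetime import datetime

CAR_LINE = re.compile(
    r"^(.*) (\d{1,2}/\d{1,2}/\d{4}) (.*) ([\d,]+)$", re.MULTILINE
)

def parse_cars(text):
    """Turn every matching line of text into a dict with typed fields."""
    rows = []
    for name, date, color, quantity in CAR_LINE.findall(text):
        rows.append({
            "name": name,
            # Assumption: dates are month/day/year.
            "date": datetime.strptime(date, "%m/%d/%Y").date(),
            "color": color,
            # Drop the thousands separator before converting to int.
            "quantity": int(quantity.replace(",", "")),
        })
    return rows

text = """Car Date Color Quantity
Chevy Nova 7/1/2000 Blue 28,000
Chevy Nova 10/6/2002 Light Blue 28,000
Cadillac 7/1/2001 Silver 30,000"""

rows = parse_cars(text)
print(len(rows))            # 3 (the header line does not match)
print(rows[0]["quantity"])  # 28000
```

Lines that do not match the pattern are dropped silently here; for real data it may be safer to log them instead.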
