Split string using python and regex

I have the following CSV file:

NAME   DETAILS
abc    type1: Y, Property: p1,p3 , type2:N
def    type1: Y, Property: p2,p3 , type2:N
ghi    type1: N, Property: p1,p2 , type2:Y
jkl    type1: N, Property: p1,p3 , type2:Y

I want the output file to look like this:

NAME type1 Property type2
abc  Y      p1,p3    N
def  Y      p2,p3    N
ghi  N      p1,p2    Y
jkl  N      p1,p3    Y

Using Python and regular expressions, if I split the DETAILS column on ',', the Property values (for example p1,p3) are also split into separate columns. Is there a way to handle this?

There are many ways to do this, but I would split each line on any punctuation or whitespace character and then reconstruct it manually in the layout you want:

import re 

t = """abc    type1: Y, Property: p1,p3 , type2:N
def    type1: Y, Property: p2,p3 , type2:N
ghi    type1: N, Property: p1,p2 , type2:Y
jkl    type1: N, Property: p1,p3 , type2:Y""".split('\n')

for x in t:
    # Tokens: name, 'type1', Y/N, 'Property', p_a, p_b, 'type2', Y/N
    y = re.findall(r"[\w']+", x)
    print('\t'.join((y[0], y[2], y[4] + ',' + y[5], y[7])))

> abc   Y   p1,p3   N
> def   Y   p2,p3   N
> ghi   N   p1,p2   Y
> jkl   N   p1,p3   Y
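
The index-based reconstruction above assumes exactly two property values per row. As a minimal sketch of an alternative, assuming every field has the form `key: value`, the details string can be split only on commas that are followed by the next key, which keeps the property list intact whatever its length. The helper name `split_details` is hypothetical, not part of the original answer:

```python
import re

def split_details(details):
    """Split 'type1: Y, Property: p1,p3 , type2:N' into a dict of key -> value."""
    # Split on a comma only when it is followed by "word:", i.e. the next key.
    # The comma inside "p1,p3" is not followed by "word:", so it is left alone.
    parts = re.split(r"\s*,\s*(?=\w+\s*:)", details.strip())
    result = {}
    for part in parts:
        key, value = part.split(":", 1)
        result[key.strip()] = value.strip()
    return result

print(split_details("type1: Y, Property: p1,p3 , type2:N"))
# {'type1': 'Y', 'Property': 'p1,p3', 'type2': 'N'}
```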

Another way without regex would be to replace all delimiting characters and then reconstruct automatically. Something like this:

# Replacing ', ' (rather than ' , ') also strips the comma after the type1 value
print([x.replace(':', '\t').replace(', ', '\t').split() for x in t])
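
To finish the reconstruction, the keys can be dropped from each token list. This is only a sketch, assuming the tokens after the name strictly alternate key, value; the function name `to_row` is hypothetical:

```python
def to_row(line):
    # Turn ':' and ', ' into plain whitespace, then split on whitespace.
    # The comma in "p1,p3" has no space after it, so it survives.
    tokens = line.replace(":", " ").replace(", ", " ").split()
    name, rest = tokens[0], tokens[1:]
    # rest alternates key, value: ['type1', 'Y', 'Property', 'p1,p3', 'type2', 'N']
    values = rest[1::2]
    return "\t".join([name] + values)

print(to_row("abc    type1: Y, Property: p1,p3 , type2:N"))
```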

A sample script that uses a regex with capture groups to extract the data:

script.py

#!/usr/bin/env python

import re
import sys

def main():
    # Capture groups: 1 = name, 2 = type1, 3 = Property list, 4 = type2
    p = re.compile(r"([a-z]+).*type1:\s+([A-Z]),\s+Property:\s+?([a-z0-9,]+)\s+,\s+?type2:([A-Z])")

    for line in sys.stdin:
        m = p.match(line)
        if m:
            print("\t".join(m.groups()))

if __name__ == "__main__":
    main()
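
The script above prints the data rows only. A hedged variation, assuming Python 3 and the same input layout, uses named groups and the `csv` module to write the header row as well. The names `ROW` and `convert` are illustrative, and the pattern is a slightly relaxed version of the one above rather than the original author's:

```python
import csv
import io
import re

# Named groups make the column mapping explicit.
ROW = re.compile(
    r"(?P<name>\w+)\s+type1:\s*(?P<type1>\w+)\s*,\s*"
    r"Property:\s*(?P<prop>[\w,]+)\s*,\s*type2:\s*(?P<type2>\w+)"
)

def convert(lines, out):
    """Write a tab-separated table with a header row to the file-like object out."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["NAME", "type1", "Property", "type2"])
    for line in lines:
        m = ROW.match(line)
        if m:  # the input header and malformed lines do not match and are skipped
            writer.writerow(m.group("name", "type1", "prop", "type2"))

buf = io.StringIO()
convert(["NAME   DETAILS", "abc    type1: Y, Property: p1,p3 , type2:N"], buf)
print(buf.getvalue())
```

In a real run, `lines` could be an open input file and `out` an open output file instead of the in-memory buffer used here.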
