
How to split merged column with blank spaces inside

I have a dataframe with a column like this:

              0
0    ND  95 356  618 949
1    ND  173 379  571 317
2    ND  719 451 1 040 782
3    ND 1 546 946  588 486
4    ND 3 658 146 1 317 165
5    ND 6 773 270 1 137 655
6    ND 11 148 978 1 303 481
7    14 648 890 ND ND
8    16 968 348 ND 1 436 353
9    ND ND ND
10   ND ND ND

I don't know how to split it into columns, because there is no comma separator that would let me do dataset[0].str.split(',', expand = True). I tried dataset[0].str.extract(r'((\d{1,2}) (\d{2,3}) (\d{3})|(\d{2,3}) (\d{3}))'), but it only works for the first group of numbers: the first output column is right and the other five are pieces of the first.

       0     1   2   3   4   5
0    95 356 NaN NaN NaN 95  356

I think the solution involves regex, but I'm not really familiar with it. The desired output is:

          0            1            2
0        ND          95 356     618 949
1        ND         173 379     571 317
2        ND         719 451   1 040 782
3        ND       1 546 946     588 486
4        ND       3 658 146   1 317 165
5        ND       6 773 270   1 137 655
6        ND      11 148 978   1 303 481
7    14 648 890      ND           ND
8    16 968 348      ND       1 436 353
9        ND          ND           ND
10       ND          ND           ND

IIUC, the logic here is to group the tokens of each row in threes, treating ND as three items:

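import pandas as pd

# df is the question's DataFrame; the column is assumed to be labeled "0"
# (use df[0] instead if the label is the integer 0)

def chunks(lst, n):
    "https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks"
    # Yield successive n-sized slices of lst
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def join(arr, n):
    # Re-join every n tokens of a row into one field
    return pd.Series([" ".join(chunk) for chunk in chunks(arr, n)])

# Make each ND three tokens wide, like a full number
df["0"] = df["0"].str.replace("ND", "ND_1 ND_2 ND_3")
# One token per column; rows with fewer tokens are padded with ""
df2 = df["0"].str.split(r"\s", expand=True).fillna("").astype(str)
df2 = df2.apply(join, n=3, axis=1).replace("ND_1 ND_2 ND_3", "ND")
print(df2)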

Output:

             0           1          2
0           ND      95 356    618 949
1           ND     173 379    571 317
2           ND     719 451  1 040 782
3           ND   1 546 946    588 486
4           ND   3 658 146  1 317 165
5           ND   6 773 270  1 137 655
6           ND  11 148 978  1 303 481
7   14 648 890          ND         ND
8   16 968 348          ND  1 436 353
9           ND          ND         ND
10          ND          ND         ND
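To see why grouping in threes works, here is a rough single-row sketch of the same idea using only the standard library. It assumes the rows keep the double-space padding shown in the question (the extra space yields an empty token that fills the missing leading group of a short number); `split_row` and `PLACEHOLDER` are hypothetical names, not part of the code above.

```python
import re

N = 3  # tokens per field
PLACEHOLDER = "ND_1 ND_2 ND_3"  # hypothetical three-token stand-in for ND

def split_row(row):
    """Split one row into fields of N whitespace-separated tokens."""
    # Widen ND to three tokens, then split on every single whitespace character
    # so that a double space produces an empty padding token.
    tokens = re.split(r"\s", row.replace("ND", PLACEHOLDER))
    # Re-join every N tokens and drop the padding left by empty tokens.
    fields = [" ".join(tokens[i:i + N]).strip() for i in range(0, len(tokens), N)]
    # Collapse the placeholder back to ND.
    return ["ND" if field == PLACEHOLDER else field for field in fields]

print(split_row("ND  95 356  618 949"))
print(split_row("ND 11 148 978 1 303 481"))
print(split_row("14 648 890 ND ND"))
```

If the padding has been collapsed to single spaces, the token counts no longer line up and this approach would group the numbers incorrectly.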

You may use

^(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)\s+(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)\s+(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)$

See the regex demo. It matches:

  • ^ - start of a string
  • (ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 1:
    • ND| - ND , or
    • \d{1,2}(?:\s\d{3})*| - one or two digits followed with 0 or more occurrences of a whitespace and then three digits, or
    • \d{3}(?:\s\d{3})? - three digits followed with an optional sequence of a whitespace and three digits
  • \s+ - 1 or more whitespaces
  • (ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 2: same pattern as in Group 1
  • \s+ - 1+ whitespaces
  • (ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 3: same pattern as in Group 1
  • $ - end of string.
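The breakdown above can be checked against a few sample rows with the standard `re` module. This is only a sketch: `FIELD` and `ROW` are hypothetical names, the sub-pattern is stored once to keep the line short, and the rows are assumed to keep the double-space padding printed in the question.

```python
import re

# One field: ND, or a number written in space-separated groups of digits
FIELD = r"(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)"
# Three fields separated by one or more whitespace characters
ROW = re.compile(rf"^{FIELD}\s+{FIELD}\s+{FIELD}$")

rows = [
    "ND  95 356  618 949",
    "ND 3 658 146 1 317 165",
    "16 968 348 ND 1 436 353",
    "ND ND ND",
]

for row in rows:
    match = ROW.match(row)
    # groups() holds the three captured fields, or there is no match at all
    print(match.groups() if match else None)
```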

Note that you do not need to write this long pattern by hand: define the block that matches ND or a number once and reuse it. In Python, you can use it with the pandas Series.str.extract method:

v = r'(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)'
dataset[0].str.extract(fr'^{v}\s+{v}\s+{v}$', expand=True)
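For an end-to-end run, the two lines above can be tried on a small frame. The sample rows below are retyped from the question and assume its double-space padding is part of the strings; `result` is a hypothetical variable name.

```python
import pandas as pd

# A few rows from the question, with the padding kept as printed there
dataset = pd.DataFrame({0: [
    "ND  95 356  618 949",
    "ND 11 148 978 1 303 481",
    "14 648 890 ND ND",
    "ND ND ND",
]})

v = r'(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)'
# Each capture group becomes one output column; a row that does not match yields NaN
result = dataset[0].str.extract(fr'^{v}\s+{v}\s+{v}$', expand=True)
print(result)
```

Because the pattern is anchored with ^ and $, a row that does not consist of exactly three fields produces NaN in every column, which makes malformed rows easy to spot.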
