I have a dataframe with a column like this:
0
0 ND 95 356 618 949
1 ND 173 379 571 317
2 ND 719 451 1 040 782
3 ND 1 546 946 588 486
4 ND 3 658 146 1 317 165
5 ND 6 773 270 1 137 655
6 ND 11 148 978 1 303 481
7 14 648 890 ND ND
8 16 968 348 ND 1 436 353
9 ND ND ND
10 ND ND ND
I don't know how to split it into columns, because the values have no comma separator, so I can't use dataset[0].str.split(',', expand=True).
I tried: dataset[0].str.extract(r'((\d{1,2}) (\d{2,3}) (\d{3})|(\d{2,3}) (\d{3}))')
but it only works for the first group of numbers: the first output column is right, and the other five are pieces of the first.
        0    1    2    3   4    5
0  95 356  NaN  NaN  NaN  95  356
I think the solution involves RegEx, but I'm not really familiar with it. The desired output is:
             0           1          2
0           ND      95 356    618 949
1           ND     173 379    571 317
2           ND     719 451  1 040 782
3           ND   1 546 946    588 486
4           ND   3 658 146  1 317 165
5           ND   6 773 270  1 137 655
6           ND  11 148 978  1 303 481
7   14 648 890          ND         ND
8   16 968 348          ND  1 436 353
9           ND          ND         ND
10          ND          ND         ND
IIUC, the logic here is to group the items of each row in threes, treating ND as three items:
import pandas as pd

def chunks(lst, n):
    """Yield successive n-sized chunks from lst.
    https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def join(arr, n):
    # Join each chunk of n tokens back into one space-separated string
    return pd.Series([" ".join(chunk) for chunk in chunks(arr, n)])

# Expand ND into three placeholder tokens so it fills a whole chunk
df["0"] = df["0"].str.replace("ND", "ND_1 ND_2 ND_3")
df2 = df["0"].str.split(r"\s", expand=True).fillna("").astype(str)
df2 = df2.apply(join, n=3, axis=1).replace("ND_1 ND_2 ND_3", "ND")
print(df2)
Output:
             0           1          2
0           ND      95 356    618 949
1           ND     173 379    571 317
2           ND     719 451  1 040 782
3           ND   1 546 946    588 486
4           ND   3 658 146  1 317 165
5           ND   6 773 270  1 137 655
6           ND  11 148 978  1 303 481
7   14 648 890          ND         ND
8   16 968 348          ND  1 436 353
9           ND          ND         ND
10          ND          ND         ND
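To see the chunking idea on its own, without pandas, here is a minimal sketch on a single row. The sample row is taken from the question; the names tokens and fields are only illustrative. It assumes every number in the row has exactly three digit groups, since a two-group number such as 95 356 would not fill a chunk of three.

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# One row where every number has exactly three digit groups (assumption)
row = "ND 11 148 978 1 303 481"

# Expand ND into three placeholder tokens, then split on whitespace
tokens = row.replace("ND", "ND_1 ND_2 ND_3").split()

# Regroup the tokens in threes and restore the ND placeholder
fields = [" ".join(chunk) for chunk in chunks(tokens, 3)]
fields = ["ND" if f == "ND_1 ND_2 ND_3" else f for f in fields]

print(fields)  # ['ND', '11 148 978', '1 303 481']
```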
You may use
^(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)\s+(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)\s+(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)$
See the regex demo. It matches:
^ - start of string
(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 1:
    ND - ND, or
    \d{1,2}(?:\s\d{3})* - one or two digits followed by 0 or more occurrences of a whitespace and then three digits, or
    \d{3}(?:\s\d{3})? - three digits followed by an optional sequence of a whitespace and three digits
\s+ - 1 or more whitespaces
(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 2: same pattern as in Group 1
\s+ - 1 or more whitespaces
(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?) - Group 3: same pattern as in Group 1
$ - end of string.
Note you do not need to write this long pattern by hand: define the block that matches an ND or a number once and reuse it. In Python, you may use it with the Series.str.extract Pandas method:
v = r'(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)'
dataset[0].str.extract(fr'^{v}\s+{v}\s+{v}$', expand=True)
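A self-contained run of the same call might look like the sketch below. The sample rows are hypothetical and assume the fields are separated by two or more spaces while the digit groups inside a number are separated by one. That assumption matters: with single spaces everywhere, the greedy (?:\s\d{3})* part can absorb digits that belong to the next field.

```python
import pandas as pd

# Hypothetical sample: three spaces between fields, one space inside numbers
dataset = pd.DataFrame({0: [
    "ND   95 356   618 949",
    "ND   719 451   1 040 782",
    "14 648 890   ND   ND",
    "ND   ND   ND",
]})

# Reusable block: ND, or a number written as space-separated digit groups
v = r'(ND|\d{1,2}(?:\s\d{3})*|\d{3}(?:\s\d{3})?)'

# Three capture groups become three columns named 0, 1, 2
result = dataset[0].str.extract(fr'^{v}\s+{v}\s+{v}$', expand=True)
print(result)
```

Rows that do not match the whole pattern come back as NaN in every column, which makes malformed lines easy to spot.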