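How to split one cell of csv into columns of dataframe based on separator
I have a csv file in which all the data ends up in a single column, and I want to split the numeric data in that column into several columns. The data I have (after reading it into a dataframe) looks like this: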
0
0 13:25:09 -> mm [ -5, 4, 15 ] dd [ 4, 77, 8 ]
1 13:25:09 -> mm [ -4, 9, 10 ] dd [ 8, 6, 10 ]
2 13:25:09 -> mm [ 0, -4, 19 ] dd [ 3, 1, 66 ]
How can I do this?
I believe you need Series.str.extractall and Series.unstack:
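# -? keeps the minus sign of negative numbers; the raw string avoids an invalid-escape warning for \d
df = df[0].str.extractall(r'(-?\d+)')[0].unstack()
print(df)
match 0 1 2 3 4 5 6 7 8
0 13 25 09 -5 4 15 4 77 8
1 13 25 09 -4 9 10 8 6 10
2 13 25 09 0 -4 19 3 1 66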
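The values returned by extractall are strings, and the columns are simply numbered by match position. As a minimal sketch, the result can be converted to integers and given readable names. The column labels (`hour`, `mm_x`, and so on) are hypothetical, chosen only for illustration, and the pattern uses `-?` so that negative numbers keep their sign:

```python
import pandas as pd

# Sample rows copied from the question; they sit in a single column named 0.
raw = pd.DataFrame([
    "13:25:09 -> mm [ -5, 4, 15 ] dd [ 4, 77, 8 ]",
    "13:25:09 -> mm [ -4, 9, 10 ] dd [ 8, 6, 10 ]",
    "13:25:09 -> mm [ 0, -4, 19 ] dd [ 3, 1, 66 ]",
])

# One row per match, then pivot the match level into columns.
wide = raw[0].str.extractall(r'(-?\d+)')[0].unstack()

# Hypothetical labels: three time parts, then three mm and three dd values.
wide.columns = ['hour', 'minute', 'second',
                'mm_x', 'mm_y', 'mm_z',
                'dd_x', 'dd_y', 'dd_z']

# extractall yields strings, so convert every column to integers.
wide = wide.astype(int)
print(wide)
```

This assumes every line contains exactly nine numbers; a line with more or fewer matches would make the column assignment fail.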
Given this csv file
csvfile = '''13:25:09 -> mm [ -5, 4, 15 ] dd [ 4, 77, 8 ]
13:25:09 -> mm [ -4, 9, 10 ] dd [ 8, 6, 10 ]
13:25:09 -> mm [ 0, -4, 19 ] dd [ 3, 1, 66 ]'''
by doing
import pandas as pd
lines = csvfile.split('\n')
df = pd.DataFrame(lines)
you get a result that is not what you want:
0
0 13:25:09 -> mm [ -5, 4, 15 ] dd [ 4, 77, 8 ]
1 13:25:09 -> mm [ -4, 9, 10 ] dd [ 8, 6, 10 ]
2 13:25:09 -> mm [ 0, -4, 19 ] dd [ 3, 1, 66 ]
You should do this instead:
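import pandas as pd
lines = csvfile.split('\n')
# Locate the bracketed fields by their delimiters rather than by fixed
# character positions, so the slices stay correct if the spacing varies.
df = pd.DataFrame({'id': [1, 2, 3],
                   'time': [line[:8] for line in lines],
                   'mm': [line[line.index('['):line.index(']') + 1] for line in lines],
                   'dd': [line[line.rindex('['):line.rindex(']') + 1] for line in lines]})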
and you get
id time mm dd
0 1 13:25:09 [ -5, 4, 15 ] [ 4, 77, 8 ]
1 2 13:25:09 [ -4, 9, 10 ] [ 8, 6, 10 ]
2 3 13:25:09 [ 0, -4, 19 ] [ 3, 1, 66 ]
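If you would rather not index into each line by hand, one alternative is Series.str.extract with named groups, which become the column names. This is only a sketch under the assumption that every line follows the `time -> mm [...] dd [...]` layout shown above; the variable names `pattern` and `parsed` are made up for the example:

```python
import pandas as pd

csvfile = '''13:25:09 -> mm [ -5, 4, 15 ] dd [ 4, 77, 8 ]
13:25:09 -> mm [ -4, 9, 10 ] dd [ 8, 6, 10 ]
13:25:09 -> mm [ 0, -4, 19 ] dd [ 3, 1, 66 ]'''

lines = pd.Series(csvfile.split('\n'))

# Each named group (?P<name>...) becomes a column called `name`.
pattern = r'(?P<time>\d{2}:\d{2}:\d{2}) -> mm (?P<mm>\[.*?\]) dd (?P<dd>\[.*?\])'
parsed = lines.str.extract(pattern)

# Add a 1-based id column in front, mirroring the hand-built frame.
parsed.insert(0, 'id', list(range(1, len(parsed) + 1)))
print(parsed)
```

Lines that do not match the pattern come back as rows of NaN, so it is worth checking for missing values before going further.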
Note that mm will be a string
print(type(df['mm'][0]))
<class 'str'>
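It would be nicer to have a list of integers
# regex=False treats '[' and ']' as literal characters on every pandas version
df['mm_list'] = df['mm'].str.replace('[', '', regex=False).str.replace(']', '', regex=False).str.split(',').values.tolist()
df['mm_list_int'] = [[int(i) for i in x] for x in df['mm_list']]
df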
which results in the new column mm_list_int
id time mm dd mm_list mm_list_int
0 1 13:25:09 [ -5, 4, 15 ] [ 4, 77, 8 ] [ -5, 4, 15 ] [-5, 4, 15]
1 2 13:25:09 [ -4, 9, 10 ] [ 8, 6, 10 ] [ -4, 9, 10 ] [-4, 9, 10]
2 3 13:25:09 [ 0, -4, 19 ] [ 3, 1, 66 ] [ 0, -4, 19 ] [0, -4, 19]
The types are now correct
print(type(df['mm_list_int'][0]))
<class 'list'>
print(type(df['mm_list_int'][0][0]))
<class 'int'>
That is, a list of integers
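As an aside, since each cell such as `[ -5, 4, 15 ]` happens to be a valid Python list literal, the same conversion can probably be done in one step with ast.literal_eval from the standard library. A minimal sketch, assuming the cells contain nothing but bracketed integer lists:

```python
import ast

import pandas as pd

# The mm strings as they appear in the frame built above.
df = pd.DataFrame({'mm': ['[ -5, 4, 15 ]', '[ -4, 9, 10 ]', '[ 0, -4, 19 ]']})

# literal_eval safely parses a Python literal; here each string becomes a list of ints.
df['mm_list_int'] = df['mm'].apply(ast.literal_eval)

print(type(df['mm_list_int'][0]))
print(type(df['mm_list_int'][0][0]))
```

Unlike eval, literal_eval only accepts literals, so a malformed cell raises an exception instead of running arbitrary code.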
With
objs = [df, pd.DataFrame(df['mm_list_int'].tolist(), columns=['mm_x', 'mm_y', 'mm_z'])]
df_final = pd.concat(objs, axis=1)
df_final = df_final[['id', 'time', 'mm', 'dd', 'mm_x', 'mm_y', 'mm_z']]
you obtain
id time mm dd mm_x mm_y mm_z
0 1 13:25:09 [ -5, 4, 15 ] [ 4, 77, 8 ] -5 4 15
1 2 13:25:09 [ -4, 9, 10 ] [ 8, 6, 10 ] -4 9 10
2 3 13:25:09 [ 0, -4, 19 ] [ 3, 1, 66 ] 0 -4 19
Do the same for dd and you are done
# regex=False treats '[' and ']' as literal characters on every pandas version
df['dd_list'] = df['dd'].str.replace('[', '', regex=False).str.replace(']', '', regex=False).str.split(',').values.tolist()
df['dd_list_int'] = [[int(i) for i in x] for x in df['dd_list']]
objs = [df,
        pd.DataFrame(df['mm_list_int'].tolist(), columns=['mm_x', 'mm_y', 'mm_z']),
        pd.DataFrame(df['dd_list_int'].tolist(), columns=['dd_x', 'dd_y', 'dd_z'])]
df_final = pd.concat(objs, axis=1)
df_final = df_final[['id', 'time', 'mm_x', 'mm_y', 'mm_z', 'dd_x', 'dd_y', 'dd_z']]
Final result
id time mm_x mm_y mm_z dd_x dd_y dd_z
0 1 13:25:09 -5 4 15 4 77 8
1 2 13:25:09 -4 9 10 8 6 10
2 3 13:25:09 0 -4 19 3 1 66
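For a real file, the whole procedure could be condensed into a small helper that reads line by line. The sketch below is one possible shape, not a definitive implementation: `parse_log` is a hypothetical name, the `id` column is left out for brevity, and io.StringIO stands in for an opened file so the example runs on its own.

```python
import io
import re

import pandas as pd


def parse_log(handle):
    """Hypothetical helper: build one row per line, the time plus six integers."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split off the timestamp, then pull every signed integer from the rest.
        time, rest = line.split(' -> ', 1)
        numbers = [int(n) for n in re.findall(r'-?\d+', rest)]
        rows.append([time] + numbers)
    columns = ['time', 'mm_x', 'mm_y', 'mm_z', 'dd_x', 'dd_y', 'dd_z']
    return pd.DataFrame(rows, columns=columns)


# io.StringIO stands in for open('yourfile.csv'); the path is up to you.
sample = io.StringIO('''13:25:09 -> mm [ -5, 4, 15 ] dd [ 4, 77, 8 ]
13:25:09 -> mm [ -4, 9, 10 ] dd [ 8, 6, 10 ]
13:25:09 -> mm [ 0, -4, 19 ] dd [ 3, 1, 66 ]''')

df_final = parse_log(sample)
print(df_final)
```

Splitting on ` -> ` first keeps the digits of the timestamp out of the number list, so only the mm and dd values are converted.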