Pandas dataframe split string into multiple columns with conditions and missing data
So I have a DataFrame that looks like this:
import pandas as pd

df = pd.DataFrame({
    'feature1': [34, 45, 52],
    'feature2': [1, 0, 1],
    'unparsed_features': ["neoclassical, heavy, $2, old, bronze",
                          "romanticism, gold, $5",
                          "baroque, xs, $3, new"],
})
df
   feature1  feature2                     unparsed_features
0        34         1  neoclassical, heavy, $2, old, bronze
1        45         0                 romanticism, gold, $5
2        52         1                  baroque, xs, $3, new
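I'm trying to split the unparsed_features column into 6 columns (weight, age, colour, size, price and period), but as you can see the order is jumbled and, on top of that, some fields are missing.
I do have a rough idea of what each column can contain, as shown below: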
main_dict = {
    'weight': ['heavy', 'light'],
    'age': ['new', 'old'],
    'colour': ['gold', 'silver', 'bronze'],
    'size': ['xs', 's', 'm', 'l', 'xl', 'xxl', 'xxxl'],
    'price': ['$'],
    'period': ['renaissance', 'baroque', 'rococo', 'neoclassical', 'romanticism']
}
Ideally, I would like my dataframe to look like this:
df
   feature1  feature2                     unparsed_features  weight  price  age  \
0        34         1  neoclassical, heavy, $2, old, bronze   heavy     $2  old
1        45         0                 romanticism, gold, $5             $5
2        52         1                  baroque, xs, $3, new             $3  new

   size  colour        period
0        bronze  neoclassical
1          gold   romanticism
2    xs               baroque
I know the first step is to split the string on commas, but after that I'm lost.
df['unparsed_features'].str.split(',')
Thanks for your help.
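Since the data in 'unparsed_features' does not have the same structure in every row, I'm not sure there is an easy way. One approach is to use the main_dict dictionary you defined, loop over its items, and call str.extract with a pat argument that is different for price:
for key, list_item in main_dict.items():
    if key == 'price':
        # a literal '$' followed by one or more digits
        pat = r'(\$\d+)'
    else:
        # one keyword, preceded by a non-word character or the start of the
        # string and followed by a non-word character or the end of the string
        pat = r'(?:^|\W)(' + '|'.join(list_item) + r')(?:\W|$)'
    # expand=False returns a Series, so it can be assigned to a single column
    df[key] = df.unparsed_features.str.extract(pat, expand=False).fillna('')
\$\d+ looks for the $ symbol followed by digits, while (?:^|\W) looks for a non-word character (such as a space) or the start of the string before any of the words in list_item. The keyword itself is the only capturing group, so the extracted value does not include the surrounding space or comma, and the trailing (?:\W|$) stops a short keyword such as s from matching the beginning of a longer word.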
You get the expected result:
   feature1  feature2                     unparsed_features  weight  age  \
0        34         1  neoclassical, heavy, $2, old, bronze   heavy  old
1        45         0                 romanticism, gold, $5
2        52         1                  baroque, xs, $3, new          new

   colour  size  price        period
0  bronze           $2  neoclassical
1    gold           $5   romanticism
2            xs     $3       baroque
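To see what the two kinds of pattern do on a single string, here is a minimal sketch using the standard `re` module instead of pandas. The keyword pattern below is a variant of the one in the answer: it uses non-capturing groups and a trailing boundary, which is an assumption added for this illustration. The helper `first_group` and the variable names are hypothetical.

```python
import re

# Keyword list copied from main_dict['size'] in the question.
sizes = ['xs', 's', 'm', 'l', 'xl', 'xxl', 'xxxl']

# Assumed variant of the answer's pattern: a non-word character or the start
# of the string, then one keyword (captured), then a non-word character or
# the end of the string.
size_pattern = re.compile(r'(?:^|\W)(' + '|'.join(sizes) + r')(?:\W|$)')

# A literal '$' followed by one or more digits.
price_pattern = re.compile(r'(\$\d+)')


def first_group(pattern, text):
    """Return the first captured group, or '' when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else ''


row_with_size = "baroque, xs, $3, new"
row_without_size = "romanticism, gold, $5"

size_found = first_group(size_pattern, row_with_size)
size_missing = first_group(size_pattern, row_without_size)
price_found = first_group(price_pattern, row_with_size)

print(repr(size_found), repr(size_missing), repr(price_found))
```

The second row contains the letters s, m and l inside "romanticism", but none of them is surrounded by non-word characters, so no size is reported for it.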
Frankly, WB is right that you need to modify your dictionary, but to work with the data as given, here is my approach:
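for keys in main_dict:
    data_list = []
    for value in df.unparsed_features:  # for every row
        matching = []
        # split the row into tokens and drop the surrounding whitespace
        tokens = [v.strip() for v in value.split(',')]
        for l_data in main_dict[keys]:
            if keys == 'price':
                # a price token only has to contain the '$' marker
                matching = [v for v in tokens if l_data in v]
            else:
                # every other column needs an exact keyword match
                matching = [v for v in tokens if l_data == v]
            if matching:
                break
        # keep the first match, or None when the row has no value for this column
        data_list.append(matching[0] if matching else None)
    df[keys] = data_list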
This yields:
   feature1  feature2                     unparsed_features  weight   age  \
0        34         1  neoclassical, heavy, $2, old, bronze   heavy   old
1        45         0                 romanticism, gold, $5    None  None
2        52         1                  baroque, xs, $3, new    None   new

   colour  size  price        period
0  bronze  None     $2  neoclassical
1    gold  None     $5   romanticism
2    None    xs     $3       baroque
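The nested loops above scan every keyword for every row. As a possible alternative, which comes from neither answer, the dictionary can be inverted once so that each token is looked up directly. The sketch below assumes that every price token starts with `$` and that all other keywords are unique across columns; `lookup`, `parse_features`, `parsed` and `result` are hypothetical names.

```python
import pandas as pd

main_dict = {
    'weight': ['heavy', 'light'],
    'age': ['new', 'old'],
    'colour': ['gold', 'silver', 'bronze'],
    'size': ['xs', 's', 'm', 'l', 'xl', 'xxl', 'xxxl'],
    'price': ['$'],
    'period': ['renaissance', 'baroque', 'rococo', 'neoclassical', 'romanticism'],
}

# Invert the dictionary: keyword -> column name. 'price' is handled by prefix.
lookup = {kw: col
          for col, kws in main_dict.items() if col != 'price'
          for kw in kws}


def parse_features(text):
    """Map one comma-separated string to a {column: value} dict."""
    parsed_row = {}
    for token in text.split(','):
        token = token.strip()
        if token.startswith('$'):      # assumption: prices always start with '$'
            parsed_row['price'] = token
        elif token in lookup:          # exact keyword match
            parsed_row[lookup[token]] = token
    return parsed_row


df = pd.DataFrame({
    'feature1': [34, 45, 52],
    'feature2': [1, 0, 1],
    'unparsed_features': ["neoclassical, heavy, $2, old, bronze",
                          "romanticism, gold, $5",
                          "baroque, xs, $3, new"],
})

# One dict per row; columns missing from a row become NaN.
parsed = pd.DataFrame([parse_features(s) for s in df['unparsed_features']],
                      index=df.index)
# Fix the column order to that of main_dict.
parsed = parsed.reindex(columns=list(main_dict))

result = df.join(parsed)
print(result)
```

Missing fields show up as NaN here rather than None or an empty string; calling `result.fillna('')` would reproduce the blank cells shown in the question.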