pandas rearranging a data frame
I have a data frame that is as follows:
Honda [edit]
Accord (4 models)
Civic (4 models)
Pilot (3 models)
Toyota [edit]
Prius (4 models)
Highlander (3 models)
Ford [edit]
Explorer (2 models)
I am looking to reshape it so that I get a resulting two-column data frame as follows:
Honda Accord
Honda Civic
Honda Pilot
Toyota Prius
Toyota Highlander
and so on. I tried str.split to split between the [edit] markers, but was not successful. Any suggestions are most appreciated! Python newbie here, so apologies if this has been addressed before. Thanks!
So far I tried
maker = car['T'].str.extract(r'(.*\[edit\])', expand=False).str.replace(r'\[edit\]', '', regex=True)
This gives me the list of makers: Honda, Toyota and Ford. However, I am stuck at finding a way to extract the models between the makers to create the two-column DF.
The trick is to extract the model column first, then to get the maker.
import pandas as pd
import numpy as np

# Model rows contain '('; every other row becomes NaN.
df['model'] = df['T'].apply(
    lambda x: x.split('(')[0].strip() if '(' in x else np.nan)
# Maker rows contain '['; forward-fill each maker over the rows below it.
df['maker'] = df['T'].apply(
    lambda x: x.split('[')[0].strip() if '[' in x else np.nan).ffill()
# Drop the maker-only rows and the raw column, then order the columns.
df = df.dropna().drop('T', axis=1).reindex(
    columns=['maker', 'model']).reset_index(drop=True)
The first line of the code extracts all the models using the split and strip string operations if the entry contains '('; otherwise it assigns NaN. We use NaN so that we can delete those rows after finding the makers. At this stage the data frame df will be:
+----+-----------------------+------------+
| | T | model |
|----+-----------------------+------------|
| 0 | Honda [edit] | nan |
| 1 | Accord (4 models) | Accord |
| 2 | Civic (4 models) | Civic |
| 3 | Pilot (3 models) | Pilot |
| 4 | Toyota [edit] | nan |
| 5 | Prius (4 models) | Prius |
| 6 | Highlander (3 models) | Highlander |
| 7 | Ford [edit] | nan |
| 8 | Explorer (2 models) | Explorer |
+----+-----------------------+------------+
The second line does the same for the '[' records. Here the NaNs mark the empty maker cells, which are then filled forward. At this stage the data frame df will be:
+----+-----------------------+------------+---------+
| | T | model | maker |
|----+-----------------------+------------+---------|
| 0 | Honda [edit] | nan | Honda |
| 1 | Accord (4 models) | Accord | Honda |
| 2 | Civic (4 models) | Civic | Honda |
| 3 | Pilot (3 models) | Pilot | Honda |
| 4 | Toyota [edit] | nan | Toyota |
| 5 | Prius (4 models) | Prius | Toyota |
| 6 | Highlander (3 models) | Highlander | Toyota |
| 7 | Ford [edit] | nan | Ford |
| 8 | Explorer (2 models) | Explorer | Ford |
+----+-----------------------+------------+---------+
The third line drops the extra records, rearranges the columns, and resets the index:
| | maker | model |
|----+---------+------------|
| 0 | Honda | Accord |
| 1 | Honda | Civic |
| 2 | Honda | Pilot |
| 3 | Toyota | Prius |
| 4 | Toyota | Highlander |
| 5 | Ford | Explorer |
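To try the three steps end to end, the question's sample can be rebuilt by hand. The following is only a minimal sketch: the list literal and the name result are assumptions made for illustration, and only the column name T comes from the answer above.

```python
import numpy as np
import pandas as pd

# Sample rows typed in from the question (assumed to live in a column named 'T').
df = pd.DataFrame({'T': [
    'Honda [edit]', 'Accord (4 models)', 'Civic (4 models)', 'Pilot (3 models)',
    'Toyota [edit]', 'Prius (4 models)', 'Highlander (3 models)',
    'Ford [edit]', 'Explorer (2 models)',
]})

# Step 1: model rows contain '('; every other row becomes NaN.
df['model'] = df['T'].apply(
    lambda x: x.split('(')[0].strip() if '(' in x else np.nan)

# Step 2: maker rows contain '['; forward-fill each maker over its models.
df['maker'] = df['T'].apply(
    lambda x: x.split('[')[0].strip() if '[' in x else np.nan).ffill()

# Step 3: drop the maker-only rows and the raw column, then tidy up.
result = (df.dropna()
            .drop('T', axis=1)
            .reindex(columns=['maker', 'model'])
            .reset_index(drop=True))
print(result)
```

Keeping the outcome in a separate variable leaves df with its intermediate columns, which makes it easier to compare against the tables shown above.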
EDIT:
A more "pandorable" version (I am fond of one-liners)
df = (df['T'].str.extractall(r'(.+)\[|(.+)\(')
      .apply(lambda x: x.ffill() if x.name == 0 else x)
      .dropna(subset=[1])
      .reset_index(drop=True)
      .rename(columns={1: 'Model', 0: 'Maker'}))
The above works as follows: extractall returns a DataFrame with two columns. Column 0 corresponds to the first group in the regex, '(.+)\[', i.e. the maker records (the text before '['); column 1 corresponds to the second group, '(.+)\(', i.e. the model records. apply is used to iterate through the columns: the column named 0 is modified to propagate the 'Maker' values forward via ffill, and column 1 is left as is. dropna is then used with subset 1 to remove all rows where the value in column 1 is NaN. reset_index is used to drop the multi-index that extractall generates. Finally, the columns are renamed using rename and a correspondence dictionary.
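To see why both ffill and reset_index are needed, it can help to look at what extractall returns before any cleanup. A small sketch on a shortened four-row sample (the Series s and the name parts are illustrative assumptions, not part of the answer):

```python
import pandas as pd

s = pd.Series(['Honda [edit]', 'Accord (4 models)',
               'Toyota [edit]', 'Prius (4 models)'])

# One row per regex match; columns 0 and 1 are the two capture groups.
parts = s.str.extractall(r'(.+)\[|(.+)\(')

# The index has two levels: the original row label and a 'match' counter.
print(parts.index.names)
print(parts)
```

Each row fills exactly one of the two columns, which is why column 0 has to be forward-filled before the rows without a model are dropped. The captured text keeps the space that precedes the bracket, so calling .str.strip() on each column may be worth adding if exact names matter.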
Another one liner (func ;))
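# Build [maker, model] pairs per row, expand them into two columns,
# forward-fill the maker, then keep only the rows that have a model.
(df['T']
 .apply(lambda line: [line.split('[')[0].strip(), None] if '[' in line
        else [None, line.split('(')[0].strip()])
 .apply(pd.Series)
 .rename(columns={0: 'Maker', 1: 'Model'})
 .apply(lambda col: col.ffill() if col.name == 'Maker' else col)
 .dropna(subset=['Model'])
 .reset_index(drop=True))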
You can use extract with ffill. Then remove the rows that contain [edit] by boolean indexing with a mask from str.contains, then use reset_index to create a unique index, and finally remove the original column col with drop:
# Maker name from the [edit] rows (whitespace stripped), filled down over the models.
df['model'] = df.col.str.extract(r'(.*)\[edit\]', expand=False).str.strip().ffill()
df['type'] = df.col.str.extract(r'([A-Za-z]+)', expand=False)
df = df[~df.col.str.contains(r'\[edit\]')].reset_index(drop=True).drop('col', axis=1)
print(df)
model type
0 Honda Accord
1 Honda Civic
2 Honda Pilot
3 Toyota Prius
4 Toyota Highlander
5 Ford Explorer
Another solution uses extract and where to create the new column by condition, and finally boolean indexing again:
df['type'] = df.col.str.extract(r'([A-Za-z]+)', expand=False)
df['model'] = df['type'].where(df.col.str.contains(r'\[edit\]')).ffill()
df = df[df.type != df.model].reset_index(drop=True).drop('col', axis=1)
print(df)
type model
0 Accord Honda
1 Civic Honda
2 Pilot Honda
3 Prius Toyota
4 Highlander Toyota
5 Explorer Ford
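The where step is the least obvious part of this approach: it keeps a value where the condition is True and puts NaN everywhere else, which is what lets ffill carry each maker down. A minimal sketch on a shortened sample (the names names, is_maker and maker are assumptions for illustration):

```python
import pandas as pd

col = pd.Series(['Honda [edit]', 'Accord (4 models)', 'Civic (4 models)',
                 'Toyota [edit]', 'Prius (4 models)'])

# First run of letters on every row, maker and model rows alike.
names = col.str.extract(r'([A-Za-z]+)', expand=False)
# True only on the rows that carry the literal [edit] marker.
is_maker = col.str.contains('[edit]', regex=False)

# Keep the name on maker rows, NaN elsewhere, then fill downwards.
maker = names.where(is_maker).ffill()
print(maker.tolist())  # ['Honda', 'Honda', 'Honda', 'Toyota', 'Toyota']
```

One caveat with the final filter df.type != df.model: a model whose first word equals its maker would be dropped as well. Filtering with the mask itself, as in ~is_maker, is one way to sidestep that.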
EDIT:
If type needs to keep the spaces in its text, use replace to remove everything from ( to the end, together with the whitespace before it via \s+:
print(df)
col
0 Honda [edit]
1 Accord (4 models)
2 Civic (4 models)
3 Pilot (3 models)
4 Toyota [edit]
5 Prius (4 models)
6 Highlander (3 models)
7 Ford [edit]
8 Ford Expedition XL (2 models)
df['model'] = df.col.str.extract(r'(.*)\[edit\]', expand=False).str.strip().ffill()
# regex=True so the pattern is treated as a regular expression, not literal text.
df['type'] = df.col.str.replace(r'\s+\(.+$', '', regex=True)
df = df[~df.col.str.contains(r'\[edit\]')].reset_index(drop=True).drop('col', axis=1)
print(df)
model type
0 Honda Accord
1 Honda Civic
2 Honda Pilot
3 Toyota Prius
4 Toyota Highlander
5 Ford Ford Expedition XL
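The same idea can be wrapped in a small function so it is easy to test on names that contain spaces. This is a sketch only: split_cars is a hypothetical helper, and its output columns are named maker and model here, whereas the answer above calls them model and type.

```python
import pandas as pd


def split_cars(df, col='col'):
    """Hypothetical helper: return maker/model pairs from the raw column."""
    text = df[col]
    # True on the rows that carry the literal [edit] marker.
    is_maker = text.str.contains('[edit]', regex=False)
    # Remove the trailing "[edit]" or "(n models)" part and the space before it.
    name = text.str.replace(r'\s*[\[(].*$', '', regex=True)
    out = pd.DataFrame({
        'maker': name.where(is_maker).ffill(),
        'model': name,
    })
    # Keep only the model rows and renumber them from zero.
    return out[~is_maker].reset_index(drop=True)


sample = pd.DataFrame({'col': [
    'Honda [edit]', 'Accord (4 models)',
    'Ford [edit]', 'Ford Expedition XL (2 models)',
]})
print(split_cars(sample))
```

Because the rows are filtered with the [edit] mask rather than by comparing names, a model such as "Ford Expedition XL" survives even though it starts with its maker's name.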
Try
df.set_index(['Col1', 'Col2'])
It will rearrange the data like this:
COl1 COl2 honda civic civic accord toyota prius highlander
Please note that this is hierarchical data.
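As a rough illustration of what "hierarchical" means here, the sketch below assumes the data is already split into two columns named Col1 and Col2 (the sample values are made up from the question's cars). Note that set_index does not split a single text column; it only moves existing columns into the index.

```python
import pandas as pd

cars = pd.DataFrame({
    'Col1': ['honda', 'honda', 'toyota', 'toyota'],
    'Col2': ['civic', 'accord', 'prius', 'highlander'],
})

# Both columns move into a two-level MultiIndex; no data columns remain.
indexed = cars.set_index(['Col1', 'Col2'])
print(indexed.index.nlevels)
print(indexed.index.get_level_values('Col2').tolist())
```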