
pandas rearranging a data frame

I have a data frame that is as follows:

Honda [edit]
Accord (4 models)
Civic  (4 models)
Pilot  (3 models)
Toyota [edit]
Prius  (4 models)
Highlander (3 models)
Ford [edit]
Explorer (2 models)

I am looking to reshape it so that I get a two-column data frame as follows:

 Honda     Accord
 Honda     Civic
 Honda     Pilot
 Toyota    Prius
 Toyota    Highlander

and so on. I tried str.split, trying to split between the [edit] markers, but was not successful. Any suggestions are most appreciated! Python newbie here, so apologies if this has been addressed before. Thanks!

So far I tried:

     maker = car['T'].str.extract(r'(.*\[edit\])', expand=False).str.replace(r'\[edit\]', '', regex=True)

This gives me the list of makers: Honda, Toyota and Ford. However, I am stuck on finding a way to extract the models between the makers to create the two-column DataFrame.

The trick is to extract the car column first, then to get the maker.

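import pandas as pd
import numpy as np

# Model rows contain '('; keep the text before it, otherwise NaN.
df['model'] = df['T'].apply(
    lambda x: x.split('(')[0].strip() if x.count('(') > 0 else np.nan)

# Maker rows contain '['; keep the text before it and fill it forward.
df['maker'] = df['T'].apply(
    lambda x: x.split('[')[0].strip() if x.count('[') > 0 else np.nan).ffill()

# Drop the maker-only rows, keep the two columns, and reset the index.
df = df.dropna().drop('T', axis=1).reindex(
    columns=['maker', 'model']).reset_index(drop=True)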

The first line of the code extracts all the car models using the split and strip string operations if the entry contains '('; otherwise it assigns NaN. We use NaN so that we can delete those rows after finding the makers. At this stage the data frame df will be:

+----+-----------------------+------------+
|    | T                     | model      |
|----+-----------------------+------------|
|  0 | Honda [edit]          | nan        |
|  1 | Accord (4 models)     | Accord     |
|  2 | Civic  (4 models)     | Civic      |
|  3 | Pilot  (3 models)     | Pilot      |
|  4 | Toyota [edit]         | nan        |
|  5 | Prius  (4 models)     | Prius      |
|  6 | Highlander (3 models) | Highlander |
|  7 | Ford [edit]           | nan        |
|  8 | Explorer (2 models)   | Explorer   |
+----+-----------------------+------------+

The second line does the same for the '[' records. Here the NaNs mark the empty maker cells, which are then filled forward. At this stage the data frame df will be:

+----+-----------------------+------------+---------+
|    | T                     | model      | maker   |
|----+-----------------------+------------+---------|
|  0 | Honda [edit]          | nan        | Honda   |
|  1 | Accord (4 models)     | Accord     | Honda   |
|  2 | Civic  (4 models)     | Civic      | Honda   |
|  3 | Pilot  (3 models)     | Pilot      | Honda   |
|  4 | Toyota [edit]         | nan        | Toyota  |
|  5 | Prius  (4 models)     | Prius      | Toyota  |
|  6 | Highlander (3 models) | Highlander | Toyota  |
|  7 | Ford [edit]           | nan        | Ford    |
|  8 | Explorer (2 models)   | Explorer   | Ford    |
+----+-----------------------+------------+---------+

The third line drops the extra records, rearranges the columns, and resets the index:

|    | maker   | model      |
|----+---------+------------|
|  0 | Honda   | Accord     |
|  1 | Honda   | Civic      |
|  2 | Honda   | Pilot      |
|  3 | Toyota  | Prius      |
|  4 | Toyota  | Highlander |
|  5 | Ford    | Explorer   |
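
Putting the three steps together, a self-contained sketch might look like the following. The sample frame and the column name T mirror the question; the result is stored under the name result, which is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; the column name 'T' comes from the asker's code.
df = pd.DataFrame({'T': [
    'Honda [edit]', 'Accord (4 models)', 'Civic  (4 models)', 'Pilot  (3 models)',
    'Toyota [edit]', 'Prius  (4 models)', 'Highlander (3 models)',
    'Ford [edit]', 'Explorer (2 models)',
]})

# Step 1: rows containing '(' are models; everything else becomes NaN.
df['model'] = df['T'].apply(
    lambda x: x.split('(')[0].strip() if '(' in x else np.nan)

# Step 2: rows containing '[' are makers; forward-fill them onto the model rows.
df['maker'] = df['T'].apply(
    lambda x: x.split('[')[0].strip() if '[' in x else np.nan).ffill()

# Step 3: drop the maker-only rows, keep the two columns, and renumber.
result = (df.dropna()
            .drop('T', axis=1)
            .reindex(columns=['maker', 'model'])
            .reset_index(drop=True))
print(result)
```

The forward fill has to happen before dropna, since the maker name only exists on the header rows that dropna removes.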

EDIT:

A more "pandorable" version (I am fond of one-liners):

df = (df['T'].str.extractall(r'(.+)\[|(.+)\(')
      .apply(lambda x: x.ffill() if x.name == 0 else x)
      .dropna(subset=[1])
      .reset_index(drop=True)
      .rename(columns={1: 'Model', 0: 'Maker'}))

The above works as follows. extractall returns a DataFrame with two columns: column 0 corresponds to the first group in the regex, '(.+)\[', i.e. the maker records, which end with '['; column 1 corresponds to the second group, '(.+)\(', i.e. the model records. apply iterates over the columns: the column named 0 is modified to propagate the 'Maker' values forward via ffill, and column 1 is left as is. dropna is then used with subset=[1] to remove all rows where the value in column 1 is NaN, and reset_index drops the MultiIndex that extractall generates. Finally, the columns are renamed using rename and a correspondence dictionary.
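
To make the intermediate result easier to see, here is a small sketch that runs extractall on its own and then applies the same steps one at a time. The names s, raw and tidy are illustrative, and the extra strip is an addition of this sketch: (.+) also captures the whitespace in front of '[' or '(', which the one-liner leaves in place.

```python
import pandas as pd

s = pd.Series([
    'Honda [edit]', 'Accord (4 models)', 'Civic  (4 models)',
    'Toyota [edit]', 'Prius  (4 models)',
], name='T')

# extractall yields one row per match, indexed by (original row, match number).
raw = s.str.extractall(r'(.+)\[|(.+)\(')
print(raw.index.nlevels)  # 2

# Strip the captured trailing whitespace, then name the columns.
tidy = (raw.apply(lambda col: col.str.strip())
           .rename(columns={0: 'Maker', 1: 'Model'}))

# Carry each maker forward, drop the header rows, and flatten the index.
tidy['Maker'] = tidy['Maker'].ffill()
tidy = tidy.dropna(subset=['Model']).reset_index(drop=True)
print(tidy)
```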


Another one-liner (func ;)):

(df['T']
 .apply(lambda line: [line.split('[')[0].strip(), None] if line.count('[')
        else [None, line.split('(')[0].strip()])
 .apply(pd.Series)
 .rename(columns={0: 'Maker', 1: 'Model'})
 .apply(lambda col: col.ffill() if col.name == 'Maker' else col)
 .dropna(subset=['Model'])
 .reset_index(drop=True))

You can use extract with ffill. Then remove the rows that contain [edit] by boolean indexing with a mask built by str.contains, use reset_index to create a unique index, and finally remove the original column col with drop:

df['model'] = df.col.str.extract(r'(.*)\[edit\]', expand=False).ffill()
df['type'] = df.col.str.extract(r'([A-Za-z]+)', expand=False)
df = df[~df.col.str.contains(r'\[edit\]')].reset_index(drop=True).drop('col', axis=1)
print (df)
     model        type
0   Honda       Accord
1   Honda        Civic
2   Honda        Pilot
3  Toyota        Prius
4  Toyota   Highlander
5    Ford     Explorer
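
The mask can also be built once and reused, which may make the filtering step easier to follow. This is only a sketch on a shortened sample; the names is_header and out are illustrative, and the added str.strip() removes the space that (.*) captures before [edit].

```python
import pandas as pd

df = pd.DataFrame({'col': [
    'Honda [edit]', 'Accord (4 models)', 'Civic  (4 models)',
    'Toyota [edit]', 'Prius  (4 models)',
]})

# True for the maker header rows, False for the model rows.
is_header = df['col'].str.contains(r'\[edit\]')
print(is_header.tolist())  # [True, False, False, True, False]

# Header text (NaN elsewhere), stripped and carried forward.
df['model'] = (df['col'].str.extract(r'(.*)\[edit\]', expand=False)
                        .str.strip()
                        .ffill())
df['type'] = df['col'].str.extract(r'([A-Za-z]+)', expand=False)

# ~is_header keeps only the model rows.
out = df[~is_header].drop(columns='col').reset_index(drop=True)
print(out)
```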

Another solution uses extract and where to create the new column by condition, then boolean indexing again:

df['type'] = df.col.str.extract(r'([A-Za-z]+)', expand=False)
df['model'] = df['type'].where(df.col.str.contains(r'\[edit\]')).ffill()
df = df[df.type != df.model].reset_index(drop=True).drop('col', axis=1)
print (df)
         type   model
0      Accord   Honda
1       Civic   Honda
2       Pilot   Honda
3       Prius  Toyota
4  Highlander  Toyota
5    Explorer    Ford
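
As a rough illustration of what where contributes here, the sketch below applies it to a shortened sample. The names first_word, headers_only and maker are made up for this example.

```python
import pandas as pd

col = pd.Series(['Honda [edit]', 'Accord (4 models)',
                 'Toyota [edit]', 'Prius  (4 models)'])

first_word = col.str.extract(r'([A-Za-z]+)', expand=False)

# where keeps values where the condition is True and puts NaN elsewhere.
headers_only = first_word.where(col.str.contains(r'\[edit\]'))
print(headers_only.isna().tolist())  # [False, True, False, True]

# Forward fill turns the sparse header column into a maker for every row.
maker = headers_only.ffill()
print(maker.tolist())  # ['Honda', 'Honda', 'Toyota', 'Toyota']
```

One thing to keep in mind: the df.type != df.model filter compares first words, so a model whose name starts with the maker's name (such as Ford Expedition XL in the edit that follows) would be filtered out together with the header rows.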

EDIT:

If the type can contain spaces, use replace to remove everything from ( to the end of the string, along with the whitespace before it, which is matched by \s+:

print (df)
                             col
0                   Honda [edit]
1              Accord (4 models)
2              Civic  (4 models)
3              Pilot  (3 models)
4                  Toyota [edit]
5              Prius  (4 models)
6          Highlander (3 models)
7                    Ford [edit]
8  Ford Expedition XL (2 models)

df['model'] = df.col.str.extract(r'(.*)\[edit\]', expand=False).ffill()
df['type'] = df.col.str.replace(r'\s+\(.+$', '', regex=True)
df = df[~df.col.str.contains(r'\[edit\]')].reset_index(drop=True).drop('col', axis=1)
print (df)
     model                type
0   Honda               Accord
1   Honda                Civic
2   Honda                Pilot
3  Toyota                Prius
4  Toyota           Highlander
5    Ford   Ford Expedition XL
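
The replacement pattern can be tried in isolation. In this small sketch (the names col and cleaned are illustrative), regex=True is passed explicitly because pandas 2.0 changed the default of str.replace to literal matching.

```python
import pandas as pd

col = pd.Series(['Accord (4 models)', 'Civic  (4 models)',
                 'Ford Expedition XL (2 models)'])

# \s+ consumes the whitespace before '(', and \(.+$ consumes the rest of the string.
cleaned = col.str.replace(r'\s+\(.+$', '', regex=True)
print(cleaned.tolist())  # ['Accord', 'Civic', 'Ford Expedition XL']
```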

Try:
df.set_index(['Col1', 'Col2'])

It will rearrange the data like this:

COl1 COl2 honda civic civic accord toyota prius highlander

Please note that this is hierarchical data.
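
A minimal sketch of what set_index does, assuming the maker and model are already in separate columns. The column names maker, model and n_models are hypothetical stand-ins for Col1, Col2 and a data column; the counts are the ones listed in the question.

```python
import pandas as pd

# Hypothetical frame: 'maker' and 'model' stand in for Col1 and Col2.
df = pd.DataFrame({
    'maker': ['Honda', 'Honda', 'Toyota', 'Toyota'],
    'model': ['Civic', 'Accord', 'Prius', 'Highlander'],
    'n_models': [4, 4, 4, 3],
})

# Both columns move into a two-level (hierarchical) index.
indexed = df.set_index(['maker', 'model'])
print(indexed.index.nlevels)  # 2

# Selecting on the outer level returns the models under that maker.
print(indexed.loc['Honda'].index.tolist())  # ['Civic', 'Accord']
```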
