Split cell data into multiple rows using Python

Question

I want to split the data contained in a cell into multiple rows using Python. An example is given below:

This is my data:

fuel          cert_region   veh_class   air_pollution      city_mpg     hwy_mpg    cmb_mpg  smartway
ethanol/gas    FC              SUV          6/8              9/14        15/20      1/16      yes
ethanol/gas    FC              SUV          6/3              1/14        14/19      10/16     no

I want to convert it into this form:

fuel          cert_region   veh_class   air_pollution     city_mpg     hwy_mpg    cmb_mpg   smartway
ethanol        FC              SUV          6                9           15         1         yes
gas            FC              SUV          8                14          20         16        yes
ethanol        FC              SUV          6                1           14         10        no
gas            FC              SUV          3                14          19         16        no

The following code is returning an error:

import numpy as np
import pandas as pd
from itertools import chain

# return list from series of '/'-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split('/')))

# calculate lengths of splits
lens = df_08['fuel'].str.split('/').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({
    'cert_region': np.repeat(df_08['cert_region'], lens),
    'veh_class': np.repeat(df_08['veh_class'], lens),
    'smartway': np.repeat(df_08['smartway'], lens),
    'fuel': chainer(df_08['fuel']),
    'air_pollution': chainer(df_08['air_pollution']),
    'city_mpg': chainer(df_08['city_mpg']),
    'hwy_mpg': chainer(df_08['hwy_mpg']),
    'cmb_mpg': chainer(df_08['cmb_mpg'])})

It gives me this error:

 TypeError                                 Traceback (most recent call last)
 <ipython-input-31-916fed75eee2> in <module>()
 20                     'fuel': chainer(df_08['fuel']),
 21                     'air_pollution_score': chainer(df_08['air_pollution_score']),
 ---> 22                     'city_mpg': chainer(df_08['city_mpg']),
 23                    'hwy_mpg': chainer(df_08['hwy_mpg']),
 24                    'cmb_mpg': chainer(df_08['cmb_mpg']),

  <ipython-input-31-916fed75eee2> in chainer(s)
  4 # return list from series of comma-separated strings
  5 def chainer(s):
  ----> 6     return list(chain.from_iterable(s.str.split('/')))
  7 
  8 # calculate lengths of splits

  TypeError: 'float' object is not iterable

But city_mpg has the object data type:

   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 2404 entries, 0 to 2403
   Data columns (total 14 columns):
  fuel                    2404 non-null object
  cert_region             2404 non-null object
  veh_class               2404 non-null object
  air_pollution           2404 non-null object
  city_mpg                2205 non-null object
  hwy_mpg                 2205 non-null object
  cmb_mpg                 2205 non-null object
  smartway                2404 non-null object

Answer 1

My suggestion is to step out of pandas, do your computation, and put the result back into a DataFrame. In my opinion this is much easier to manipulate, and I'd like to believe it is faster:

import pandas as pd
from itertools import chain

Step 1: convert the DataFrame to a list of dicts, one per row:

M = df.to_dict('records')

Step 2: use a list comprehension to split the values:

res = [[(key,*value.split('/'))
       for key,value in d.items()]
       for d in M]
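
Note that `value.split('/')` assumes every value is a string. The `info()` output in the question lists city_mpg with 2205 non-null entries out of 2404, so some cells are presumably NaN (a float), which has no `split` method. A minimal sketch of a more tolerant splitter, assuming missing cells should simply be carried through unchanged; `split_cell` is a hypothetical helper name, not something from pandas:

```python
def split_cell(value, sep="/"):
    """Return the parts of a cell as a tuple; non-strings (e.g. NaN) stay whole."""
    if isinstance(value, str):
        return tuple(part.strip() for part in value.split(sep))
    return (value,)

# One made-up record, shaped like an element of df.to_dict('records').
record = {"fuel": "ethanol/gas", "cert_region": "FC", "city_mpg": float("nan")}
pairs = [(key, *split_cell(value)) for key, value in record.items()]
print(pairs)
```

The same helper could replace `value.split('/')` inside the comprehension above.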

Step 3: find the length of the longest entry. We need this to ensure all entries are the same length:

longest = max(len(line) for line in chain(*res))
print(longest)
# 3

Step 4: the longest entry has length 3, so any shorter entry must be padded by repeating its value:

explode = [[(entry[0], entry[-1], entry[-1])
            if len(entry) < longest else entry for entry in box]
            for box in res]

print(explode)

[[('fuel', 'ethanol', 'gas'),
  ('cert_region', 'FC', 'FC'),
  ('veh_class', 'SUV', 'SUV'),
  ('air_pollution', '6', '8'),
  ('city_mpg', '9', '14'),
  ('hwy_mpg', '15', '20'),
  ('cmb_mpg', '1', '16'),
  ('smartway', 'yes', 'yes')],
 [('fuel', 'ethanol', 'gas'),
  ('cert_region', 'FC', 'FC'),
  ('veh_class', 'SUV', 'SUV'),
  ('air_pollution', '6', '3'),
  ('city_mpg', '1', '14'),
  ('hwy_mpg', '14', '19'),
  ('cmb_mpg', '10', '16'),
  ('smartway', 'no', 'no')]]

Step 5: now we can pair the keys with their respective values to get a dictionary:

result = {start[0] :(*start[1:],*end[1:])
          for start,end in zip(*explode)}

print(result)

{'fuel': ('ethanol', 'gas', 'ethanol', 'gas'),
 'cert_region': ('FC', 'FC', 'FC', 'FC'),
 'veh_class': ('SUV', 'SUV', 'SUV', 'SUV'),
 'air_pollution': ('6', '8', '6', '3'),
 'city_mpg': ('9', '14', '1', '14'),
 'hwy_mpg': ('15', '20', '14', '19'),
 'cmb_mpg': ('1', '16', '10', '16'),
 'smartway': ('yes', 'yes', 'no', 'no')}

Read the result into a DataFrame:

pd.DataFrame(result)

      fuel cert_region veh_class air_pollution city_mpg hwy_mpg cmb_mpg smartway
0  ethanol          FC       SUV             6        9      15       1      yes
1      gas          FC       SUV             8       14      20      16      yes
2  ethanol          FC       SUV             6        1      14      10       no
3      gas          FC       SUV             3       14      19      16       no
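
The steps above unpack exactly two records in `zip(*explode)` and rely on a padded length of 3, so they fit the two-row sample rather than a full table. A rough generalization to any number of rows might look like the following. It assumes that cells with fewer parts should repeat their last part and that non-string cells (numbers, NaN) are copied as-is; `explode_records` is a hypothetical helper, not a pandas function:

```python
def explode_records(records, sep="/"):
    """Expand each dict so that '/'-separated cells become separate row dicts."""
    out = []
    for record in records:
        # Split string cells; leave anything else (numbers, NaN) as one part.
        parts = {
            key: value.split(sep) if isinstance(value, str) else [value]
            for key, value in record.items()
        }
        n_rows = max(len(p) for p in parts.values())
        for i in range(n_rows):
            # Cells with fewer parts repeat their last part.
            out.append({key: p[min(i, len(p) - 1)] for key, p in parts.items()})
    return out

# Made-up records with a subset of the question's columns.
records = [
    {"fuel": "ethanol/gas", "veh_class": "SUV", "city_mpg": "9/14", "smartway": "yes"},
    {"fuel": "ethanol/gas", "veh_class": "SUV", "city_mpg": "1/14", "smartway": "no"},
]
rows = explode_records(records)
for row in rows:
    print(row)
```

With pandas, this would be used as `pd.DataFrame(explode_records(df.to_dict('records')))`.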

Answer 2

I think you're better off constructing a new DataFrame:

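# start from an empty frame with the same columns as the source
result = pd.DataFrame(columns=df_08.columns)
for index, series in df_08.iterrows():
    temp1 = {}
    temp2 = {}
    for key, value in dict(series).items():
        # only strings can contain '/'; NaN cells are floats and are copied as-is
        if isinstance(value, str) and '/' in value:
            val1, val2 = value.split('/')
            temp1[key] = [val1]
            temp2[key] = [val2]
        else:
            temp1[key] = temp2[key] = [value]

    # append the two expanded rows to the result
    result = pd.concat([result, pd.DataFrame(data=temp1),
                        pd.DataFrame(data=temp2)], axis=0, ignore_index=True)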
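
Calling `pd.concat` inside the loop copies the growing frame on every iteration, which may get slow for a few thousand rows. A possible variant, assuming the same two-way split, collects plain dicts and builds the DataFrame once at the end. The small `df_08` below is made up for illustration, including one missing cell:

```python
import pandas as pd

# Made-up sample; the second row has a missing city_mpg value.
df_08 = pd.DataFrame({
    "fuel": ["ethanol/gas", "ethanol/gas"],
    "city_mpg": ["9/14", None],
    "smartway": ["yes", "no"],
})

rows = []
for _, series in df_08.iterrows():
    first, second = {}, {}
    for key, value in series.items():
        # Only split genuine strings; missing cells are copied to both rows.
        if isinstance(value, str) and "/" in value:
            first[key], second[key] = value.split("/", 1)
        else:
            first[key] = second[key] = value
    rows.extend([first, second])

# Build the frame in one step instead of concatenating per iteration.
result = pd.DataFrame(rows, columns=df_08.columns)
print(result)
```

As in the loop above, every source row produces two output rows, even if none of its cells contains a '/'.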
