[英]Split a cell data into multiple rows in using python
I want to split the data contained in a cell into multiple rows in using python.我想使用 python 将单元格中包含的数据拆分为多行。 Such an example is given below:
下面给出了这样一个例子:
This is my data:这是我的数据:
fuel cert_region veh_class air_pollution city_mpg hwy_mpg cmb_mpg smartway
ethanol/gas FC SUV 6/8 9/14 15/20 1/16 yes
ethanol/gas FC SUV 6/3 1/14 14/19 10/16 no
I want to convert it into this form:我想把它转换成这种形式:
fuel cert_region veh_class air_pollution city_mpg hwy_mpg cmb_mpg smartway
ethanol FC SUV 6 9 15 1 yes
gas FC SUV 8 14 20 16 yes
ethanol FC SUV 6 1 14 10 no
gas FC SUV 3 14 19 16 no
The following code is returning an error:以下代码返回错误:
import numpy as np
from itertools import chain
# return list from series of comma-separated strings
def chainer(s):
return list(chain.from_iterable(s.str.split('/')))
# calculate lengths of splits
lens = df_08['fuel'].str.split('/').map(len)
# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({
'cert_region': np.repeat(df_08['cert_region'], lens),
'veh_class': np.repeat(df_08['veh_class'], lens),
'smartway': np.repeat(df_08['smartway'], lens),
'fuel': chainer(df_08['fuel']),
'air_pollution': chainer(df_08['air_pollution']),
'city_mpg': chainer(df_08['city_mpg']),
'hwy_mpg': chainer(df_08['hwy_mpg']),
'cmb_mpg': chainer(df_08['cmb_mpg'])})
It gives me this error:它给了我这个错误:
TypeError Traceback (most recent call last)
<ipython-input-31-916fed75eee2> in <module>()
20 'fuel': chainer(df_08['fuel']),
21 'air_pollution_score': chainer(df_08['air_pollution_score']),
---> 22 'city_mpg': chainer(df_08['city_mpg']),
23 'hwy_mpg': chainer(df_08['hwy_mpg']),
24 'cmb_mpg': chainer(df_08['cmb_mpg']),
<ipython-input-31-916fed75eee2> in chainer(s)
4 # return list from series of comma-separated strings
5 def chainer(s):
----> 6 return list(chain.from_iterable(s.str.split('/')))
7
8 # calculate lengths of splits
TypeError: 'float' object is not iterable
But city_mpg
has the Object
data type:但是
city_mpg
具有Object
数据类型:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2404 entries, 0 to 2403
Data columns (total 14 columns):
fuel 2404 non-null object
cert_region 2404 non-null object
veh_class 2404 non-null object
air_pollution 2404 non-null object
city_mpg 2205 non-null object
hwy_mpg 2205 non-null object
cmb_mpg 2205 non-null object
smartway 2404 non-null object
my suggestion is to step out of pandas, do ur computation and put the result back into a dataframe.我的建议是退出 pandas,进行计算并将结果放回 dataframe。 in my opinion, it is much easier to manipulate, and I'd like to believe faster:
在我看来,操纵起来要容易得多,而且我想更快地相信:
from itertools import chain
Step 1: convert to dict:第1步:转换为dict:
M = df.to_dict('records')
Step 2: do a list comprehension and split the values:第 2 步:进行列表理解并拆分值:
res = [[(key,*value.split('/'))
for key,value in d.items()]
for d in M]
Step 3: find the length of the longest row.第三步:求最长行的长度。 We need this to ensure all rows are the same length:
我们需要这个来确保所有行的长度相同:
longest = max(len(line) for line in chain(*res))
print(longest)
#3
Step 4: the longest entry is 3;第四步:最长的条目是3; we need to ensure that the lines less than 3 are adjusted:
我们需要确保调整小于 3 的行:
explode = [[(entry[0], entry[-1], entry[-1])
if len(entry) < longest else entry for entry in box]
for box in res]
print(explode)
[[('fuel', 'ethanol', 'gas'),
('cert_region', 'FC', 'FC'),
('veh_class', 'SUV', 'SUV'),
('air_pollution', '6', '8'),
('city_mpg', '9', '14'),
('hwy_mpg', '15', '20'),
('cmb_mpg', '1', '16'),
('smartway', 'yes', 'yes')],
[('fuel', 'ethanol', 'gas'),
('cert_region', 'FC', 'FC'),
('veh_class', 'SUV', 'SUV'),
('air_pollution', '6', '3'),
('city_mpg', '1', '14'),
('hwy_mpg', '14', '19'),
('cmb_mpg', '10', '16'),
('smartway', 'no', 'no')]]
Step 4: Now we can pair the keys, with respective values to get a dictionary:第 4 步:现在我们可以将键与各自的值配对以获取字典:
result = {start[0] :(*start[1:],*end[1:])
for start,end in zip(*explode)}
print(result)
{'fuel': ('ethanol', 'gas', 'ethanol', 'gas'),
'cert_region': ('FC', 'FC', 'FC', 'FC'),
'veh_class': ('SUV', 'SUV', 'SUV', 'SUV'),
'air_pollution': ('6', '8', '6', '3'),
'city_mpg': ('9', '14', '1', '14'),
'hwy_mpg': ('15', '20', '14', '19'),
'cmb_mpg': ('1', '16', '10', '16'),
'smartway': ('yes', 'yes', 'no', 'no')}
Read result into dataframe:将结果读入 dataframe:
pd.DataFrame(result)
fuel cert_region veh_class air_pollution city_mpg hwy_mpg cmb_mpg smartway
0 ethanol FC SUV 6 9 15 1 yes
1 gas FC SUV 8 14 20 16 yes
2 ethanol FC SUV 6 1 14 10 no
3 gas FC SUV 3 14 19 16 no
I think you're better off constructing a new dataframe我认为你最好构建一个新的 dataframe
result = pd.DataFrame(columns=[your_columns])
for index, series in df_08.iterrows():
temp1 = {}
temp2 = {}
for key, value in dict(series).items():
if '/' in value:
val1, val2 = value.split('/')
temp1[key] = [val1]
temp2[key] = [val2]
else:
temp1[key] = temp2[key] = [value]
result = pd.concat([result, pd.DataFrame(data=temp1),
pd.DataFrame(data=temp2)], axis=0, ignore_index=True)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.