I have a pandas DataFrame that contains the iris dataset. I want to subset this DataFrame to include only sepal_length and species, and then reshape it so that the columns are the unique values of species and the values are the sepal_length values for that species.
import pandas as pd
# load data into a dataframe
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
df.head()
+----+---------------+--------------+---------------+--------------+---------+
| | sepal_length | sepal_width | petal_length | petal_width | species |
+----+---------------+--------------+---------------+--------------+---------+
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
+----+---------------+--------------+---------------+--------------+---------+
I can do this if I take the data out of pandas and use a dictionary to reshape it, but I can't figure out how to do it within pandas.
data = df.to_dict('records')
e = {}
for line in data:
    e[line['species']] = []
for line in data:
    e[line['species']].append(line['sepal_length'])
new = pd.DataFrame(e)
This is what I want to end up with:
+----+---------+-------------+-----------+
| | setosa | versicolor | virginica |
+----+---------+-------------+-----------+
| 0 | 5.1 | 7.0 | 6.3 |
| 1 | 4.9 | 6.4 | 5.8 |
| 2 | 4.7 | 6.9 | 7.1 |
| 3 | 4.6 | 5.5 | 6.3 |
| 4 | 5.0 | 6.5 | 6.5 |
+----+---------+-------------+-----------+
I've tried using pd.crosstab(df['sepal_length'], df['species']) but that doesn't get me what I want. I've also tried using df.pivot_table('sepal_length', columns='species') and that also isn't it.
What am I missing here?
IIUC, you can use groupby.cumcount on the species column and set the result as the index, then use pivot instead of pivot_table, since pivot does not require an aggregation function.
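# number the rows within each species (0, 1, 2, ...) and use that as the index
df1 = df.set_index(df.groupby('species').cumcount())
# each (index, species) pair is now unique, so pivot works without aggregating
df1 = df1.pivot(columns='species', values='sepal_length').rename_axis(None, axis=1)
print(df1.head())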
setosa versicolor virginica
0 5.1 7.0 6.3
1 4.9 6.4 5.8
2 4.7 6.9 7.1
3 4.6 5.5 6.3
4 5.0 6.5 6.5
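To make the role of cumcount easier to see, here is a minimal sketch on a hypothetical four-row frame (the name toy and its values are stand-ins for illustration, not part of the iris file). It assumes the rows of different species may be interleaved, which is exactly the case where a per-species counter is needed.

```python
import pandas as pd

# Hypothetical toy data standing in for the two iris columns used above.
toy = pd.DataFrame({
    'species': ['setosa', 'versicolor', 'setosa', 'versicolor'],
    'sepal_length': [5.1, 7.0, 4.9, 6.4],
})

# cumcount() numbers the rows within each species group: 0, 0, 1, 1
position = toy.groupby('species').cumcount()

# With that counter as the index, every (index, species) pair is unique,
# so pivot can spread the values into one column per species.
wide = (
    toy.set_index(position)
       .pivot(columns='species', values='sepal_length')
       .rename_axis(None, axis=1)
)
print(wide)
```

The counter gives the first row of every species the label 0, the second the label 1, and so on, so the species columns line up row by row instead of being staggered.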
What you're trying to do will take a few steps. (The code below assumes use of the standard "Iris dataset".)
First, let's subset your DataFrame to only the columns we need.
df_subset = df[['sepal_length','species']]
Next, use pandas.pivot (instead of pandas.pivot_table) to convert your DataFrame from "long" to "wide".
df_pivot = df_subset.pivot(columns='species',values='sepal_length')
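A quick way to inspect what this kind of pivot returns is to run it on a small frame. The sketch below uses a hypothetical four-row frame (toy and toy_pivot are illustrative names, not from the original data) and counts how many cells are filled in each row.

```python
import pandas as pd

# Hypothetical toy data with the default 0..3 row index.
toy = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
})

# No index argument is given, so the existing row index is kept.
toy_pivot = toy.pivot(columns='species', values='sepal_length')
print(toy_pivot)

# Each original row keeps its own index label, so it has a value
# in exactly one species column and NaN in the others.
filled_per_row = toy_pivot.notna().sum(axis=1)
print(filled_per_row)
```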
Now we're close to what you wanted, but because the three species columns share the original row index, the pivoted DataFrame contains NaNs in two of the three columns for any given row. We can work around this by concatenating the columns side by side while resetting each one's index (essentially creating three DataFrames, one for each species, and joining them along a new index). We can do this in one of two ways:
The compact solution:
names = ['setosa','versicolor','virginica']
df_final = pd.concat(map(lambda name: df_pivot[name].dropna().reset_index().drop('index',axis=1), names), axis=1)
Which is equivalent to:
df_final = pd.concat([
    df_pivot['setosa'].dropna().reset_index().drop('index',axis=1),
    df_pivot['versicolor'].dropna().reset_index().drop('index',axis=1),
    df_pivot['virginica'].dropna().reset_index().drop('index',axis=1)],
    axis=1)
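If you would rather not hard-code the species names, the same re-indexing idea can probably be written without the intermediate pivot. This is only a sketch on a hypothetical toy frame (toy and toy_final are illustrative names): it groups by species, resets each group's index with reset_index(drop=True), and lets pd.concat use the dictionary keys as column names.

```python
import pandas as pd

# Hypothetical toy data standing in for df_subset.
toy = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
})

# One Series per species, each renumbered from 0 so the rows line up.
columns = {
    name: group['sepal_length'].reset_index(drop=True)
    for name, group in toy.groupby('species')
}

# Concatenating a dict of Series along axis=1 uses the keys as column names.
toy_final = pd.concat(columns, axis=1)
print(toy_final)
```

Note that this relies on every species having the same number of rows, as in the iris data; with unequal group sizes the shorter columns would be padded with NaN at the bottom.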