I want to do the following:
1. Split a dataset based on two unique columns, make and model.
2. Export each resulting data frame into a .csv file with the unique make and model as the file name: make_model.csv.
3. For each csv file, do a PCA using the numeric columns mileage, lease and mpg.
4. Draw a scatterplot for each PCA, with each plot having the name of the csv file, for example honda_accord.
5. Place 4 figures per page and export as pdf files.
'df'
   make, model, year, mileage, lease, mpg, target
1  toyota, corolla, 2015, 17510, 710, 20, Y
2  honda, accord, 2012, 73640, 723, 23, Y
3  toyota, corolla, 2020, 28525, 610, 24, N
4  kia, sportage, 2018, 31008, 592, 20, N
5  volkswagen, jetta, 2017, 18007, 599, 21, Y
6  honda, accord, 2017, 18850, 690, 23, N
7  volkswagen, jetta, 2012, 5065, 292, 21, N
8  toyota, highlander, 2019, 18004, 729, 18, Y
9  volkswagen, jetta, 2016, 8361, 692, 21, Y
10 toyota, highlander, 2021, 28643, 729, 18, Y
Desired outcome:
# csv_name: toyota_corolla.csv
1  toyota, corolla, 2015, 17510, 710, 20, Y
3  toyota, corolla, 2020, 28525, 610, 24, N
# csv_name: toyota_highlander.csv
8  toyota, highlander, 2019, 18004, 729, 18, Y
10 toyota, highlander, 2021, 28643, 729, 18, Y
# csv_name: honda_accord.csv
2  honda, accord, 2012, 73640, 723, 23, Y
6  honda, accord, 2017, 18850, 690, 23, N
# csv_name: volkswagen_jetta.csv
5  volkswagen, jetta, 2017, 18007, 599, 21, Y
7  volkswagen, jetta, 2012, 5065, 292, 21, N
9  volkswagen, jetta, 2016, 8361, 692, 21, Y
See below for my code
in R
# 1. split a dataset based on two unique columns `make` and `model`.
df %>%
select(everything()) %>%
group_split(make,model) -> split_data
# 2. export each resulting data frame into a .csv file with the unique make and model as file name - make_model.csv, looping through the list of data frames
for (i in seq_along(split_data)) {
    filename = paste(i, ".csv")
    write.csv(data[[i]], filename) # returns only numbers '1.csv' instead of 'make_model.csv'
}
# 3 : for each csv file, I want to do a PCA using numeric columns-`mileage`, `lease` ,`mpg`
library(dplyr)
temp = list.files(pattern="*.csv")
for (i in 1:length(temp)) assign(temp[i], read.csv(temp[i]))
temp.pca <- prcomp(temp[,4:6])
# 4. scatterplot for each PCA with each plot having the name of the csv_file, example- honda_accord
library(ggfortify)
pca.plot <- autoplot(temp.pca, data = pilots, colour = 'target') # use the target Y and N as colors preferably Y-> green and N-> red
pca.plot
# 5. place four figures per page and export as pdf files
par(mfrow=c(2,2))
ggsave('cars_scatterplot.pdf')
In Python
#1. split a dataset based on two unique columns make and model.
#2. export each resulting data frame into a .csv file with the unique make and model as file name - make_model.csv.
for i in df.groupby(["make", "model"])[["make", "model"]].apply(lambda x: list(np.unique(x))):
    df.groupby(["make", "model"]).get_group((i[1], i[0])).to_csv(f"{i[1]}_{i[0]}.csv") # for some reason, this does not return all the unique .csv files
# 3, 4 & 5
import glob
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
fig, axs = plt.subplots(nrows=2,ncols=2,figsize=(9,6))
for ax,file in zip(axs.flatten(), glob.glob("./*csv")):
    df = pd.read_csv(file)
    df_temp = df.iloc[:,3:5]
    make = df['make'][0]
    model= df['model'][0]
    scaler = StandardScaler()
    scaler.fit(df_temp)
    scaled_data = scaler.transform(df_temp)
    pca = PCA(n_components=2)
    pca.fit(scaled_data)
    x_pca = pca.transform(scaled_data)
    ax.scatter(x_pca[:,0],x_pca[:,1],c = colormap[target]) # color 'red' and 'green' preferred
    ax.set_title(f"make:{make}, model:{model}")
    ax.set_xlabel('First principal component')
    ax.set_ylabel('Second Principal Component')
plt.tight_layout()
plt.legend()
fig.savefig("car_scatterplot.pdf",dpi = 300)
Thanks for the effort!
Your current code looks quite good. I changed only a few minor things in the Python code and got it running.
First, I created the data manually so that the behavior can be reproduced.
# create data manually to be able to reproduce the scenario
import pandas as pd

cols = ["make", "model", "year", "mileage", "lease", "mpg", "target"]
rows = [["toyota", "corolla", 2015, 17510, 710, 20, "Y"],
        ["honda", "accord", 2012, 73640, 723, 23, "Y"],
        ["toyota", "corolla", 2020, 28525, 610, 24, "N"],
        ["volkswagen", "jetta", 2017, 18007, 599, 21, "Y"],
        ["honda", "accord", 2017, 18850, 690, 23, "N"],
        ["volkswagen", "jetta", 2012, 5065, 292, 21, "N"],
        ["toyota", "highlander", 2019, 18004, 729, 18, "Y"],
        ["volkswagen", "jetta", 2016, 8361, 692, 21, "Y"],
        ["toyota", "highlander", 2021, 28643, 729, 18, "Y"]]
df = pd.DataFrame(columns=cols, data=rows)
As you might notice, I removed the Kia here. It is the only row in its make/model group, and a PCA with n_components=2 needs at least two points.
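Instead of deleting such rows by hand, groups that are too small could also be dropped programmatically. This is only a minimal sketch, assuming the same make and model column names as above; the sample frame and the min_rows threshold name are illustrative and not part of the original code.

```python
import pandas as pd

# Small illustrative frame; only the columns needed for grouping are included.
sample = pd.DataFrame({
    "make":    ["toyota", "honda", "toyota", "kia", "honda"],
    "model":   ["corolla", "accord", "corolla", "sportage", "accord"],
    "mileage": [17510, 73640, 28525, 31008, 18850],
})

min_rows = 2  # assumed threshold: PCA(n_components=2) needs at least two samples

# Keep only the rows whose (make, model) group has at least min_rows members.
filtered = sample.groupby(["make", "model"]).filter(lambda g: len(g) >= min_rows)

print(sorted(filtered["make"].unique()))
```

With this kind of filter the single kia row is removed, while the toyota and honda rows are kept.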
Your approach to the splitting did not work for me either, so I implemented a simple alternative.
# 1
groups = df.groupby(["make", "model"]).count().index
Then #2 can be done much as you already do it.
# 2
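for group in groups:
    # index=False: do not write the row index, so the column positions in the csv match those in df
    df[(df["make"] == group[0]) & (df["model"] == group[1])].to_csv(f"{group[0]}_{group[1]}.csv", index=False)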
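An alternative way to write the files, shown here only as a sketch, is to iterate over the groupby object directly, which yields each key together with its sub-frame. The sample frame and the temporary output folder are assumptions made to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({
    "make":  ["toyota", "honda", "toyota", "honda"],
    "model": ["corolla", "accord", "corolla", "accord"],
    "mpg":   [20, 23, 24, 23],
})

out_dir = tempfile.mkdtemp()  # hypothetical output folder for this example

# Iterating over a groupby yields (key, sub-frame) pairs; with two grouping
# columns the key is a (make, model) tuple.
for (make, model), group in sample.groupby(["make", "model"]):
    group.to_csv(os.path.join(out_dir, f"{make}_{model}.csv"), index=False)

written = sorted(os.listdir(out_dir))
print(written)
```

This avoids filtering the full frame once per group, and the file name comes straight from the group key.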
#3 - #5 were basically fine. Just keep in mind that the end index of a slice is excluded, not included, so to get mileage, lease and mpg you have to write:
df_temp = df.iloc[:,3:6]
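To make the slicing rule concrete, here is a small sketch on a one-row frame with the same column order as df. Selecting by column name is shown as a position-independent alternative; it is a suggestion, not what the code above does. Note that if a csv was written together with its row index, read_csv returns an extra leading column and all positions shift by one, which name-based selection is not affected by.

```python
import pandas as pd

cols = ["make", "model", "year", "mileage", "lease", "mpg", "target"]
sample = pd.DataFrame([["toyota", "corolla", 2015, 17510, 710, 20, "Y"]], columns=cols)

# Positions 3 and 4 only: the end index 5 is excluded.
by_position_3_5 = list(sample.iloc[:, 3:5].columns)

# Positions 3, 4 and 5: the end index 6 is excluded.
by_position_3_6 = list(sample.iloc[:, 3:6].columns)

# Selecting by name does not depend on column positions at all.
by_name = list(sample[["mileage", "lease", "mpg"]].columns)

print(by_position_3_5)
print(by_position_3_6)
```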
I have also defined a colormap, so steps 3-5 look like this:
colormap = ['red', 'green', 'blue', 'yellow', 'green']
fig, axs = plt.subplots(nrows=2,ncols=2,figsize=(9,6))
for ax,file in zip(axs.flatten(), glob.glob("./*csv")):
    df = pd.read_csv(file)
    df_temp = df.iloc[:,3:6]
    make = df['make'][0]
    model= df['model'][0]
    scaler = StandardScaler()
    scaler.fit(df_temp)
    scaled_data = scaler.transform(df_temp)
    pca = PCA(n_components=2)
    pca.fit(scaled_data)
    x_pca = pca.transform(scaled_data)
    ax.scatter(x_pca[:,0],x_pca[:,1], c = colormap[:len(x_pca)]) # color 'red' and 'green' preferred
    ax.set_title(f"make:{make}, model:{model}")
    ax.set_xlabel('First principal component')
    ax.set_ylabel('Second Principal Component')
plt.tight_layout()
plt.legend()
fig.savefig("car_scatterplot.pdf",dpi = 300)
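The separate fit and transform calls for scaling and PCA can also be combined. The following is a minimal sketch using scikit-learn's make_pipeline, with made-up values standing in for the mileage, lease and mpg columns of one group; it is an optional shortening, not a change in what is computed.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative values for mileage, lease and mpg (three rows, three features).
X = np.array([[18007.0, 599.0, 21.0],
              [5065.0, 292.0, 21.0],
              [8361.0, 692.0, 21.0]])

# Standardize the columns, then project onto the first two principal components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
x_pca = pipeline.fit_transform(X)

print(x_pca.shape)
```

The result has one row per car and one column per principal component, so x_pca[:, 0] and x_pca[:, 1] can be passed to ax.scatter as before.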
Edit:
To implement a multipage pdf, I have extended the data to the following:
# create data manually to be able to reproduce the scenario
cols = ["make", "model", "year", "mileage", "lease", "mpg", "target"]
rows = [["toyota", "corolla", 2015, 17510, 710, 20, "Y"],
        ["honda", "accord", 2012, 73640, 723, 23, "Y"],
        ["toyota", "corolla", 2020, 28525, 610, 24, "N"],
        ["volkswagen", "jetta", 2017, 18007, 599, 21, "Y"],
        ["honda", "accord", 2017, 18850, 690, 23, "N"],
        ["volkswagen", "jetta", 2012, 5065, 292, 21, "N"],
        ["toyota", "highlander", 2019, 18004, 729, 18, "Y"],
        ["volkswagen", "jetta", 2016, 8361, 692, 21, "Y"],
        ["toyota", "highlander", 2021, 28643, 729, 18, "Y"],
        ["bmw", "M5", 2005, 84392, 649, 25, "Y"],
        ["bmw", "M5", 2012, 17499, 899, 20, "N"]]
df = pd.DataFrame(columns=cols, data=rows)
df["color"] = df.apply(lambda row: "green" if row["target"] == "Y" else "red", axis=1)
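The same color column could also be built with a dictionary lookup instead of a row-wise apply. A short sketch on an illustrative frame; the name color_by_target is an assumption and does not appear in the code above.

```python
import pandas as pd

sample = pd.DataFrame({"target": ["Y", "N", "Y"]})

# Hypothetical lookup table: Y is drawn in green, N in red.
color_by_target = {"Y": "green", "N": "red"}

# Series.map replaces each target value by its entry in the dictionary.
sample["color"] = sample["target"].map(color_by_target)

print(sample)
```

Unlike the lambda, which colors everything that is not "Y" red, map leaves NaN for any target value missing from the dictionary, so unexpected labels become visible.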
You will have to add the following imports.
from matplotlib.backends.backend_pdf import PdfPages
import os
Then steps 3, 4 and 5 can be done as follows, which now also generates the legends.
# 3 4 5
files = [x for x in os.listdir("./") if os.path.splitext(x)[1] == ".csv"]
with PdfPages('car_scatterplot.pdf') as pdf:
    for i in range(0, len(files), 4):
        fig, axs = plt.subplots(nrows=2,ncols=2,figsize=(9,6))
        max_len = i + 4 if len(files) > i + 4 else len(files)
        for ax, file in zip(axs.flatten(), files[i:max_len]):
            df = pd.read_csv(file)
            df_temp = df.iloc[:,3:6]
            make = df['make'][0]
            model= df['model'][0]
            scaler = StandardScaler()
            scaler.fit(df_temp)
            scaled_data = scaler.transform(df_temp)
            pca = PCA(n_components=2)
            pca.fit(scaled_data)
            x_pca = pca.transform(scaled_data)
            df["x_pca_1"] = x_pca[:,0]
            df["x_pca_2"] = x_pca[:,1]
            for target in ["Y", "N"]:
                sub_df = df[df["target"] == target]
                if len(sub_df) > 0:
                    ax.scatter(sub_df["x_pca_1"],sub_df["x_pca_2"] ,c = sub_df['color'], label=target) # color 'red' and 'green' preferred
            ax.set_title(f"make:{make}, model:{model}")
            ax.set_xlabel('First principal component')
            ax.set_ylabel('Second Principal Component')
            plt.tight_layout()
            ax.legend(loc='best')
        pdf.savefig(fig, dpi = 300)
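The paging logic on its own can be sketched as below. Python slices that run past the end of a list are simply truncated, so the last page comes out shorter without an explicit bound check. The helper name pages_of and the file names are illustrative only.

```python
def pages_of(items, per_page=4):
    """Split a list into consecutive chunks of at most per_page items."""
    # A slice that runs past the end of the list is truncated, not an error.
    return [items[i:i + per_page] for i in range(0, len(items), per_page)]


# Hypothetical file list, as os.listdir might return it after step 2.
files = ["bmw_M5.csv", "honda_accord.csv", "toyota_corolla.csv",
         "toyota_highlander.csv", "volkswagen_jetta.csv"]

pages = pages_of(files)
print(pages)
```

Each inner list then corresponds to one 2x2 figure, that is, one page of the pdf.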