I need some help here. I'm trying to transform one column in my .csv file, in which some rows are empty and others contain a list of categories, as follows:
tdaa_matParent,tdaa_matParentQty
[],[]
[],[]
[],[]
[BCA_Aluminum],[1.3458]
[BCA_Aluminum],[1.3458]
[BCA_Aluminum],[1.3458]
[BCA_Aluminum],[1.3458]
[],[]
[Dye Penetrant Solution, BCA_Aluminum],[0.002118882, 1.3458]
So far I have only managed to binarize the first column (tdaa_matParent); I have not been able to replace the 1s with their corresponding quantity values. This is what I have:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
s = materials['tdaa_matParent']
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_)
BCA_Aluminum,Dye Penetrant Solution,tdaa_matParentQty
0,0,[]
0,0,[]
0,0,[]
1,0,[1.3458,0]
1,0,[1.3458,0]
1,0,[1.3458,0]
1,0,[1.3458,0]
0,0,[]
1,1,[1.3458,0.002118882]
What I really want is one new column per category (i.e. BCA_Aluminum and Dye Penetrant Solution), with each filled cell holding the corresponding value from the second column (tdaa_matParentQty) instead of a 1.
For example:
BCA_Aluminum,Dye Penetrant Solution
0,0
0,0
0,0
1.3458,0
1.3458,0
1.3458,0
1.3458,0
0,0
1.3458,0.002118882
Thanks! I built another approach that also works (a bit slower, though). If you have any suggestions, feel free to share :)
df_matParent_with_Qty = pd.DataFrame()
# For each row in the dataframe (index and the row's column values),
for index, row in ass_materials.iterrows():
    # For each element (matParent) in the row, keep its name and its position in the list:
    for i, element in enumerate(row["tdaa_matParent"]):
        # print(i)
        # print(element)
        # Fill the initially empty dataframe: the column named after the element,
        # at this row's index, gets the value at the same position in the matParentQty list.
        df_matParent_with_Qty.loc[index, element] = row['tdaa_matParentQty'][i]
df_matParent_with_Qty.head(10)
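A possible way to avoid the cell-by-cell .loc assignment is to build one dict per row and create the dataframe in a single call. This is only a sketch: it assumes both columns already hold real Python lists (as the loop above does), and the small ass_materials frame below is made-up sample data mirroring the question, not the real file.

```python
import pandas as pd

# Hypothetical sample mirroring the question's data; both columns hold Python lists.
ass_materials = pd.DataFrame({
    "tdaa_matParent": [[], ["BCA_Aluminum"], ["Dye Penetrant Solution", "BCA_Aluminum"]],
    "tdaa_matParentQty": [[], [1.3458], [0.002118882, 1.3458]],
})

# One dict per row: material name -> quantity (an empty dict for empty lists).
records = [
    dict(zip(names, qtys))
    for names, qtys in zip(ass_materials["tdaa_matParent"],
                           ass_materials["tdaa_matParentQty"])
]

# Build the frame in one go; materials missing from a row come out as NaN, then become 0.
df_qty = pd.DataFrame(records, index=ass_materials.index).fillna(0)
print(df_qty)
```

Unlike the loop, this keeps the rows whose lists are empty (they become all-zero rows), so the result stays aligned with the original index.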
This is how I would do it using mostly built-in Python tools, for the sample data provided in the question:
from collections import OrderedDict
import pandas as pd
# simple case: material names are known before we process the data, which lets us solve the problem with a single pass over the file
# OrderedDict is used to preserve the order of material names during the processing
base_result = OrderedDict([
    ('BCA_Aluminum', .0),
    ('Dye Penetrant Solution', .0)])
result = list()
with open('1.txt', mode='r', encoding='UTF-8') as file:
    # skip header
    file.readline()
    for line in file:
        # copy base_result to reuse it during the looping
        base_result_copy = base_result.copy()
        # modify base result only if there are values in the current line
        # (strip() also covers a last line without a trailing newline)
        if line.strip() != '[],[]':
            names, values = line.strip('[]\n').split('],[')
            for name, value in zip(names.split(', '), values.split(', ')):
                base_result_copy[name] = float(value)
        # append new line (base or modified) to the result as a plain list
        result.append(list(base_result_copy.values()))
# turn list of lists into pandas dataframe
result = pd.DataFrame(result, columns=list(base_result.keys()))
print(result)
Output:
   BCA_Aluminum  Dye Penetrant Solution
0        0.0000                0.000000
1        0.0000                0.000000
2        0.0000                0.000000
3        1.3458                0.000000
4        1.3458                0.000000
5        1.3458                0.000000
6        1.3458                0.000000
7        0.0000                0.000000
8        1.3458                0.002119
The output shows 0.002119 instead of 0.002118882 because of how pandas displays floats by default; the original precision is preserved in the actual data in the dataframe.
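To check that only the display is rounded, one can read the stored value back directly, or temporarily raise pandas' display.precision option. A minimal sketch, using a one-cell frame invented for illustration:

```python
import pandas as pd

# Hypothetical one-cell frame holding the value from the question.
df = pd.DataFrame({"Dye Penetrant Solution": [0.002118882]})

# The stored value is unchanged; only the printed representation is rounded.
stored = df.loc[0, "Dye Penetrant Solution"]
print(stored == 0.002118882)

# Temporarily show more decimal places when rendering the frame.
with pd.option_context("display.precision", 9):
    shown = df.to_string()
print(shown)
```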