I have a DataFrame in the format below. The description column holds strings.
file | description |
---|---|
x | [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]] |
y | [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457])]] |
How can I convert the DataFrame into the format below?
file | license | score |
---|---|---|
x | ['MIT', 'MIT', 'MIT', 'MIT', 'MIT'] | [0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217] |
y | ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'] | [0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457] |
The above is just an example; the real DataFrame is very large.
Update: if the elements in the column are strings, you can extract the arrays with a regex. (Note: don't use eval; see "Why should exec() and eval() be avoided?")
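import ast
import pandas as pd

# Parse the two bracketed substrings found by the regex into Python lists
new_cols = lambda x: pd.Series({'licence': ast.literal_eval(x[0]),
                                'score': ast.literal_eval(x[1])})
# The string column is named 'description' in the question
df = df.join(df['description'].str.findall(r'\[[^\[\]]*\]').apply(new_cols)).drop('description', axis=1)
print(df)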
Output:
file licence score
0 x ['MIT', 'MIT', 'MIT', 'MIT', 'MIT'] [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1 y ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', ... [0.28552457, 0.28552457, 0.28552457, 0.2855245...
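To illustrate why the regex step comes before `ast.literal_eval`, here is a minimal sketch. The helper name `is_literal` and the shortened sample string are assumptions made for illustration only. `ast.literal_eval` accepts plain literals such as lists, strings, and numbers, and refuses anything containing a function call, so it rejects the full description string (which contains `array(...)`) but accepts the bracketed pieces.

```python
import ast

# Shortened version of one description string from the question (assumed sample).
raw = "[[array(['MIT', 'MIT'], dtype=object), array([0.71641791, 0.69565217])]]"


def is_literal(text):
    """Return True if ast.literal_eval can parse `text`, False otherwise."""
    try:
        ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return False
    return True


# The whole string contains calls to array(...), so it is not a pure literal.
whole_ok = is_literal(raw)

# A single bracketed piece is a plain list literal.
piece_ok = is_literal("['MIT', 'MIT']")

# Arbitrary expressions are refused as well, which is what makes literal_eval safer than eval.
call_ok = is_literal("__import__('os').getcwd()")

print(whole_ok, piece_ok, call_ok)
```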
How the regex finds the arrays: it matches every substring that starts with [ and ends with ] and contains no [ or ] in between, so it picks out each innermost list.
>>> import re
>>> re.findall(r'\[[^\[\]]*\]', "[[np.array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), np.array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]",)
["['MIT', 'MIT', 'MIT', 'MIT', 'MIT']",
'[0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]']
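Putting the regex and `ast.literal_eval` together, the following is a self-contained sketch under some assumptions: the column is named `description` as in the question, each string contains exactly two innermost lists, and the sample data is shortened to two values per row. The helper `split_description` is a hypothetical name, not part of pandas.

```python
import ast
import re

import pandas as pd

# Innermost bracket pairs: '[', then any run of non-bracket characters, then ']'.
INNER_LIST = re.compile(r"\[[^\[\]]*\]")


def split_description(text):
    """Return (licenses, scores) parsed from one description string."""
    found = INNER_LIST.findall(text)
    if len(found) != 2:
        raise ValueError(f"expected 2 bracketed lists, found {len(found)}")
    licenses, scores = (ast.literal_eval(part) for part in found)
    return licenses, scores


# Shortened sample data in the shape shown in the question.
df = pd.DataFrame({
    "file": ["x", "y"],
    "description": [
        "[[array(['MIT', 'MIT'], dtype=object), array([0.71641791, 0.69565217])]]",
        "[[array(['APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457])]]",
    ],
})

# Parse each string once, then spread the pair into two columns.
parsed = df["description"].map(split_description)
result = df[["file"]].copy()
result["license"] = parsed.map(lambda pair: pair[0])
result["score"] = parsed.map(lambda pair: pair[1])

print(result)
```

The length check makes a malformed row fail loudly instead of silently producing misaligned columns. Note that the pattern would also split on a license name that itself contains `[` or `]`, so this sketch assumes no such names occur.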
Old answer: if the column holds the actual nested objects rather than strings, you can create the new columns and then join them with the original DataFrame.
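# x is the nested list [[licence_array, score_array]]
new_cols = lambda x: pd.Series({'licence': x[0][0], 'score': x[0][1]})
# The column is named 'description' in the question
df = df.join(df['description'].apply(new_cols)).drop('description', axis=1)
print(df)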
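For the object-valued case, a variant using `Series.map` instead of `apply` with `pd.Series` might look like the sketch below. It assumes each cell really is a nested list of the form `[[license_array, score_array]]` holding NumPy arrays, with shortened sample data; `.tolist()` is used here only to turn the arrays into plain Python lists.

```python
import numpy as np
import pandas as pd

# Sample frame whose cells are real nested objects, not strings (assumed data).
df = pd.DataFrame({
    "file": ["x", "y"],
    "description": [
        [[np.array(["MIT", "MIT"], dtype=object), np.array([0.71641791, 0.69565217])]],
        [[np.array(["APSL-1.0", "APSL-1.0"], dtype=object), np.array([0.28552457, 0.28552457])]],
    ],
})

out = df[["file"]].copy()
# cell[0] is the inner pair; cell[0][0] holds the licenses and cell[0][1] the scores.
out["license"] = df["description"].map(lambda cell: cell[0][0].tolist())
out["score"] = df["description"].map(lambda cell: cell[0][1].tolist())

print(out)
```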
Input:
file description
0 x [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], d...
1 y [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', '...
Doing:
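import ast
# Strip the 'array' wrapper and the dtype argument so only plain list literals remain
df.description = (df.description.str.replace('array', '')
                                .str.replace(', dtype=object', '')
                                .apply(ast.literal_eval))
# Each parsed value is [[licenses, scores]]; .str[0] takes the inner pair
df[['license', 'score']] = [(x[0], x[1]) for x in df.description.str[0]]
df = df.drop('description', axis=1)
print(df)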
Output:
file license score
0 x [MIT, MIT, MIT, MIT, MIT] [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1 y [APSL-1.0, APSL-1.0, APSL-1.0, APSL-1.0, APSL-... [0.28552457, 0.28552457, 0.28552457, 0.2855245...