I have two DataFrames, df1 and df2. Both are very large, with more than 1 million rows and about 1000 columns. df1 has a column, ColX, that holds characters (as shown below), and df2 has 900+ columns, each of which needs to be changed based on df1.
df1:
Index ColX ColY
100 C R
101 T Z
102 A Y
... .. ..
df2:
Index ColA ColB ColC ColD ... ...
100 0.033 0.10 0.22 1.22 ... ...
101 1.77 1.34 0.45 1.90 ... ...
102 0.88 1.56 1.99 0.99 ... ...
... ... ... ... ... ... ...
The condition to apply is:
If a value in df2 is >= 0 and < 1.5, replace it with the ColX value for that index.
Elif a value in df2 is >= 1.5 and <= 2, replace it with the ColY value for that index.
Expected Output:
df2:
Index ColA ColB ColC ColD ... ...
100 C C C C ... ...
101 Z T T Z ... ...
102 A Y Y A ... ...
... ... ... ... ... ... ...
I tried this way:
for v in df2.columns.tolist():
    df2 = df2.loc[(df2[v] >= 0) & (df2[v] < 1.5) , v] = df1['ColX']
Sometimes this works and sometimes it does not (for the first case), and this method is very slow. I have a very big file.
Can someone please tell me an efficient way to do this? Thanks in advance.
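One likely reason the attempt above misbehaves is the chained assignment `df2 = df2.loc[...] = df1['ColX']`. Python assigns the right-hand side to each target from left to right, so `df2` is rebound to the `ColX` Series before the `.loc` target is evaluated. The following is a minimal sketch on a three-row sample; the name `frame` and the cut-down sample frames are assumptions added for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ColX": ["C", "T", "A"]}, index=[100, 101, 102])
df2 = pd.DataFrame({"ColA": [0.033, 1.77, 0.88]}, index=[100, 101, 102])

frame = df2  # hypothetical second reference to the original DataFrame
error_name = None
try:
    # Targets are assigned left to right: df2 is rebound to df1["ColX"]
    # first, and only then is the .loc target (and its mask) evaluated.
    df2 = df2.loc[(df2["ColA"] >= 0) & (df2["ColA"] < 1.5), "ColA"] = df1["ColX"]
except Exception as exc:
    error_name = type(exc).__name__

print(type(df2).__name__)  # df2 no longer refers to the original DataFrame
print(error_name)          # the mask lookup failed on the rebound object
```

Dropping the leading `df2 =` and assigning only through `df2.loc[mask, v]` avoids the rebinding.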
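Maybe it is too slow, but this yields the desired result:
for v in df2.columns:
    # rows of column v that fall in [0, 1.5)
    ok = (df2[v] >= 0) & (df2[v] < 1.5)
    df2.loc[ok, v] = df1.loc[ok, 'ColX']
    # every other row gets ColY; this assumes all values lie in [0, 2]
    df2.loc[~ok, v] = df1.loc[~ok, 'ColY']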
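As a self-contained sketch of the column loop, the version below rebuilds the sample frames from the question and tests both ranges explicitly instead of relying on `~ok`. Writing into an object-typed copy (the name `result` is an assumption, not part of the original answer) is a choice made here so that strings are never assigned into float columns, which newer pandas versions warn about or reject.

```python
import pandas as pd

idx = [100, 101, 102]
df1 = pd.DataFrame({"ColX": ["C", "T", "A"], "ColY": ["R", "Z", "Y"]}, index=idx)
df2 = pd.DataFrame(
    {
        "ColA": [0.033, 1.77, 0.88],
        "ColB": [0.10, 1.34, 1.56],
        "ColC": [0.22, 0.45, 1.99],
        "ColD": [1.22, 1.90, 0.99],
    },
    index=idx,
)

# Object-typed copy to hold the labels; masks are still computed on df2.
result = df2.astype(object)
for col in df2.columns:
    low = (df2[col] >= 0) & (df2[col] < 1.5)    # replace with ColX
    high = (df2[col] >= 1.5) & (df2[col] <= 2)  # replace with ColY
    result.loc[low, col] = df1.loc[low, "ColX"]
    result.loc[high, col] = df1.loc[high, "ColY"]

print(result)
```

Values outside both ranges keep their original numbers in `result`. With 900+ columns this still loops in Python, so it is likely to be slower than a fully vectorized approach.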
If both DataFrames have the same index, use numpy.select, with broadcasting repeating the ColX and ColY values across the columns:
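import numpy as np
import pandas as pd

arr = df2.values
# boolean masks, one entry per cell of df2
m1 = (arr >= 0) & (arr < 1.5)
m2 = (arr >= 1.5) & (arr <= 2)
# column vectors of shape (n_rows, 1), broadcast across all columns
a1 = df1['ColX'].values[:, None]
a2 = df1['ColY'].values[:, None]
df = pd.DataFrame(np.select([m1, m2], [a1, a2]), index=df2.index, columns=df2.columns)
print(df)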
ColA ColB ColC ColD
100 C C C C
101 Z T T Z
102 A Y Y A
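For anyone who wants to run the numpy.select approach end to end, here is a minimal sketch on the sample data from the question. The explicit `default=np.nan` for cells that match neither range and the conversion to object arrays with `to_numpy(dtype=object)` are assumptions added here; the answer above relies on the default fill value of `0`.

```python
import numpy as np
import pandas as pd

idx = [100, 101, 102]
df1 = pd.DataFrame({"ColX": ["C", "T", "A"], "ColY": ["R", "Z", "Y"]}, index=idx)
df2 = pd.DataFrame(
    {
        "ColA": [0.033, 1.77, 0.88],
        "ColB": [0.10, 1.34, 1.56],
        "ColC": [0.22, 0.45, 1.99],
        "ColD": [1.22, 1.90, 0.99],
    },
    index=idx,
)

arr = df2.to_numpy()
in_low = (arr >= 0) & (arr < 1.5)    # cells that take the ColX label
in_high = (arr >= 1.5) & (arr <= 2)  # cells that take the ColY label

# Shape (n_rows, 1) so each row's label broadcasts across every column.
col_x = df1["ColX"].to_numpy(dtype=object)[:, None]
col_y = df1["ColY"].to_numpy(dtype=object)[:, None]

# Cells matching neither condition get NaN instead of the default 0.
labels = np.select([in_low, in_high], [col_x, col_y], default=np.nan)
result = pd.DataFrame(labels, index=df2.index, columns=df2.columns)
print(result)
```

This relies on df1 and df2 having their rows in the same order, because the arrays are matched by position rather than by index label.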