简体   繁体   中英

if value = string in both row and col of df then return new df

Not sure how to return the desired output for k for the condition on the string "START".

The column names dont need to be used/retained.

NOTE : I will not know in which row and column my string will be, hence the use of the for/if statements.

Input:

df = pd.DataFrame({'A': ['43', 23, 'Ndfg', 34, 0, 56],
               'B': ['5', 23, 'START', 89, 0, 4],
               'C': ['65', 7, 'dsfgA', 65, 47, 3],
               'D': ['65', 7, 'gfd', 3, 0, 7],
               'E': ['76', 7, 'Start', 5, 12, 1],
               'F': ['65', 7, 'sdfA', 5, 0, 4],
               'G': ['12', 7, 'START', 5, 8, 9],
               'H': ['89', 7, 'gfA', 5, 0, 8],
               'I': ['23', 7, 'sdfA', 5, 7, 23]})

k = []
for rw in range(df.shape[0]):
    for colm in range(df.shape[1]):
        if df.iloc[rw, colm] == 'START':
            row = rw + 1
            col = colm
            k.append(df.iloc[row:, col:])
            break

Output : df and k

df

          A      B      C    D      E     F      G    H     I
0         43      5     65   65     76    65     12   89    23
1         23     23      7    7      7     7      7    7     7
2         Ndfg  START  dsfgA  gfd  Start  sdfA  START  gfA  sdfA
3         34     89     65    3      5     5      5    5     5
4         0      0     47    0     12     0      8    0     7
5         56      4      3    7      1     4      9    8    23

k

[    B   C  D   E  F  G  H   I
 3  89  65  3   5  5  5  5   5
 4   0  47  0  12  0  8  0   7
 5   4   3  7   1  4  9  8  23]

Desired Output for k : Only the columns of numbers under "Start"

    [    B   E  G
     3  89   5  5
     4   0   12 8
     5   4   1  9]
(df[df.astype(str)
      .apply(lambda s: s.str.lower())
      .eq("start")
      .cummax()]
   .dropna(how="all", axis=0)
   .dropna(how="all", axis=1)
   .iloc[1:])
  • get a True/False mask dataframe indicating value is "start" or not; to do this

    • convert to str
    • convert to lowercase
    • check equality against "start"

now we have

In [350]: df
Out[350]:
      A      B      C    D      E     F      G    H     I
0    43      5     65   65     76    65     12   89    23
1    23     23      7    7      7     7      7    7     7
2  Ndfg  START  dsfgA  gfd  Start  sdfA  START  gfA  sdfA
3    34     89     65    3      5     5      5    5     5
4     0      0     47    0     12     0      8    0     7
5    56      4      3    7      1     4      9    8    23

In [351]: df.astype(str).apply(lambda s: s.str.lower()).eq("start")
Out[351]:
       A      B      C      D      E      F      G      H      I
0  False  False  False  False  False  False  False  False  False
1  False  False  False  False  False  False  False  False  False
2  False   True  False  False   True  False   True  False  False
3  False  False  False  False  False  False  False  False  False
4  False  False  False  False  False  False  False  False  False
5  False  False  False  False  False  False  False  False  False
  • now the "region" marking: want below parts of True
    • False is 0, True is 1 in numeric context
    • what if we took cumulative maximum?
In [352]: _.cummax()
Out[352]:
       A      B      C      D      E      F      G      H      I
0  False  False  False  False  False  False  False  False  False
1  False  False  False  False  False  False  False  False  False
2  False   True  False  False   True  False   True  False  False
3  False   True  False  False   True  False   True  False  False
4  False   True  False  False   True  False   True  False  False
5  False   True  False  False   True  False   True  False  False

(_ is the last output above this)

  • ok we got the Trues and Falses in place
    • index into df with it
    • values in Trues will be kept, Falses will see NaN
In [357]: df[_]
Out[357]:
     A      B    C    D      E    F      G    H    I
0  NaN    NaN  NaN  NaN    NaN  NaN    NaN  NaN  NaN
1  NaN    NaN  NaN  NaN    NaN  NaN    NaN  NaN  NaN
2  NaN  START  NaN  NaN  Start  NaN  START  NaN  NaN
3  NaN     89  NaN  NaN      5  NaN      5  NaN  NaN
4  NaN      0  NaN  NaN     12  NaN      8  NaN  NaN
5  NaN      4  NaN  NaN      1  NaN      9  NaN  NaN
  • almost there; NaNs need to go
    • dropna over rows but with how="all"
      • so that only if a row is all NaNs, it is dropped
    • do the same for columns too
 In [358]: _.dropna(how="all", axis=0).dropna(how="all", axis=1)
Out[358]:
       B      E      G
2  START  Start  START
3     89      5      5
4      0     12      8
5      4      1      9
  • last thing: no need the "start" row, so .iloc[1:] it out
In [359]: _.iloc[1:]
Out[359]:
    B   E  G
3  89   5  5
4   0  12  8
5   4   1  9

extra: time comparison for a frame with ~50_000 entries
 # this is 6000 x 9 In [364]: df = pd.concat([df] * 1000, ignore_index=True) In [366]: %%timeit...: (...: df...: .applymap(lambda x: type(x) == str and x.upper() == "START")...: .stack()...: .loc[lambda x: x]...: .reset_index()...: .apply(lambda x: df.loc[slice(x["level_0"] + 1, None), x['level_1']], axis=1)...: .T...: )...: ...: 4.11 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [367]: %%timeit...: (df[df.astype(str)...: .apply(lambda s: s.str.lower())...: .eq("start").cummax()]...: .dropna(how="all", axis=0)...: .dropna(how="all", axis=1)...: .iloc[1:])...: 27.1 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

~150 times

One way to do:

df2 = df.apply(lambda x:x.str.upper()) #First convert str to upper
# Find the index where the START may appear - I am assuming there is only one such row.
df2 = df2.query('|'.join(df2.columns.map('{}=="START"'.format).tolist())) 
row_index = df2.index[0] #get the index where START appears
s = df2.eq('START').iloc[0] #Find 'START'
col_index = s[s].index # get the columns where START appears
df.iloc[row_index+1:][col_index] # Then locate

output:

    B   E  G
3  89   5  5
4   0  12  8
5   4   1  9
(
    df
    .applymap(lambda x: type(x) == str and x.upper() == "START")
    .stack()
    .loc[lambda x: x]
    .reset_index()
    .apply(lambda x: df.loc[slice(x["level_0"] + 1, None), x['level_1']], axis=1)
    .T
)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM