
Python convert single column into multiple Data Frames

I've used Python and pytesseract to run OCR on an image. Here's my code:

result = pytesseract.image_to_string(img)

and then I converted that to a data frame:

data = io.StringIO(result)
df = pd.read_csv(data, index_col=False, sep=",")

However, this stores all the data from the image in a single column, formatted like this:

TimeLine (column header)
schedule
log_in
log_out
Advisor (should be the second column header)
James
Mathew
Kent

I want to split the column horizontally into separate data frames, so that the result is formatted like this:

Timeline (header)   Advisor (header)
Schedule            James
Log_in              Mathew
Log_out             Kent

The issue is that the values are not all the same, so I can't use a groupby. I can't use df.iloc[0:3] either, because the values will not consistently be on the same rows every time I do this. I've tried new_df = df.loc[:'Advisor'] to define a new data frame, but all that does is return the entire data frame without an error.

Is there a way to split a column horizontally into a separate data frame based on a unique cell value? Something like: split df where the column value = 'Advisor'.

The split function is easy to use if I want to split things vertically. But I can't see an easy way to split a column horizontally based on a unique value within that column.

I'm super frustrated because this has to be something that happens all the time but I've been looking around for hours and can't find any solution.

This needs a few steps.

We start with the df:

import pandas as pd

current_df = pd.DataFrame({'Timeline': ['Schedule', 'Log_in', 'Log_out', 'Advisor', 'James', 'Mathew', 'Kent']})

For current_df, we need to set the row and column indices by breaking its single column at the Advisor value. This is slightly complicated because the first header (Timeline) is a column label while the second (Advisor) is a value, but we can do it like so:

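# Move the column label ('Timeline') into the data so it becomes an ordinary row
df = current_df.T.reset_index().T
# Block number: increases by 1 at the 'Advisor' row
df['colnum'] = (df[0]=='Advisor').cumsum()
# Position of each row within its block (0 = the block's header)
df['rownum'] = df.groupby('colnum', group_keys = False).cumcount()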

At this point df looks like this. Note the colnum and rownum columns, which will be used later:

       0           colnum    rownum
-----  --------  --------  --------
index  Timeline         0         0
0      Schedule         0         1
1      Log_in           0         2
2      Log_out          0         3
3      Advisor          1         0
4      James            1         1
5      Mathew           1         2
6      Kent             1         3
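
The two helper columns can also be checked in isolation. The following is a minimal sketch, assuming the same eight values as in the table above; the names values and is_header are illustrative and not part of the original answer.

```python
import pandas as pd

# The former column label followed by the seven values, as in the table above.
values = pd.Series(['Timeline', 'Schedule', 'Log_in', 'Log_out',
                    'Advisor', 'James', 'Mathew', 'Kent'])

# True only on the row that starts the second column.
is_header = values == 'Advisor'

# cumsum() turns the boolean marker into a running block number:
# 0 for every row before 'Advisor', 1 from 'Advisor' onward.
colnum = is_header.cumsum()

# cumcount() numbers the rows within each block starting at 0,
# so row 0 of each block is that block's header.
rownum = values.groupby(colnum).cumcount()

print(colnum.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1]
print(rownum.tolist())  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Because the block number only changes where the marker value appears, this does not depend on Advisor being on any particular row.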

Once these values are set, it becomes a simple matter of unstacking, followed by a few cleanup operations:

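df = (df.set_index(['rownum', 'colnum'])
        .unstack()                      # one column per colnum block
        .droplevel(level=0,axis=1)      # drop the leftover 0 column level
        .T.set_index(0).T               # promote row 0 (the headers) to column labels
)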

Output:

  rownum  Timeline    Advisor
--------  ----------  ---------
       1  Schedule    James
       2  Log_in      Mathew
       3  Log_out     Kent
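
If the OCR text may contain more than two headers, the same cumsum idea can be wrapped in a small function. This is only a sketch under stated assumptions: split_on_headers is a hypothetical helper (not a pandas function), the header values are known in advance and unique, and the first value in the column is a header. It swaps the unstack step for a plain groupby loop plus pd.concat, which also leaves each block available as its own Series.

```python
import pandas as pd

def split_on_headers(column, headers):
    """Hypothetical helper: turn one long column into side-by-side columns.

    `column` is a Series that contains the header names among its values;
    `headers` lists the values that start a new column.
    """
    values = column.reset_index(drop=True)
    is_header = values.isin(headers)
    if values.empty or not is_header.iloc[0]:
        raise ValueError("the first value must be one of the headers")

    # Running block number: increases by 1 at every header row.
    block = is_header.cumsum()

    pieces = {}
    for _, part in values.groupby(block):
        # The first row of each block is its header; the rest is its data.
        pieces[part.iloc[0]] = part.iloc[1:].reset_index(drop=True)

    # Dict keys become the column labels; shorter blocks are padded with NaN.
    return pd.concat(pieces, axis=1)

current_df = pd.DataFrame({'Timeline': ['Schedule', 'Log_in', 'Log_out',
                                        'Advisor', 'James', 'Mathew', 'Kent']})

# The first header is stored as the column label, so put it back in front
# of the values before splitting.
column = pd.concat([pd.Series(current_df.columns), current_df['Timeline']],
                   ignore_index=True)

result = split_on_headers(column, ['Timeline', 'Advisor'])
print(result['Advisor'].tolist())  # ['James', 'Mathew', 'Kent']
```

The pieces dict inside the function holds one Series per header, so returning it instead of the concatenated frame would give the separate objects the question originally asked for.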

