Given a single concatenated string, how to split Pandas DataFrame into columns by index?

Context

Data is coming in as a text file. There are methods that successfully clean, fill, and organize the data prior to this step, ensuring the data is always clean and correct. I cannot change how the data is coming in.

There is no delimiter or value that would separate the columns in the concatenated string. Because of that, I would like to "split" the data into DataFrame columns using known index positions within the concatenated string.

I only need to act on self.data[0]. The input data does not have any headers. This is how I'm importing the data:

import numpy as np
import pandas as pd

in_file = "in/cool_data_dude.txt"

class MyClass:
    def __init__(self, in_file):
        self.data = pd.read_csv(filepath_or_buffer=in_file, delim_whitespace=True, skiprows=0, header=None)

What I've Tried

I've been able to find documentation and Stack Overflow questions about splitting DataFrame data into columns, but everything I've found separates on delimiters or string values, not on index.

Input

Example of incoming file:

ABCTTY3948573774777300000000000100000001000000100000001000003847774111 01AP38888
ABCTTY9991112000200000000000000100000001000000100000001000003337187298 01AP38889
DEFTTY1102938488888300000450000045000004500000450000045000004500000000 03JU40000

Expected Output

The indexes where the splits should occur are always valid and never change. Col1 is always [0] to [2], Col2 is always [3] to [6], etc.

Expected DataFrame output would be (these numbers are incorrect and do not line up with the above input):

0  Col1  Col2  Col3  Col4  Col5  Col6  Col7  Col8  ...
1  ABC   DEF   123   456   789   012   345   678   ...
2  ABC   DEF   135   135   135   000   000   000   ...
...

Do not stress about counting correct index values based on the example given. If I know what the methods and the split look like, I can copy them with the correct indexes.

Take a look at pd.Series.str.slice: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html

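You would apply it to the column holding the concatenated string, e.g. self.data['col1'] = self.data[0].str.slice(start=0, stop=3). Note that stop is exclusive, so characters [0] to [2] need stop=3.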
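To extend this to several columns at once, something along these lines might work. It is only a sketch: the column names and boundaries in COLUMN_SPEC and the helper split_by_index are made up for illustration (they are not the real layout of the file), and the two sample rows are taken from the input shown in the question.

```python
import pandas as pd

# Hypothetical layout: column name -> (start, stop), with stop exclusive.
# Replace these illustrative boundaries with the real, known indexes.
COLUMN_SPEC = {
    "Col1": (0, 3),    # characters [0] to [2]
    "Col2": (3, 7),    # characters [3] to [6]
    "Col3": (7, 20),   # characters [7] to [19]
}


def split_by_index(series, spec):
    """Build a DataFrame with one column per (start, stop) slice of `series`."""
    text = series.astype(str)  # make sure the .str accessor is available
    return pd.DataFrame(
        {name: text.str.slice(start, stop) for name, (start, stop) in spec.items()}
    )


# Stand-in for self.data: column 0 holds the concatenated string.
raw = pd.DataFrame({0: [
    "ABCTTY3948573774777300000000000100000001000000100000001000003847774111",
    "ABCTTY9991112000200000000000000100000001000000100000001000003337187298",
]})

cols = split_by_index(raw[0], COLUMN_SPEC)
print(cols)
```

Inside the class this would presumably be called as split_by_index(self.data[0], COLUMN_SPEC). Keeping the boundaries in one dictionary means the fixed positions only need to be written down once.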
