简体   繁体   中英

Python dataframe from text file

I have a text file of the following format:

//DATASET

..... unnecessary lines.....

TIMEUNITS SECONDS

    TS 0  1.98849600e+08
        3.30000000e-03    1.25400000e-02    5.88000000e-03    0.00000000e+00    0.00000000e+00
        5.88000000e-03    3.33000000e-03    2.16000000e-03    0.00000000e+00    0.00000000e+00
    TS 0  1.98853209e+08
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
        1.25400000e-02    5.88000000e-03    3.33000000e-03    0.00000000e+00    0.00000000e+00
    TS 0  1.98860419e+08
        3.33000000e-03    2.16000000e-03    1.08000000e-03    0.00000000e+00    0.00000000e+00
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
    TS 0  1.98864081e+08
        1.08000000e-03    8.70000000e-04    7.20000000e-04    0.00000000e+00    0.00000000e+00
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
    TS 0  1.98867619e+08
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
        3.33000000e-03    2.16000000e-03    1.08000000e-03    0.00000000e+00    0.00000000e+00

I am also attaching the sample text file named "D50.bc" in this link: https://drive.google.com/file/d/1P5aFC0JsRLhwuUo7JENLg03DbDJ696lk/view?usp=sharing .

There are no column names but it is fine to add column names ie, V1, V2 etc. In the real text file, there are 14 columns and 1000s of lines/rows after each TS. Each row corresponds to a node and the columns correspond to certain values(ie, velocity/shear stress, etc.) at that node

I want to extract all the data/tables under the lines "TS 0 XXX" into individual dfs based on their timestamps(TS) so that I can do column operations for each TS. The TS value XXXX which is in second can be added as a separate column in the dfs. Following is the pattern of output I want in a pandas dataframe:

#          V1      V2      V3 V4 V5 grp node        TS                      TS0
# 1.1 0.00330 0.01254 0.00588  0  0   1    1 198849600     TS 0  1.98849600e+08
# 1.2 0.00588 0.00333 0.00216  0  0   1    2 198849600     TS 0  1.98849600e+08
# 2.1 0.00000 0.00000 0.00000  0  0   2    1 198853209     TS 0  1.98853209e+08
# 2.2 0.01254 0.00588 0.00333  0  0   2    2 198853209     TS 0  1.98853209e+08
# 3.1 0.00333 0.00216 0.00108  0  0   3    1 198860419     TS 0  1.98860419e+08
# 3.2 0.00000 0.00000 0.00000  0  0   3    2 198860419     TS 0  1.98860419e+08
# 4.1 0.00108 0.00087 0.00072  0  0   4    1 198864081     TS 0  1.98864081e+08
# 4.2 0.00000 0.00000 0.00000  0  0   4    2 198864081     TS 0  1.98864081e+08
# 5.1 0.00000 0.00000 0.00000  0  0   5    1 198867619     TS 0  1.98867619e+08
# 5.2 0.00333 0.00216 0.00108  0  0   5    2 198867619     TS 0  1.98867619e+08

Here is the R code that solved the problem but I want a python code doing the same thing.

spltxt <- split(txt, cumsum(grepl("^\\s*TS 0 ", txt)))[-1]
alldat <- Map(function(S, grp) {
  out <- read.table(text = S[-1], header = FALSE)
  out$grp <- grp
  out$node <- seq_len(nrow(out))
  TS <- trimws(strsplit(S[1], "\\s+")[[1]])
  out$TS <- as.numeric(TS[length(TS)])
  out$TS0 <- S[1]
  out
}, spltxt, seq_along(spltxt))
out <- do.call(rbind, alldat)

I'll give it a shot:

import pandas as pd

#open and read file
with open('D50.bc') as f:
    text = f.read()

#create list of TS groups
data_string = [i.splitlines() for i in text.split('TIMEUNITS SECONDS\n')[1].split('TS ')[1:]]

#create nested dictionary of TS values
data = [{'TS0': f'TS {i[0]}', 'TS': int(float((i[0].split()[-1]))), 'grp': n1, 'data':[{'data':{f'V{n3}': float(z) for n3, z in enumerate(x.split('    ')[1:])}, 'node':n2} for n2, x in enumerate(i[1:])]} for n1, i in enumerate(data_string)]

#load to dataframe and flatten nested dict
df = pd.DataFrame(d).explode('data')
df = df.join(pd.DataFrame(df.pop('data').values.tolist()))
df = df.join(pd.DataFrame(df.pop('data').values.tolist()))

Output:

TS0 TS grp node V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0 TS 0 1.98849600e+08 198849600 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 TS 0 1.98849600e+08 198849600 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 TS 0 1.98849600e+08 198849600 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 TS 0 1.98849600e+08 198849600 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 TS 0 1.98849600e+08 198849600 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

I wrote this with an eye towards speed and efficiency, with the goal of avoiding unnecessary file copies, and to avoid holding the entire file in memory at once. For the same reason, it uses the Pandas C engine to read the file. The logic ends up being more complex than the other answer.

#!/usr/bin/env python3
import io
import pandas as pd


def parse_chunk(current_ts_chunk, index, current_ts):
    buf = io.StringIO(''.join(current_ts_chunk))
    chunk = pd.read_csv(buf, delim_whitespace=True, header=None)
    chunk['group'] = index
    chunk['node'] = chunk.index + 1
    chunk['TS'] = current_ts.split()[2]
    chunk['TS0'] = current_ts
    return chunk


with open("D50.bc", "rt") as f:
    # Advance through file until you get to TIMEUNITS SECONDS
    for line in f:
        if line.startswith("TIMEUNITS SECONDS"):
            break
    chunk_list = []
    current_ts = None
    current_ts_chunk = []
    index = 1
    for line in f:
        if line.startswith("TS "):
            if current_ts is None:
                # First TS in file
                current_ts = line.strip()
            else:
                # Write out previous TS
                chunk = parse_chunk(current_ts_chunk, index, current_ts)
                chunk_list.append(chunk)
                # Get ready to accept current TS
                current_ts = line.strip()
                current_ts_chunk = []
                index += 1
        else:
            # This is a data line
            current_ts_chunk.append(line)
    if current_ts is not None:
        chunk = parse_chunk(current_ts_chunk, index, current_ts)
        chunk_list.append(chunk)

    df = pd.concat(chunk_list)

print(df)

which gives this result:

         0       1        2  ...  node              TS                   TS0
0  0.00000  0.0000  0.00000  ...     1  1.98849600e+08  TS 0  1.98849600e+08
1  0.00000  0.0000  0.00000  ...     2  1.98849600e+08  TS 0  1.98849600e+08
2  0.00000  0.0000  0.00000  ...     3  1.98849600e+08  TS 0  1.98849600e+08
3  0.00000  0.0000  0.00000  ...     4  1.98849600e+08  TS 0  1.98849600e+08
4  0.00000  0.0000  0.00000  ...     5  1.98849600e+08  TS 0  1.98849600e+08
5  0.00000  0.0000  0.00000  ...     6  1.98849600e+08  TS 0  1.98849600e+08
6  0.00000  0.0000  0.00000  ...     7  1.98849600e+08  TS 0  1.98849600e+08
7  0.00000  0.0000  0.00000  ...     8  1.98849600e+08  TS 0  1.98849600e+08

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM