
Reading text table with line-wrap into DataFrame

I want to read a table from a text file into a DataFrame.

I have text files that contain representations of tables, but some rows are line-wrapped, e.g.:

clock_name         total_pwr     leakage_pwr
NA*                3.0675e-05    3.0675e-05
CLK1 (1.3333e+02)  6.8333e-02    6.0083e-03
LONGCLKNAME (3.3333e+02)
                   2.5707e-03    2.0459e-04     
LONGCLKNAME2 (3.3333e+02)
                   1.8777e-03    1.4462e-04     
CLK2 (3.3333e+02)   1.4190e-03    1.1886e-04    
CLK3 (3.3333e+02)
                   1.1038e-03    9.3498e-05  

Currently I read the table into a string line by line and try to convert it directly to a DataFrame using read_csv. The string will be:

string = "clock_name         total_pwr     leakage_pwr    \n\
NA*                3.0675e-05    3.0675e-05\n\
CLK1 (1.3333e+02)  6.8333e-02    6.0083e-03\n\
LONGCLKNAME\n\
 (3.3333e+02)  2.5707e-03    2.0459e-04\n\
LONGCLKNAME2\n\
 (3.3333e+02)  1.8777e-03    1.4462e-04\n\
CLK2 (3.3333e+02)   1.4190e-03    1.1886e-04\n\
CLK3 (3.3333e+02)  1.1038e-03    9.3498e-05"

So I've tried:

df = pd.read_csv(StringIO(string), sep='\t')

and I want the following:

   clock_name         total_pwr     leakage_pwr
0        NA*                3.0675e-05    3.0675e-05
1        CLK1 (1.3333e+02)  6.8333e-02    6.0083e-03
2  LONGCLKNAME (3.3333e+02)  2.5707e-03    2.0459...
3  LONGCLKNAME2 (3.3333e+02)  1.8777e-03    1.446...
4       CLK2 (3.3333e+02)   1.4190e-03    1.1886e-04
5        CLK3 (3.3333e+02)  1.1038e-03    9.3498e-05

but I get:

  clock_name         total_pwr     leakage_pwr
0      NA*                3.0675e-05    3.0675e-05
1      CLK1 (1.3333e+02)  6.8333e-02    6.0083e-03
2                                      LONGCLKNAME
3           (3.3333e+02)  2.5707e-03    2.0459e-04
4                                     LONGCLKNAME2
5           (3.3333e+02)  1.8777e-03    1.4462e-04
6     CLK2 (3.3333e+02)   1.4190e-03    1.1886e-04
7      CLK3 (3.3333e+02)  1.1038e-03    9.3498e-05
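A quick check suggests why the result has a single column: the string appears to be aligned with spaces and contains no tab characters, so sep='\t' never splits a line. The sketch below uses a shortened copy of the string (the name `text` is only for illustration):

```python
from io import StringIO

import pandas as pd

# Shortened, illustrative copy of the string from the question.
# It is aligned with spaces only; there are no tab characters in it.
text = (
    "clock_name         total_pwr     leakage_pwr\n"
    "NA*                3.0675e-05    3.0675e-05\n"
    "LONGCLKNAME\n"
    " (3.3333e+02)  2.5707e-03    2.0459e-04\n"
)

df = pd.read_csv(StringIO(text), sep="\t")

# With no tab in the input, every line is read as one single field,
# so the whole header line becomes the only column name.
print(df.shape)
print(list(df.columns))
```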

How do I work around the line-wrap?

OK, this code is very ugly, but it will work if the example you provided is representative of your data. I can refactor it later on request.

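import re
import pandas as pd

# Read the whole file and split it into lines.
with open("data.txt", "r") as file:
    data = file.read().split("\n")

# Merge each wrapped row: a line that starts with a letter and ends with ")"
# holds only the clock name, and its values are on the following line.
result = []
ind = 0
while ind < len(data):
    if re.match(r"^[a-zA-Z].+\)$", data[ind]) and ind + 1 < len(data):
        result.append(data[ind].strip() + data[ind + 1])
        ind += 2
    else:
        result.append(data[ind])
        ind += 1

# Split each merged line on whitespace into three fields.
dict_result = {}
for i, x in enumerate(result):
    tmp = x.split()
    if len(tmp) == 3:
        dict_result[i] = tmp
    if len(tmp) == 4:
        # The clock name and its parenthesised value belong to one column.
        dict_result[i] = [tmp[0] + " " + tmp[1], tmp[2], tmp[3]]

df_final = pd.DataFrame(dict_result).T

# Promote the first row (the header line) to column names.
col_names = df_final.iloc[0, :]
df_final.drop(0, axis=0, inplace=True)
df_final.columns = col_names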

Here is the output generated by the above code:

[screenshot of the resulting DataFrame]

Provided there are no surprises in your data beyond what is posted, this should do the trick. The code is ugly, but I hope it helps :)
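For reference, the same merging idea can be sketched without the regex and without a file on disk. This is only a sketch under stated assumptions, not a drop-in replacement: it assumes the first non-empty line is the header and that every row ends with exactly two numeric values, so a row is treated as complete once its last two tokens parse as floats. The names `read_wrapped_table`, `_is_number` and `sample` are made up for illustration.

```python
import pandas as pd


def _is_number(token):
    """Return True if the token parses as a float (hypothetical helper)."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def read_wrapped_table(text):
    """Parse a whitespace-aligned table whose rows may wrap onto a second line.

    Assumes the first non-empty line is the header and that every row ends
    with exactly two numeric values.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()
    rows, pending = [], []
    for line in lines[1:]:
        # Keep collecting tokens until the row looks complete.
        pending.extend(line.split())
        # A row is complete once its last two tokens are both numbers.
        if len(pending) >= 3 and all(_is_number(t) for t in pending[-2:]):
            name = " ".join(pending[:-2])
            rows.append([name, float(pending[-2]), float(pending[-1])])
            pending = []
    return pd.DataFrame(rows, columns=columns)


# Shortened sample in the same layout as the table in the question.
sample = """clock_name         total_pwr     leakage_pwr
NA*                3.0675e-05    3.0675e-05
CLK1 (1.3333e+02)  6.8333e-02    6.0083e-03
LONGCLKNAME (3.3333e+02)
                   2.5707e-03    2.0459e-04
CLK3 (3.3333e+02)
                   1.1038e-03    9.3498e-05
"""

df = read_wrapped_table(sample)
print(df)
```

Because tokens are accumulated across lines, this should handle a wrap that falls after the clock name as well as one that falls after the parenthesised value. Unlike the code above, the two power columns come out as floats rather than strings.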
