Reading text table with line-wrap into DataFrame
I want to read a text file table into a DataFrame.
I have text files that contain representations of tables, but some rows are line-wrapped, e.g.:
clock_name total_pwr leakage_pwr
NA* 3.0675e-05 3.0675e-05
CLK1 (1.3333e+02) 6.8333e-02 6.0083e-03
LONGCLKNAME (3.3333e+02)
2.5707e-03 2.0459e-04
LONGCLKNAME2 (3.3333e+02)
1.8777e-03 1.4462e-04
CLK2 (3.3333e+02) 1.4190e-03 1.1886e-04
CLK3 (3.3333e+02)
1.1038e-03 9.3498e-05
Currently I read the table into a string line by line and try to convert it directly to a DataFrame using read_csv. The string will be:
string = "clock_name total_pwr leakage_pwr \n\
NA* 3.0675e-05 3.0675e-05\n\
CLK1 (1.3333e+02) 6.8333e-02 6.0083e-03\n\
LONGCLKNAME\n\
(3.3333e+02) 2.5707e-03 2.0459e-04\n\
LONGCLKNAME2\n\
(3.3333e+02) 1.8777e-03 1.4462e-04\n\
CLK2 (3.3333e+02) 1.4190e-03 1.1886e-04\n\
CLK3 (3.3333e+02) 1.1038e-03 9.3498e-05"
So I've tried:
df = pd.read_csv(StringIO(string), sep='\t')
and I want the following:
clock_name total_pwr leakage_pwr
0 NA* 3.0675e-05 3.0675e-05
1 CLK1 (1.3333e+02) 6.8333e-02 6.0083e-03
2 LONGCLKNAME (3.3333e+02) 2.5707e-03 2.0459...
3 LONGCLKNAME2 (3.3333e+02) 1.8777e-03 1.446...
4 CLK2 (3.3333e+02) 1.4190e-03 1.1886e-04
5 CLK3 (3.3333e+02) 1.1038e-03 9.3498e-05
but get:
clock_name total_pwr leakage_pwr
0 NA* 3.0675e-05 3.0675e-05
1 CLK1 (1.3333e+02) 6.8333e-02 6.0083e-03
2 LONGCLKNAME
3 (3.3333e+02) 2.5707e-03 2.0459e-04
4 LONGCLKNAME2
5 (3.3333e+02) 1.8777e-03 1.4462e-04
6 CLK2 (3.3333e+02) 1.4190e-03 1.1886e-04
7 CLK3 (3.3333e+02) 1.1038e-03 9.3498e-05
How do I work around the line-wrap?
OK, I am going to present some very ugly code. If the example you provided is representative of your data, it will work. I can refactor the code later if you request it.
import re
import pandas as pd

with open("data.txt", "r") as file:
    data = file.read().split("\n")

# Join each wrapped row (a line ending in ")") with the line that follows it.
result = []
ind = 0
while ind < len(data):
    if re.match(r"^[a-zA-Z].+\)$", data[ind]) and ind + 1 < len(data):
        result.append(data[ind].strip() + " " + data[ind + 1].strip())
        ind += 2
    else:
        result.append(data[ind])
        ind += 1

# Split each row into three fields; a four-token row has a two-part clock name.
dict_result = {}
for i, x in enumerate(result):
    tmp = x.split()
    if len(tmp) == 3:
        dict_result[i] = tmp
    elif len(tmp) == 4:
        dict_result[i] = [tmp[0] + " " + tmp[1], tmp[2], tmp[3]]

# The first row holds the column names; promote it to the header.
df_final = pd.DataFrame(dict_result).T
col_names = df_final.iloc[0, :]
df_final.drop(0, axis=0, inplace=True)
df_final.columns = col_names
Here is the output generated by the above code:
Provided there are no surprises in your data beyond what is posted, this should do the trick. The code is ugly, but I hope it helps :)
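For what it's worth, here is a rough sketch of a more compact variant of the same idea. It is not the code above, just one possible alternative. It assumes that the first non-empty line is the header and that every logical row ends with exactly two numeric fields, so a row is considered complete as soon as its last two tokens parse as floats. The helper names read_wrapped_table and is_number are made up for this example, and the sample text is adapted from the question (it mixes both wrap styles that appear there).

```python
from io import StringIO

import pandas as pd


def is_number(token):
    """Return True if the token parses as a float (e.g. '2.5707e-03')."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def read_wrapped_table(handle, n_numeric=2):
    """Parse a whitespace-separated table whose rows may wrap across lines.

    Assumption: the first non-empty line is the header, and every data row
    ends with `n_numeric` numeric fields; anything before them is the name.
    """
    lines = [line.split() for line in handle if line.strip()]
    header, rows, pending = lines[0], [], []
    for tokens in lines[1:]:
        pending.extend(tokens)
        # A logical row is complete once it has a name plus the numeric tail.
        tail = pending[-n_numeric:]
        if len(pending) > n_numeric and all(is_number(t) for t in tail):
            name = " ".join(pending[:-n_numeric])
            rows.append([name] + [float(t) for t in tail])
            pending = []
    return pd.DataFrame(rows, columns=header)


# Sample adapted from the question; both wrap styles are included.
sample = """clock_name total_pwr leakage_pwr
NA* 3.0675e-05 3.0675e-05
CLK1 (1.3333e+02) 6.8333e-02 6.0083e-03
LONGCLKNAME (3.3333e+02)
2.5707e-03 2.0459e-04
LONGCLKNAME2
(3.3333e+02) 1.8777e-03 1.4462e-04
CLK2 (3.3333e+02) 1.4190e-03 1.1886e-04
"""

df = read_wrapped_table(StringIO(sample))
print(df)
```

Because rows are rebuilt from tokens rather than from physical lines, it does not matter where the wrap falls. The catch is that a clock name which itself looks like a number would confuse it, so check that against your real files.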