How to create a Pandas df from a haphazard .dat file?

I have a .dat file that looks like this:

6.74E+01  "methane"                                        "74-82-8"     "L"
5.06E+01  "ethane"                                         "74-84-0"     "L"
7.16E+01  "propane"                                        "74-98-6"     "L"
9.59E+01  "butane"                                         "106-97-8"    "L"
1.20E+02  "2-methylpropane"                                "75-28-5"     "L"
3.73E+02  "dimethylpropane"                                "463-82-1"    "L"
1.25E+02  "pentane"                                        "109-66-0"    "L"

This .dat file appears to have been created haphazardly. As far as I can tell, the columns are separated by varying numbers of spaces. Further down the file, some rows also have one extra column for comments. I need to read this into a Pandas DataFrame. I have tried:

raw = pd.read_table(r'FILE PATH')
raw.columns = ['Value', 'Name', 'Numbers', 'Letter']

This throws an error: "Exception has occurred: ValueError Length mismatch: Expected axis has 1 elements, new values have 4 elements"

I was expecting an error, but this one makes it look like there is only 1 column. I am totally at a loss and I hope someone can help. Thanks

Edit: The extra columns are separated by a single space:

1.01E-02  "2,3-benzindene"                                 "86-73-7"     "M" ! fluorene

Assuming that columns are defined by runs of whitespace, you can use the delim_whitespace=True argument of read_table. (From pandas 2.2 onward this argument is deprecated in favor of the equivalent sep=r'\s+'.)

I assume that the file does not contain a header line. Specifying the column names through the names argument avoids two problems: a) the first line being interpreted as a header line, and b) the parser being confused by the "extra columns".

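import pandas as pd
# filename is the path to the .dat file; names= also tells pandas there is no header row
raw = pd.read_table(filename, delim_whitespace=True,
                    names=['Value', 'Name', 'Numbers', 'Letter'])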

Result of print(raw):

      Value             Name   Numbers Letter
0   67.4000          methane   74-82-8      L
1   50.6000           ethane   74-84-0      L
2   71.6000          propane   74-98-6      L
3   95.9000           butane  106-97-8      L
4  120.0000  2-methylpropane   75-28-5      L
5  373.0000  dimethylpropane  463-82-1      L
6  125.0000          pentane  109-66-0      L
7    0.0101   2,3-benzindene   86-73-7      M
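For context, the Length mismatch in the question arises because read_table splits on tab characters by default, so a space-separated file is parsed as a single column. The sketch below reproduces that on three rows taken from the question and then parses them with sep=r'\s+', the whitespace-splitting spelling that newer pandas versions prefer. It also assumes that trailing comments always start with `!` (inferred from the single example row) and passes comment='!' so they are discarded; the SAMPLE string and the variable names are illustrative only.

```python
import io
import pandas as pd

# Three rows copied from the question, including the one with a trailing comment.
SAMPLE = '''6.74E+01  "methane"                                        "74-82-8"     "L"
5.06E+01  "ethane"                                         "74-84-0"     "L"
1.01E-02  "2,3-benzindene"                                 "86-73-7"     "M" ! fluorene
'''

# The default separator is a tab, so every line lands in a single column
# (and the first line is consumed as the header).
one_col = pd.read_table(io.StringIO(SAMPLE))
print(one_col.shape)

# Split on runs of whitespace instead; quoted fields are unquoted by the parser.
# comment='!' is an assumption: it drops everything from '!' to the end of the line.
raw = pd.read_table(
    io.StringIO(SAMPLE),
    sep=r'\s+',
    names=['Value', 'Name', 'Numbers', 'Letter'],
    comment='!',
)
print(raw)
```

If a `!` could ever appear inside a quoted name, the comment option would be a poor fit, and reading the file line by line would probably be safer.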

You can try to open the file and load the data manually. I'm using the standard shlex module to get rid of the quotes:

import shlex
import pandas as pd


data = []
with open('your_file.dat', 'r') as f_in:
    for line in f_in:
        line = line.strip()
        if not line:
            continue
        data.append(shlex.split(line)[:4])

df = pd.DataFrame(data, columns=['Value', 'Name', 'Numbers', 'Letter'])
print(df)

Prints:

      Value             Name   Numbers Letter
0  6.74E+01          methane   74-82-8      L
1  5.06E+01           ethane   74-84-0      L
2  7.16E+01          propane   74-98-6      L
3  9.59E+01           butane  106-97-8      L
4  1.20E+02  2-methylpropane   75-28-5      L
5  3.73E+02  dimethylpropane  463-82-1      L
6  1.25E+02          pentane  109-66-0      L
7  1.01E-02   2,3-benzindene   86-73-7      M
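Note that with this manual approach every column, including Value, holds strings, and the trailing comment is dropped by the [:4] slice. One possible extension is sketched below: it converts Value with pd.to_numeric and keeps the comment in a fifth column. The parse_dat helper, the Comment column name, and the assumption that a comment is introduced by a standalone `!` token are all illustrative and not part of the original answer.

```python
import io
import shlex
import pandas as pd


def parse_dat(lines):
    """Parse whitespace-separated, quoted rows (hypothetical helper)."""
    rows = []
    for line in lines:
        tokens = shlex.split(line)  # splits on whitespace and strips the quotes
        if not tokens:
            continue  # skip blank lines
        fields, extra = tokens[:4], tokens[4:]
        # Assumption: a comment is marked by a standalone '!' token.
        if extra and extra[0] == '!':
            extra = extra[1:]
        rows.append(fields + [' '.join(extra) or None])
    df = pd.DataFrame(rows, columns=['Value', 'Name', 'Numbers', 'Letter', 'Comment'])
    df['Value'] = pd.to_numeric(df['Value'])  # scientific-notation strings become floats
    return df


# Two sample rows from the question, with a blank line in between.
sample = io.StringIO(
    '6.74E+01  "methane"         "74-82-8"  "L"\n'
    '\n'
    '1.01E-02  "2,3-benzindene"  "86-73-7"  "M" ! fluorene\n'
)
df = parse_dat(sample)
print(df)
```

With a real file, the same helper could be called as parse_dat(open('your_file.dat')). Rows with fewer than four fields are not handled in this sketch.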
