
pandas read_table with regex header definition

For a data file formatted like this:

("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")
0 0.55432343242 0.34323443432242 0.00001

I can use pd.read_table(filename, sep=' ', header=0) and it gets everything right except for the very first header, "Time Step".

Is there a way to specify a regex string for read_table() to use to parse out the header names?

I know one way to solve this is to use a regex to build a list of names for read_table() to use, but I figured there might (or should) be a way to express that directly in the import itself.

Edit: Here's what it returns as headers:

['("Time', 'Step"', 'courantnumber_max', 'courantnumber_avg', 'flow-time']

So it doesn't appear to be possible to do this inside the pandas.read_table() function itself. Below is the solution I ended up using to fix the problem:

import re
from pathlib import Path

def get_headers(file, headerline, regexstring, exclude=None):
    # Read the selected header line (headerline is 1-based)
    headerstring = ''
    with Path(file).open() as f:
        for i, line in enumerate(f, start=1):
            if i == headerline:
                headerstring = line
                break

    # Split the header line on the regex pattern
    reglist = re.split(regexstring, headerstring)

    # Drop the empty strings that re.split() leaves behind
    filteredlist = list(filter(None, reglist))

    # Drop any entries named in the exclude list
    # (with no exclude list, every non-empty entry is kept)
    exclude = exclude or []
    return [entry for entry in filteredlist if entry not in exclude]

# filename is the path to the data file
get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
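
To feed the resulting list back into pandas, one option is to pass it through names= and skip the raw header line. The following is only a minimal sketch and not part of the original solution. It assumes the header sits on line 1 (as in the sample in the question), uses an in-memory io.StringIO as a stand-in for the real file, and hard-codes the list that get_headers() is expected to return for that header line.

```python
import io
import pandas as pd

# Stand-in for the real data file (assumption: header on line 1).
sample = (
    '("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")\n'
    '0 0.55432343242 0.34323443432242 0.00001\n'
)

# The list get_headers() is expected to return for this header line.
headers = ['Time Step', 'courantnumber_max', 'courantnumber_avg', 'flow-time']

# Skip the raw header line and supply the cleaned names instead.
df = pd.read_table(io.StringIO(sample), sep=' ', names=headers,
                   skiprows=1, header=None)
print(df.columns.tolist())
```

If the header sits on a later line, skiprows would have to be adjusted so that everything up to and including the header line is skipped.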

Code explanation:

get_headers():

Arguments: file is a path object (such as a pathlib.Path) pointing to the file that contains the header. headerline is the line number (starting at 1) where the header names are. regexstring is the pattern that will be fed into re.split(); it is highly recommended to write it as a raw string by prefixing it with r. exclude is a list of miscellaneous strings that you want removed from the header list.

The regex pattern I used:


First up is the pipe (|) symbol. It separates the "normal" split delimiter (the " ") from the other characters that need to be removed (namely the parentheses and the leftover quotes).

Starting with the first group: (?:" "). We use (...) because we want to match those characters in sequence. The " " is what we want to split around. The ?: says not to capture the contents of the group. This matters because otherwise re.split() keeps the captured groups as separate items in the result. See re.split() in the documentation.
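
The effect of ?: on re.split() can be seen on a small string. This is an illustrative sketch; the sample string is made up for the demonstration.

```python
import re

text = '"a" "b"'

# Capturing group: the matched separator is kept in the result.
with_capture = re.split(r'(" ")', text)

# Non-capturing group: the separator is discarded.
without_capture = re.split(r'(?:" ")', text)

print(with_capture)     # ['"a', '" "', 'b"']
print(without_capture)  # ['"a', 'b"']
```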

The second group is simply a character class for the other characters. Without it, the first and last items would be '("Time Step' and 'flow-time")\n'. Note that with it, \n ends up as a separate entry in the list. This is why we use the exclude argument to remove it after the fact.
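
As a rough illustration, here is the split applied to the sample header line with and without the second alternative. The cleanup at the end is a compact stand-in for the filtering that get_headers() does, not the function itself.

```python
import re

header = '("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")\n'

# Only the first alternative: quotes and parentheses stay attached.
only_first = re.split(r'(?:" ")', header)

# Full pattern: stray characters are split off too, leaving empty strings.
full = re.split(r'(?:" ")|["\)\(]', header)

# Drop empty strings, then drop the lone newline entry.
cleaned = [s for s in full if s and s != '\n']

print(only_first)
print(full)
print(cleaned)
```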
