
Splitting a DataFrame column into columns with regex?

I have a DataFrame with two columns; the second one has the following format:

1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }
2 {{continuity 1.0000e+00} {x-velocity 7.8788e-04} {y-velocity 1.2617e+01} {z-velocity 9.0445e-04} {energy 4.5605e-06} }
3 {{continuity 2.3250e-01} {x-velocity 1.6896e-03} {y-velocity 1.2536e-02} {z-velocity 9.8980e-03} {energy 3.4032e-06} }
4 {{continuity 8.0243e-02} {x-velocity 2.2180e-03} {y-velocity 1.3189e-02} {z-velocity 1.0225e-02} {energy 4.6336e-07} }
5 {{continuity 7.0923e-02} {x-velocity 2.2674e-03} {y-velocity 1.2308e-02} 

I'm trying to use a regex to split it into columns by taking the first number, then all the numbers inside the braces "{}", and giving them the following names:

names=['iter', 'x', 'x-vel', 'y-vel', 'z-vel', 'energy']

However, I can't seem to make the regular expression work. Here's what I'm doing in a simple example:

Input

>>> import re
>>> a = "1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }"
>>> re.findall(r"(\d*) {*\{\D*(.*?)\}", a)

Result

[('1', '1.0000e+00'), ('', '0.0000e+00'), ('', '4.4010e-02'), ('', '9.5681e-04'), ('', '1.1549e-07')]

As you can see, my regex keeps looking for a leading number at every {} occurrence, but I don't want that to happen. How can I avoid it?

Expected Behavior

[('1'), ('1.0000e+00'), ('0.0000e+00'), ('4.4010e-02'), ('9.5681e-04'), ('1.1549e-07')]

Once my regular expression works, I want to assign all the columns with a line that would look something like this:

df[names] = df.first.str.extract(r'(\d*) {*\{\D*(.*?)\}', expand=True)

I'm really new to DataFrames. Is this the correct approach for this problem?

Any help would be much appreciated, thanks in advance!

First, let's make a Series from some of the data in the question.

import pandas as pd    

data = pd.Series('''\
1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }
2 {{continuity 1.0000e+00} {x-velocity 7.8788e-04} {y-velocity 1.2617e+01} {z-velocity 9.0445e-04} {energy 4.5605e-06} }
3 {{continuity 2.3250e-01} {x-velocity 1.6896e-03} {y-velocity 1.2536e-02} {z-velocity 9.8980e-03} {energy 3.4032e-06} }
4 {{continuity 8.0243e-02} {x-velocity 2.2180e-03} {y-velocity 1.3189e-02} {z-velocity 1.0225e-02} {energy 4.6336e-07} }'''
          .split('\n'))
print(data)

0    1 {{continuity 1.0000e+00} {x-velocity 0.0000e...
1    2 {{continuity 1.0000e+00} {x-velocity 7.8788e...
2    3 {{continuity 2.3250e-01} {x-velocity 1.6896e...
3    4 {{continuity 8.0243e-02} {x-velocity 2.2180e...
dtype: object

The first option is a simple regex that finds all the numbers in order. Use extractall to find every match in each string. This may be good enough, although you still have to name the columns, which isn't hard. The result has a MultiIndex (a slightly more advanced feature): there is one row per match, indexed by the original row and the match number, and one column per capture group (this regex has only one group). Hence the need to .unstack() it.

print(data.str.extractall(r'(\d[\d.e+-]*)').unstack())

match  0           1           2           3           4           5
0      1  1.0000e+00  0.0000e+00  4.4010e-02  9.5681e-04  1.1549e-07
1      2  1.0000e+00  7.8788e-04  1.2617e+01  9.0445e-04  4.5605e-06
2      3  2.3250e-01  1.6896e-03  1.2536e-02  9.8980e-03  3.4032e-06
3      4  8.0243e-02  2.2180e-03  1.3189e-02  1.0225e-02  4.6336e-07     
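Naming the columns and converting the extracted strings to numbers could look roughly like the sketch below. It assumes every row yields exactly six matches and reuses the names list from the question. Selecting column 0 before unstacking is my own shortcut to avoid a MultiIndex on the columns; it is not part of the code above.

```python
import pandas as pd

# Two sample rows copied from the question.
data = pd.Series([
    "1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }",
    "2 {{continuity 1.0000e+00} {x-velocity 7.8788e-04} {y-velocity 1.2617e+01} {z-velocity 9.0445e-04} {energy 4.5605e-06} }",
])

# Column names taken from the question.
names = ['iter', 'x', 'x-vel', 'y-vel', 'z-vel', 'energy']

# Pick the single capture group (column 0) first, so that unstack()
# gives plain columns keyed by match number (0..5).
wide = data.str.extractall(r'(\d[\d.e+-]*)')[0].unstack()
wide.columns = names

# The extracted values are still strings; convert them to numbers.
wide = wide.astype(float).astype({'iter': int})
print(wide)
```

If some rows are truncated and produce fewer matches, unstack() fills the gaps with NaN, and the int conversion of a column containing NaN would fail, so this only suits complete rows.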

Alternatively, you can use a regex with named groups. It is fairly straightforward to build such a template from one of the strings. The group names from the regex become the column names, but they must be valid Python identifiers (x_vel, not x-vel). This is probably what you want anyway, since it lets you access the columns as attributes (df.x_vel instead of df['x-vel']). The (?P<foo>...) named group syntax is explained in the re module docs.

print(
    data.str.extract(r'(?P<iter>\d+) {{continuity (?P<x>[^}]+)} {x-velocity (?P<x_vel>[^}]+)} {y-velocity (?P<y_vel>[^}]+)} {z-velocity (?P<z_vel>[^}]+)} {energy (?P<energy>[^}]+)} }',
                     expand=False))

  iter           x       x_vel       y_vel       z_vel      energy
0    1  1.0000e+00  0.0000e+00  4.4010e-02  9.5681e-04  1.1549e-07
1    2  1.0000e+00  7.8788e-04  1.2617e+01  9.0445e-04  4.5605e-06
2    3  2.3250e-01  1.6896e-03  1.2536e-02  9.8980e-03  3.4032e-06
3    4  8.0243e-02  2.2180e-03  1.3189e-02  1.0225e-02  4.6336e-07

Note that we're using extract instead of extractall here because the regex matches each string only once and captures every field in its own group.
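To get these columns back into the original DataFrame, as the question's last line tries to do, something like the following sketch might work. The frame df and its column names 'id' and 'raw' are hypothetical stand-ins, since the question doesn't show the real ones. The literal braces are escaped here only to make the pattern unambiguous.

```python
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame; the column names
# 'id' and 'raw' are assumptions, not taken from the question.
df = pd.DataFrame({
    'id': ['a', 'b'],
    'raw': [
        "1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }",
        "2 {{continuity 1.0000e+00} {x-velocity 7.8788e-04} {y-velocity 1.2617e+01} {z-velocity 9.0445e-04} {energy 4.5605e-06} }",
    ],
})

# Same named-group template as above, split over several lines,
# with the literal braces escaped.
pattern = (
    r'(?P<iter>\d+) '
    r'\{\{continuity (?P<x>[^}]+)\} '
    r'\{x-velocity (?P<x_vel>[^}]+)\} '
    r'\{y-velocity (?P<y_vel>[^}]+)\} '
    r'\{z-velocity (?P<z_vel>[^}]+)\} '
    r'\{energy (?P<energy>[^}]+)\} \}'
)

# One row of named columns per input string; values are still strings.
parts = df['raw'].str.extract(pattern)

# Convert to numbers, then attach the new columns to the original frame.
parts = parts.astype(float).astype({'iter': int})
df = df.join(parts)
print(df.drop(columns='raw'))
```

Rows that don't match the template (for example a truncated line) come back as NaN from extract, so the int conversion would need to be dropped or guarded in that case.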
