Split string to columns using text as column headers and column values in pandas

I have a df that has 1 column where each row contains a string. It looks like this:

df
data
in 9.14 out 9.66 type 0.0
in 9.67 out 9.69 type 0.0
in 9.70 out 10.66 type 0.0
in 10.67 out 11.34 type 2.0
in 11.35 out 12.11 type 2.0

I want to split the text of this column into multiple columns. I want to use the words [in, out, type] as column headers, and the value following each word as the row value. The result will have 3 columns labeled in, out and type, and will look like this:

df
in     out    type
9.14   9.66   0.0
9.67   9.69   0.0
9.70   10.66  0.0
10.67  11.34  2.0
11.35  12.11  2.0

Thanks!

If you know in advance what the words will be, and can also guarantee that there won't be any bad data, this is a simple str.extract problem: you can write one regular expression that captures each value in a named group, and the group names become the column names of the resulting DataFrame in a single pass. That regular expression for your sample data is given in Option 2.

However, for the sake of demonstration, it is better to assume that you might have bad data, and that you might not know in advance what your column names are. In that case, you can use str.extractall and some unstacking.

Option 1
extractall + set_index + unstack

generic_regex = r'([a-zA-Z]+)[^0-9]+([0-9\.]+)'

df['data'].str.extractall(generic_regex).set_index(0, append=True)[1].unstack([0, 1])

0         in    out type
match      0      1    2
0       9.14   9.66  0.0
1       9.67   9.69  0.0
2       9.70  10.66  0.0
3      10.67  11.34  2.0
4      11.35  12.11  2.0
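
To make the intermediate steps easier to follow, here is a self-contained sketch of a variation on Option 1. It is not the exact chain above: it drops the match level so the result has plain single-level columns, and it converts the captured strings to floats. The sample frame is rebuilt from the question, and the names pairs and wide are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question; the column name 'data' follows the answer's code.
df = pd.DataFrame({"data": [
    "in 9.14 out 9.66 type 0.0",
    "in 9.67 out 9.69 type 0.0",
    "in 9.70 out 10.66 type 0.0",
    "in 10.67 out 11.34 type 2.0",
    "in 11.35 out 12.11 type 2.0",
]})

generic_regex = r'([a-zA-Z]+)[^0-9]+([0-9\.]+)'

# Step 1: one row per (original row, match number).
# Column 0 holds the word, column 1 holds the number that follows it.
pairs = df['data'].str.extractall(generic_regex)

# Step 2: drop the match counter, move the word into the index,
# then pivot the words into columns.
wide = (
    pairs.reset_index(level='match', drop=True)
         .set_index(0, append=True)[1]
         .unstack()
)
wide.columns.name = None     # remove the leftover "0" label on the columns
wide = wide.astype(float)    # extractall returns strings

print(wide)
```

One trade-off of this variation: because the match level is dropped, a row that repeats the same word would produce duplicate index entries and unstack would refuse to reshape it, so it assumes each word appears at most once per row.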

Option 2
Define an explicit regex and use extract


rgx = r'in\s+(?P<in>[^\s]+)\s+out\s+(?P<out>[^\s]+)\s+type\s+(?P<type>[^\s]+)'

df['data'].str.extract(rgx)

      in    out type
0   9.14   9.66  0.0
1   9.67   9.69  0.0
2   9.70  10.66  0.0
3  10.67  11.34  2.0
4  11.35  12.11  2.0
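
A runnable sketch of Option 2 follows, with two additions that are not part of the answer: the result is cast to float, and a hypothetical malformed string is passed through the same regex to show what a non-matching row looks like. The names result and bad are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"data": [
    "in 9.14 out 9.66 type 0.0",
    "in 9.67 out 9.69 type 0.0",
    "in 9.70 out 10.66 type 0.0",
    "in 10.67 out 11.34 type 2.0",
    "in 11.35 out 12.11 type 2.0",
]})

# Each named group (?P<name>...) becomes a column called `name`.
rgx = r'in\s+(?P<in>[^\s]+)\s+out\s+(?P<out>[^\s]+)\s+type\s+(?P<type>[^\s]+)'

# extract returns strings, so cast to float for numeric work.
# This cast assumes every captured token is a valid number.
result = df['data'].str.extract(rgx).astype(float)
print(result)

# A row that does not match the pattern yields NaN in every column.
# The string below is a made-up example of bad data.
bad = pd.Series(["out 1.0 in 2.0"]).str.extract(rgx)
print(bad)
```

Because the pattern fixes the order in, out, type, rows that list the words in a different order are treated as non-matching rather than being rearranged.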

If your data is separated evenly by whitespace between each name and its value, as in your sample, you can use split and the str accessor with a stride to construct the desired output:

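# Split each row on whitespace into a list of alternating names and values
df1 = df['data'].str.split()
# Odd positions hold the values; the even positions of row 0 supply the column names
df_out = pd.DataFrame(df1.str[1::2].tolist(), columns=df1[0][0::2])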

Out[1097]:
      in    out type
0   9.14   9.66  0.0
1   9.67   9.69  0.0
2   9.70  10.66  0.0
3  10.67  11.34  2.0
4  11.35  12.11  2.0
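
The split approach takes its column names from the first row only, so it quietly assumes that every row lists the same names in the same order. The sketch below is a self-contained version with a small check for that assumption and a float conversion; the check, the conversion, and the names tokens, names, values and same_layout are additions for illustration, not part of the answer.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"data": [
    "in 9.14 out 9.66 type 0.0",
    "in 9.67 out 9.69 type 0.0",
    "in 9.70 out 10.66 type 0.0",
    "in 10.67 out 11.34 type 2.0",
    "in 11.35 out 12.11 type 2.0",
]})

tokens = df['data'].str.split()   # each element is a list of tokens
names = tokens.str[0::2]          # even positions: in, out, type
values = tokens.str[1::2]         # odd positions: the numbers as strings

# Added guard: True only if every row has the same names in the same order.
same_layout = names.map(tuple).nunique() == 1
print(same_layout)

# Build the frame from the value lists, using the first row's names as headers.
df_out = pd.DataFrame(values.tolist(), columns=names.iloc[0], index=df.index)
df_out = df_out.astype(float)
print(df_out)
```

If same_layout comes out False, the regex-based options are likely the safer choice, since the stride approach would put values under the wrong headers.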
