
Pandas substring DataFrame column

I have a pandas DataFrame with a column called positions that contains string values in the following format:

[{'y': 49, 'x': 44}, {'y': 78, 'x': 31}]
[{'y': 1, 'x': 63}, {'y': 0, 'x': 23}]
[{'y': 54, 'x': 9}, {'y': 78, 'x': 3}]

I want to create four new columns in my pandas DataFrame, y_start, x_start, y_end, and x_end, that contain only the extracted numbers.

E.g. for the first row, my new columns would have the following values:

y_start = 49
x_start = 44
y_end = 78
x_end = 31

To summarise, I am looking to extract just the first, second, third, and fourth occurrences of numbers and save these to individual columns.

import pandas as pd
from ast import literal_eval

# dataframe
data = {'data': ["[{'y': 49, 'x': 44}, {'y': 78, 'x': 31}]", "[{'y': 1, 'x': 63}, {'y': 0, 'x': 23}]", "[{'y': 54, 'x': 9}, {'y': 78, 'x': 3}]"]}

df = pd.DataFrame(data)

# convert the strings in the data column to lists of dicts
df.data = df.data.apply(literal_eval)

# split each two-element list into separate start and end columns
df[['start', 'end']] = pd.DataFrame(df.data.tolist(), index=df.index)

# use json_normalize to convert the dicts to separate columns and join the dataframes with concat
cleaned = pd.concat([pd.json_normalize(df.start).rename(lambda x: f'{x}_start', axis=1), pd.json_normalize(df.end).rename(lambda x: f'{x}_end', axis=1)], axis=1)

# display(cleaned)
   y_start  x_start  y_end  x_end
0       49       44     78     31
1        1       63      0     23
2       54        9     78      3

Convert the strings to Python objects:

import ast
df['positions'] = df['positions'].apply(ast.literal_eval)
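
As a quick check of what this conversion does, here is a minimal sketch on a single value copied from the question; the names raw and parsed are only illustrative. Note that json.loads would reject these strings because they use single quotes, which is one reason to reach for ast.literal_eval here.

```python
import ast

# One raw cell value, copied from the question.
raw = "[{'y': 49, 'x': 44}, {'y': 78, 'x': 31}]"

# literal_eval safely evaluates a string containing a Python literal.
parsed = ast.literal_eval(raw)

print(type(raw).__name__, '->', type(parsed).__name__)  # str -> list
print(parsed[0]['y'], parsed[1]['x'])                   # 49 31
```

After the conversion each cell is a list of two dicts, which is the shape the approaches below rely on.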

This is one way:

df1 = pd.DataFrame.from_records(pd.DataFrame.from_records(df.positions)[0]).rename(columns={"x":"x_start", "y":"y_start"})    
df2 = pd.DataFrame.from_records(pd.DataFrame.from_records(df.positions)[1]).rename(columns={"x":"x_end", "y":"y_end"})
df_new = pd.concat([df1, df2], axis=1)

Another, a little more concise:

df1 = pd.DataFrame(df.positions.to_list())[0].apply(pd.Series).rename(columns={"x":"x_start", "y":"y_start"})
df2 = pd.DataFrame(df.positions.to_list())[1].apply(pd.Series).rename(columns={"x":"x_end", "y":"y_end"})
df_new = pd.concat([df1, df2], axis=1)

I don't know offhand how these methods compare in time or memory performance.

Output (either method):

   y_start  x_start  y_end  x_end
0       49       44     78     31
1        1       63      0     23
2       54        9     78      3
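
One way to settle the performance question is to measure it. The sketch below times the apply(pd.Series) variant against a variation that passes the lists of dicts straight to the pd.DataFrame constructor. The function names and the 600-row test frame are assumptions made for illustration, and timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

# Sample rows from the question, repeated so the timings are measurable.
rows = [
    [{'y': 49, 'x': 44}, {'y': 78, 'x': 31}],
    [{'y': 1, 'x': 63}, {'y': 0, 'x': 23}],
    [{'y': 54, 'x': 9}, {'y': 78, 'x': 3}],
]
df = pd.DataFrame({'positions': rows * 200})

START = {"x": "x_start", "y": "y_start"}
END = {"x": "x_end", "y": "y_end"}

def with_apply_series(frame):
    # Expand each dict into a row with apply(pd.Series).
    pairs = pd.DataFrame(frame.positions.to_list())
    df1 = pairs[0].apply(pd.Series).rename(columns=START)
    df2 = pairs[1].apply(pd.Series).rename(columns=END)
    return pd.concat([df1, df2], axis=1)

def with_constructor(frame):
    # Variation (assumption): let the DataFrame constructor expand the dicts.
    pairs = pd.DataFrame(frame.positions.to_list())
    df1 = pd.DataFrame(pairs[0].to_list()).rename(columns=START)
    df2 = pd.DataFrame(pairs[1].to_list()).rename(columns=END)
    return pd.concat([df1, df2], axis=1)

for func in (with_apply_series, with_constructor):
    seconds = timeit.timeit(lambda: func(df), number=3)
    print(f"{func.__name__}: {seconds:.3f} s for 3 runs")
```

Because apply(pd.Series) builds one Series object per row, it tends to be the slower of the two on larger frames, but it is worth running this on your own data before drawing conclusions.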

Not very clean, but a working approach is to write a custom function and apply it with a lambda, assuming that all your rows follow the same pattern as in your question:

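### custom function
def startEndxy(x):
    # splitting on ':' leaves each piece after the first starting with one of the four numbers
    x = x.split(':')
    # cut each piece at the delimiter after its number; int() ignores the surrounding spaces
    return int(x[1].split(',')[0]), int(x[2].split('},')[0]), int(x[3].split(',')[0]), int(x[4].split('}')[0])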


### columns creations
df['y_start'] = df['positions'].apply(lambda x: startEndxy(x)[0])
df['x_start'] = df['positions'].apply(lambda x: startEndxy(x)[1])
df['y_end'] = df['positions'].apply(lambda x: startEndxy(x)[2])
df['x_end'] = df['positions'].apply(lambda x: startEndxy(x)[3])

This adds the four columns y_start, x_start, y_end, and x_end to df.
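
If you prefer to stay with plain string handling, a regular expression is a less brittle alternative to chained split calls. The following is a minimal sketch, assuming every row contains exactly four non-negative integers in the order y, x, y, x; the pattern would need to become something like -?\d+ if negative values can occur.

```python
import pandas as pd

df = pd.DataFrame({'positions': [
    "[{'y': 49, 'x': 44}, {'y': 78, 'x': 31}]",
    "[{'y': 1, 'x': 63}, {'y': 0, 'x': 23}]",
    "[{'y': 54, 'x': 9}, {'y': 78, 'x': 3}]",
]})

cols = ['y_start', 'x_start', 'y_end', 'x_end']

# findall returns, for each row, the list of all digit runs in order of appearance.
numbers = df['positions'].str.findall(r'\d+')

# Spread each four-element list across the new columns and convert to integers.
extracted = pd.DataFrame(numbers.tolist(), index=df.index, columns=cols).astype(int)
df = df.join(extracted)

print(df[cols])
```

This matches the original request most literally: it takes the first, second, third, and fourth numbers found in each string without parsing the structure around them.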

First, reconstruct your series:

df = pd.DataFrame(df['positions'].tolist()).rename(columns={0: 'starts', 1:'ends'})

              starts               ends
0  {'y': 54, 'x': 9}  {'y': 78, 'x': 3}
1  {'y': 1, 'x': 63}  {'y': 0, 'x': 23}
2  {'y': 54, 'x': 9}  {'y': 78, 'x': 3}

Then build the start and end columns:

starts = pd.DataFrame(df['starts'].tolist()).rename(columns={'y': 'y_start', 'x': 'x_start'})
ends = pd.DataFrame(df['ends'].tolist()).rename(columns={'y': 'y_end', 'x': 'x_end'})

df = pd.concat([starts, ends], axis=1)

   y_start  x_start  y_end  x_end
0       54        9     78      3
1        1       63      0     23
2       54        9     78      3
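
Putting the parsing and the reshaping together, a small helper might look like the sketch below. The function expand_positions is hypothetical (it is not part of pandas), and it assumes every row holds a string describing exactly two dicts with the same keys.

```python
import ast
import pandas as pd

def expand_positions(df, column='positions'):
    """Return a frame with <key>_start and <key>_end columns (hypothetical helper)."""
    # Turn each string into a list of two dicts.
    parsed = df[column].apply(ast.literal_eval)
    # Build one frame per list position and label its columns with a suffix.
    starts = pd.DataFrame([p[0] for p in parsed], index=df.index).add_suffix('_start')
    ends = pd.DataFrame([p[1] for p in parsed], index=df.index).add_suffix('_end')
    return pd.concat([starts, ends], axis=1)

df = pd.DataFrame({'positions': [
    "[{'y': 49, 'x': 44}, {'y': 78, 'x': 31}]",
    "[{'y': 1, 'x': 63}, {'y': 0, 'x': 23}]",
    "[{'y': 54, 'x': 9}, {'y': 78, 'x': 3}]",
]})

result = expand_positions(df)
print(result)
```

Using add_suffix avoids spelling out a rename mapping for each key, so the helper keeps working if the dicts gain further keys.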
