
Extract value between first quotation marks in pandas data frame column

I have a sample data frame called df like below (the actual df has thousands of rows) where each element of column "Code" is a list (and each of these lists can have multiple elements):

[image: enter image description here]

For each row, I would like to get the first code between the quotation marks. The output for the data frame above should therefore be:

[image: enter image description here]

Initially, I thought that all the codes were 4-digit numbers, so I tried this:

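My_List = df['Code'].tolist()

Unique_Code = []
for i in range(len(My_List)):
    # Assumes each cell is a string such as "['1234', ...]":
    # skip the leading [' and keep the next four characters.
    k = My_List[i][2:6]
    Unique_Code.append(k)

df['Unique_Code'] = Unique_Code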

but this obviously works only when the code is a 4-digit number.

Could you please help me find a more efficient and universal way to solve this problem? Many thanks

# Explode the lists into one row per element, then pick the first item in each group
g = df.explode('code').groupby('id')['code'].first().to_frame()
# Then strip the inverted commas (single quotes) from each code
g['code'] = g['code'].str.strip("''")
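
For reference, here is a minimal, self-contained sketch of the explode idea above. The sample data and the column names `ID`, `Code` and `Unique_Code` are assumptions based on the question, and it assumes each `Code` cell is a real Python list rather than a string:

```python
import pandas as pd

# Hypothetical sample data: each "Code" cell is a real Python list.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Code": [["1234", "5678"], ["98765"], ["42", "7", "100"]],
})

# explode() yields one row per list element and repeats the original index,
# so grouping on the index (level=0) and taking first() keeps the first element.
df["Unique_Code"] = df["Code"].explode().groupby(level=0).first()

# The .str accessor can also index into list-valued cells directly.
df["Unique_Code_alt"] = df["Code"].str[0]

print(df)
```

If the cells are already lists, the elements carry no quotation marks, so the strip step should not be needed in that case.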

If the Code values in your data frame are strings that look like Python lists, you can use the eval() function to convert them back into list objects. This works not only for numbers; the list elements can also be strings and so on.

Try this:

import pandas as pd

data = {
    'ID': ["1", "2", "3", "4"],
    'Code': ['["435"]', '["442244"]', '["etetetet"]', '["345666"]'],
}

data_frame = pd.DataFrame(data, columns=["ID", "Code"])
for index, each_row in data_frame.iterrows():
    id_column = each_row["ID"]
    code_row = eval(each_row["Code"])[0]
    print(code_row)

Just in one line:

codes = [eval(each_code)[0] for each_code in df['Code'].tolist()]
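
Since eval() will execute any expression it is given, a more cautious variant might use ast.literal_eval from the standard library, which only accepts Python literals. The sketch below also shows a regular expression that captures the text between the first pair of quotes without parsing the list at all. The sample data and the new column names `First_Code` and `First_Code_regex` are made up for illustration, and it assumes each `Code` cell is a string:

```python
import ast

import pandas as pd

# Hypothetical sample data: each "Code" cell is a string that looks like a list.
df = pd.DataFrame({
    "ID": ["1", "2", "3", "4"],
    "Code": ['["435"]', '["442244", "17"]', "['etetetet']", '["345666"]'],
})

# Option 1: parse each string as a Python literal, then take the first element.
df["First_Code"] = df["Code"].apply(lambda s: ast.literal_eval(s)[0])

# Option 2: capture the text between the first pair of matching quotes.
# Group 0 is the quote character, group 1 is the text we want.
df["First_Code_regex"] = df["Code"].str.extract(r"""(['"])(.*?)\1""")[1]

print(df)
```

The regex version does not care how long the code is or whether single or double quotes are used, but it would give NaN for a cell that contains no quoted value.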
