BigQuery - creating one hot encoded vector in SQL based on column values

I'm working on the MIMIC-IV dataset. In a task to predict mortality from hospital admission and lab test data, I'm trying to create a one-hot vector of the m most common lab kinds.

subject_id - representing patient

admission_id (hadm_id in the query) - representing a single admission

itemid - the kind of lab taken

Current query:

SELECT
  a.hadm_id,
  a.subject_id,
  l.itemid,
  gender,
  count(*) as number_of_labs,
  admission_type as type,
  admission_location as loc,
  ethnicity,
  marital_status as ms,
  anchor_age as age,
  l.itemid IN (SELECT itemid
                FROM `{labevents}` as l
                GROUP BY itemid
                ORDER BY COUNT(itemid) DESC
                LIMIT 256) AS onehot,
  MAX(hospital_expire_flag) as died
FROM
  `{admissions_table}` as a
  INNER JOIN `{patients_table}` as p ON a.subject_id = p.subject_id
  INNER JOIN `{labevents}` as l ON l.subject_id = p.subject_id
GROUP BY a.subject_id, a.hadm_id, gender, admission_type, admission_location, ethnicity, marital_status, anchor_age, l.itemid
LIMIT 20

Ideally, I want the 'onehot' column I created to hold an array representing a one-hot vector of the m most common labs (in this case m=256).

The data is credentialed-access only, so I can't share it.

One possible approach would be to build a bin template and cross join it with the lab_index.

Below is a simple example of the approach. I believe you can add a filter for the most frequent indices when building the temp table.

DECLARE bin_max INT64;
DECLARE bin_min INT64;

-- 1 to 7 subject_id with lab_index of 1:4, 2:2, 3:0, 4:1
CREATE TEMP TABLE dataset AS
SELECT 1 as subject_id, 1 as lab_index
UNION ALL SELECT 2 as subject_id, 1 as lab_index UNION ALL SELECT 3 as subject_id, 1 as lab_index  -- one hot index 0
UNION ALL SELECT 4 as subject_id, 1 as lab_index UNION ALL SELECT 5 as subject_id, 2 as lab_index  -- one hot index 1
UNION ALL SELECT 6 as subject_id, 2 as lab_index UNION ALL SELECT 7 as subject_id, 4 as lab_index  -- one hot index 3
;

SET bin_max = (SELECT MAX(lab_index) FROM dataset);
SET bin_min = (SELECT MIN(lab_index) FROM dataset);

WITH
empty_bin AS (
    SELECT *
    FROM UNNEST(GENERATE_ARRAY(0, bin_max - bin_min, 1)) AS bin
),
one_hot_index AS (
    SELECT
        subject_id, lab_index, bin, lab_index - bin_min,
        IF (lab_index - bin_min = bin, 1, 0) AS one_hot,
    FROM dataset
    CROSS JOIN empty_bin
)
SELECT
    subject_id, lab_index,
    STRING_AGG(CAST(one_hot AS STRING), "" ORDER BY bin) as one_hot,  -- <- vector format could be changed
FROM one_hot_index
GROUP BY subject_id, lab_index
ORDER BY subject_id
;
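
Sample Dataset (the rows inserted into the temp table)

subject_id  lab_index
1           1
2           1
3           1
4           1
5           2
6           2
7           4

Result (bins are ordered left to right, so one-hot index 0 is the leftmost character)

subject_id  lab_index  one_hot
1           1          1000
2           1          1000
3           1          1000
4           1          1000
5           2          0100
6           2          0100
7           4          0001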
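
To get closer to the question (a vector over only the m most common labs, returned as an array rather than a string), the same cross-join idea could be extended roughly as follows. This is only a sketch under assumptions: the `labs` rows are hypothetical toy data standing in for the lab events table, it assumes each lab row carries a `hadm_id`, and m is set to 3 instead of 256 so the output stays readable. Because an admission usually has several labs, each vector here marks every top-m lab that occurs in the admission.

```sql
WITH
-- Hypothetical toy data: one row per lab event (stand-in for the real lab events table).
labs AS (
    SELECT 1 AS hadm_id, 101 AS itemid
    UNION ALL SELECT 1, 101
    UNION ALL SELECT 1, 102
    UNION ALL SELECT 2, 101
    UNION ALL SELECT 2, 103
    UNION ALL SELECT 3, 102
    UNION ALL SELECT 3, 104
),
-- Keep the m most frequent lab kinds (m = 3 here) and give each a vector position.
-- Ties are broken by itemid so the bin assignment is deterministic.
top_items AS (
    SELECT
        itemid,
        ROW_NUMBER() OVER (ORDER BY n DESC, itemid) - 1 AS bin
    FROM (
        SELECT itemid, COUNT(*) AS n
        FROM labs
        GROUP BY itemid
        ORDER BY n DESC, itemid
        LIMIT 3
    )
),
-- One row per admission.
admissions AS (
    SELECT DISTINCT hadm_id FROM labs
),
-- One row per (admission, lab kind) pair that actually occurs.
present AS (
    SELECT DISTINCT hadm_id, itemid FROM labs
)
SELECT
    a.hadm_id,
    -- 1 if the admission has this lab kind, else 0, ordered by bin position.
    ARRAY_AGG(IF(p.itemid IS NULL, 0, 1) ORDER BY t.bin) AS onehot
FROM admissions AS a
CROSS JOIN top_items AS t
LEFT JOIN present AS p
    ON p.hadm_id = a.hadm_id AND p.itemid = t.itemid
GROUP BY a.hadm_id
ORDER BY a.hadm_id
;
```

With the toy rows above, the top three lab kinds are 101, 102 and 103 (bins 0 to 2), so the query should return [1, 1, 0] for admission 1, [1, 0, 1] for admission 2 and [0, 1, 0] for admission 3. Lab 104 falls outside the top m and is simply dropped. For the real data, `labs` would be replaced by the lab events table and the LIMIT by 256.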
