
Parallel processing or optimization of latent class analysis

I am using the poLCA package to run latent class analysis (LCA) on a dataset with 450,000 observations and 114 variables. As with most latent class analyses, I will need to run this multiple rounds for different numbers of classes. Each run takes about 12-20 hours depending on the number of classes selected.

Is there a way for me to utilize parallel processing to run this more efficiently? Otherwise, are there other ways to optimize this?

#Converting binary variables to 1 and 2
lca_dat1=lca_dat1+1
#Formula for LCA
f<-cbind(Abdominal_hernia,Abdominal_pain,
     Acute_and_unspecified_renal_failure,Acute_cerebrovascular_disease,
     Acute_myocardial_infarction,Administrative_social_admission,
     Allergic_reactions,Anal_and_rectal_conditions,
     Anxiety_disorders,Appendicitis_and_other_appendiceal_conditions,
     Asthma,Bacterial_infection_unspecified_site,
     Biliary_tract_disease,Calculus_of_urinary_tract,
     Cancer_of_breast,Cardiac_dysrhythmias,
     Cataract,Chronic_obstructive_pulmonary_disease_and_bronchiectasis,
     Chronic_renal_failure,Chronic_ulcer_of_skin,
     Coagulation_and_hemorrhagic_disorders,Coma_stupor_and_brain_damage,
     Complication_of_device_implant_or_graft,Complications_of_surgical_procedures_or_medical_care,
     Conditions_associated_with_dizziness_or_vertigo,Congestive_heart_failure_nonhypertensive,
     Coronary_atherosclerosis_and_other_heart_disease,Crushing_injury_or_internal_injury,
     Deficiency_and_other_anemia,Delirium_dementia_and_amnestic_and_other_cognitive_disorders,
     Disorders_of_lipid_metabolism,Disorders_of_teeth_and_jaw,
     Diverticulosis_and_diverticulitis,E_Codes_Adverse_effects_of_medical_care,
     E_Codes_Adverse_effects_of_medical_drugs,E_Codes_Fall,
     Epilepsy_convulsions,Esophageal_disorders,
     Essential_hypertension,Fever_of_unknown_origin,
     Fluid_and_electrolyte_disorders,Fracture_of_lower_limb,
     Fracture_of_upper_limb,Gastritis_and_duodenitis,
     Gastroduodenal_ulcer_except_hemorrhage,Gastrointestinal_hemorrhage,
     Genitourinary_symptoms_and_illdefined_conditions,Gout_and_other_crystal_arthropathies,
     Headache_including_migraine,Heart_valve_disorders,
     Hemorrhoids,Hepatitis,Hyperplasia_of_prostate,
     Immunizations_and_screening_for_infectious_disease,
     Inflammation_infection_of_eye_except_that_caused_by_tuberculosis_or_sexually_transmitteddisease,Inflammatory_diseases_of_female_pelvic_organs,
     Intestinal_infection,Intracranial_injury,
     Joint_disorders_and_dislocations_traumarelated,Late_effects_of_cerebrovascular_disease,
     Medical_examination_evaluation,Menstrual_disorders,
     Mood_disorders,Nausea_and_vomiting,
     Neoplasms_of_unspecified_nature_or_uncertain_behavior,Nephritis_nephrosis_renal_sclerosis,
     Noninfectious_gastroenteritis,Nonspecific_chest_pain,
     Nutritional_deficiencies,Open_wounds_of_extremities,
     Open_wounds_of_head_neck_and_trunk,Osteoarthritis,
     Other_aftercare,Other_and_unspecified_benign_neoplasm,
     Other_circulatory_disease,
     Other_connective_tissue_disease,
     Other_diseases_of_bladder_and_urethra,Other_diseases_of_kidney_and_ureters,
     Other_disorders_of_stomach_and_duodenum,Other_ear_and_sense_organ_disorders,
     Other_endocrine_disorders,Other_eye_disorders,
     Other_female_genital_disorders,Other_fractures,
     Other_gastrointestinal_disorders,Other_infections_including_parasitic,
     Other_injuries_and_conditions_due_to_external_causes,Other_liver_diseases,
     Other_lower_respiratory_disease,Other_nervous_system_disorders,
     Other_nontraumatic_joint_disorders,Other_nutritional_endocrine_and_metabolic_disorders,
     Other_screening_for_suspected_conditions_not_mental_disorders_or_infectious_disease,
     Other_skin_disorders,Other_upper_respiratory_disease,
     Other_upper_respiratory_infections,Paralysis,
     Pleurisy_pneumothorax_pulmonary_collapse,Pneumonia_except_that_caused_by_tuberculosis_or_sexually_transmitted_disease,
     Poisoning_by_other_medications_and_drugs,Respiratory_failure_insufficiency_arrest_adult,
     Retinal_detachments_defects_vascular_occlusion_and_retinopathy,Screening_and_history_of_mental_health_and_substance_abuse_codes,
     Secondary_malignancies,Septicemia_except_in_labor,
     Skin_and_subcutaneous_tissue_infections,Spondylosis_intervertebral_disc_disorders_other_back_problems,
     Sprains_and_strains,Superficial_injury_contusion,
     Syncope,Thyroid_disorders,Urinary_tract_infections)~1
#LCA for 1 class
lca1<-poLCA(f,lca_dat1,nclass=1,maxiter=3000,tol=1e-7,graphs=FALSE,nrep=5)
#LCA for 2 classes
lca2<-poLCA(f,lca_dat1,nclass=2,maxiter=3000,tol=1e-7,graphs=TRUE,nrep=5)
##Extract maximum posterior probability
posterior_lca2=as.data.frame(lca2$posterior)
posterior_lca2$max_pos=apply(posterior_lca2,1,max)
##Check number of maximum posterior probability that falls above 0.7
table(posterior_lca2$max_pos>0.7)
#LCA for 3 classes
lca3<-poLCA(f,lca_dat1,nclass=3,maxiter=3000,tol=1e-7,graphs=TRUE,nrep=5)
##Extract maximum posterior probability
posterior_lca3=as.data.frame(lca3$posterior)
posterior_lca3$max_pos=apply(posterior_lca3,1,max)
##Check number of maximum posterior probability that falls above 0.7
table(posterior_lca3$max_pos>0.7)
...

You can create a list with the different configurations you want to use. Then use either one of the *apply functions from the parallel package or %dopar% from foreach. Which parallel backend you can/should use depends on your OS.

Here is an example with foreach:

library(foreach)
library(doParallel)
registerDoSEQ() # placeholder: runs sequentially; the proper parallel backend (e.g. doParallel) depends on the OS
foreach(nclass = 1:10) %dopar% { 
  # do something with nclass
  sqrt(nclass)
}
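
Applied to the question's model, the same pattern might look like the following minimal sketch (an illustration only, assuming the formula f and data lca_dat1 from the question; a PSOCK cluster from makeCluster() works on Windows, macOS and Linux):

library(poLCA)
library(foreach)
library(doParallel)

cl <- makeCluster(parallel::detectCores() - 1)  # leave one core free for the OS
registerDoParallel(cl)

lca_models <- foreach(nclass = 1:10, .packages = "poLCA") %dopar% {
  # f and lca_dat1 are normally picked up from the calling environment;
  # add .export = c("f", "lca_dat1") to the foreach() call if they are not found
  poLCA(f, lca_dat1, nclass = nclass, maxiter = 3000,
        tol = 1e-7, graphs = FALSE, nrep = 5, verbose = FALSE)
}

stopCluster(cl)

sapply(lca_models, function(m) m$bic)  # compare fit across the candidate numbers of classes

Keep in mind that each worker holds its own copy of the 450,000-row dataset, so memory use grows with the number of workers.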

Here are my not-too-brief, not-too-compact thoughts on this. They are less than exact. I have never used anywhere near this many manifest variables with poLCA, and I think you may be breaking some interesting computational ground by doing so. I use poLCA to predict electoral outcomes per voter (red, blue, purple). I can be wrong on that and not suffer a malpractice suit. I really don't know about the risk of using LCA in health analysis. I think of LCA as more of a social sciences tool. I could be wrong about that as well. Anyway:

(1) I believe you want to look for the most "parsimonious" set of manifest variables that produces a latent class, and limit them to a reduced subset that proves the most useful across all your data. That will help with CPU time. I have personally found that using manifests that are exceptionally "monotonic" is not necessarily (by default) a good thing, although experimenting with more or less "monotonic" variables certainly tells you something about your model.

I have found it more "machine learning" friendly/responsible to use the most widespread manifests and "sample split" the data into groups, recombining the posteriors after the LCA runs (a rough sketch of this idea follows below). This assumes that the most widespread variables affect the different subgroups quantitatively, but with variance across the sample groups (e.g. red, blue, purple). I don't know that anyone else does this, but I gave up trying to build the "one LCA model that rules them all" from voter-database information. That didn't work.
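
Here is a rough sketch of that split / fit / recombine idea, for illustration only; it assumes a grouping column grp has been added to the data and that f is the formula from the question:

library(poLCA)

fits <- lapply(split(lca_dat1, lca_dat1$grp), function(d) {
  # fit the same measurement model separately within each subgroup
  poLCA(f, d, nclass = 3, maxiter = 3000, tol = 1e-7,
        graphs = FALSE, nrep = 5, verbose = FALSE)
})

# recombine the posterior class-membership probabilities across subgroups
# (rows come back grouped by subgroup, not in the original row order)
posteriors <- do.call(rbind, lapply(fits, function(m) m$posterior))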

(2) The poLCA library (like most latent class analysis software) depends upon matrix multiplication. I have found poLCA more CPU-bound than memory-bound, but with 114 manifests you may experience bottlenecks at every nook and cranny of your motherboard. Whatever you can do to increase matrix multiplication efficiency helps. I believe I have found that Microsoft Open R's use of Intel's MKL is more efficient than the default CRAN numeric library. Sorry, I haven't completely tested that, nor do I understand why some numeric libraries might be more efficient than others for matrix multiplication. I only know that Microsoft Open R brags about this some, and it appears to me they have a point with MKL.
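
If you want to check which BLAS/LAPACK build a given R session is linked against (the reference BLAS that ships with CRAN R versus an optimized library such as MKL or OpenBLAS), sessionInfo() reports it:

sessionInfo()  # the "Matrix products" / BLAS / LAPACK lines show the linked numeric libraries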

(3) Reworking your LCA code around Matt Dowle's data.table library shows me efficiencies across the board on all my work. I create 'dat' as a data.table and run iterations to find the best-optimized data.table function for poLCA and the posteriors. Combining data.table efficiency with some of Hadley Wickham's improved *ply functions (the plyr library), which put the LCA runs into lists, works well for me:

rbindlist(plyr::llply(1:10,check_lc_pc)) # check_lc_pc is a wrapper function around poLCA (not shown)
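
For reference, a minimal sketch of what such a wrapper might look like (the exact fit statistics returned are just an illustration), assuming f is the formula and dat is the data.table mentioned above:

library(data.table)
library(poLCA)

check_lc_pc <- function(nclass) {
  lc <- poLCA(f, dat, nclass = nclass, maxiter = 3000, tol = 1e-7,
              graphs = FALSE, nrep = 5, verbose = FALSE)
  # one row of fit statistics per candidate number of classes
  data.table(nclass = nclass, llik = lc$llik, aic = lc$aic, bic = lc$bic)
}

The rbindlist(plyr::llply(...)) call above then stacks those rows into one table.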

(4) This is a simple tip (maybe even a condescending one), but you don't need poLCA to print all of the standard error output once you are satisfied with your model, so set verbose = FALSE. Also, by making regular test runs, I can determine the starting values ('probs.start') that work best for my model and reuse them in later runs:

lc <- poLCA(f,dat,nrep=1,probs.start=probs.start.new,verbose=FALSE)
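
probs.start.new is not shown being created above; one way to obtain it (for illustration, based on the components a fitted poLCA object returns) is to reuse the starting values behind the best of several exploratory restarts:

lc_explore <- poLCA(f, dat, nclass = 3, maxiter = 3000, tol = 1e-7,
                    nrep = 5, verbose = FALSE)
# starting values used for the restart that reached the best log-likelihood
probs.start.new <- lc_explore$probs.start

The matching nclass then needs to be passed along with probs.start.new in the later call.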

poLCA produces a lot of output to the screen by default. Wrapping the poLCA call in your own function with verbose = FALSE, on a byte-compiled R (3.5 or later), cuts down that output overhead.

(5) I use Windows 10, and because of a fast SSD, fast DDR, and Microsoft "memory compression", I think I notice that the Windows 10 OS adapts to the LCA runs with a lot of "memory compression". I assume it is holding the same matrices in compressed memory because I am calling them repeatedly over time. I really like the Kaby Lake processors that "self over-clock"; I see my 7700HQ processor taking advantage of that during LCA runs. (It would seem that LCA runs would benefit from overclocking. I don't like to overclock my processor on my own; that is too much risk for me.) I think it is useful to monitor memory use of your LCA runs from another R console with system calls to PowerShell and cmd memory-management functions. The one below lists the hidden "Memory Compression" process(!!):

# Windows only: shell out to PowerShell to report memory use of the R GUI process
# and the hidden "Memory Compression" process
ps_f <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
$t2 = $t1 | Select {
 $_.Id;
 [math]::Round($_.WorkingSet64/1MB);
 [math]::Round($_.PrivateMemorySize64/1MB);
 [math]::Round($_.VirtualMemorySize64/1MB) };
$t2 | ft * "); }
ps_all <- function() {ps();ps_e();ps_f();}  # ps() and ps_e() are other monitoring helpers of mine, not shown here

I also have this memory-reporting function for the R session used for the LCA runs, though of course it has to run before or after the run itself:

memory <- function() {
  as.matrix(list(
    paste0(shell('systeminfo | findstr "Memory"')),  # Windows: physical and virtual memory summary
    paste0("R Memory size (malloc) available: ",memory.size(TRUE)," MB"),
    paste0("R Memory size (malloc) in use: ",memory.size()," MB"),
    paste0("R Memory limit (total alloc): ",memory.limit()," MB")
  ))
}

There is work on the optimization functions used for latent class analysis. I will post a link here, though I don't think it helps us today as users of poLCA or LCA: http://www.mat.univie.ac.at/~neum/ms/fuchs-coap11.pdf . But maybe the discussion is good background. There is nothing simple about poLCA. This document by the developers, http://www.sscnet.ucla.edu/polisci/faculty/lewis/pdf/poLCA-JSS-final.pdf , is worth reading at least twice!

If anyone else has any thoughts on optimizing poLCA or LCA, I would appreciate further discussion as well. Once I started predicting voter outcomes for an entire state as opposed to my county, I had to think about optimization and the limits of poLCA and LCA/LCR.

Nowadays, there is a parallelized C++-based implementation of poLCA, named poLCAParallel, at https://github.com/QMUL/poLCAParallel . For me, it was much, much faster than the base package.
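
A minimal, untested sketch of switching over, assuming the package keeps poLCA's drop-in interface as its README describes (check the repository for the exact installation command and any extra threading options):

# install once from GitHub, e.g. via the remotes package (see the repository README)
# remotes::install_github("QMUL/poLCAParallel")
library(poLCAParallel)

lca2 <- poLCAParallel::poLCA(f, lca_dat1, nclass = 2, maxiter = 3000,
                             tol = 1e-7, nrep = 5)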
