Stata scripts to automate Latent Gold processing
Today I finished writing the second version of the Latent Gold (LG) processing code (a combination of do-files, ado-files, shell scripts and Perl scripts). Compared with the first version, which I submitted last week, the new version is more flexible and powerful.
Now that I better understand how the "syntax" statement works in Stata, I have adjusted the signatures of several Stata programs to avoid hard-coded file and variable names. This makes the code easier to reuse, and the end user now has full control over file locations without digging into the code to change them manually.
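To make the idea concrete, here is a minimal sketch of a program whose signature takes the variable list and the output file from the caller instead of hard-coding them. The program name demo_export_vars, the comma-separated output format and the /tmp path are assumptions for illustration only; this is not the actual code described in this post.

```stata
* Hypothetical example, not the actual LG processing code.
capture program drop demo_export_vars
program define demo_export_vars
    version 10
    * varlist : variables to export, chosen by the caller
    * using/  : output file; the slash stores the bare path in local `using'
    * replace : optional flag, passed straight through to outsheet
    syntax varlist using/ [, replace]
    outsheet `varlist' using "`using'", comma `replace'
end

* Usage: the caller, not the program, decides what goes where.
sysuse auto, clear
demo_export_vars price mpg weight using "/tmp/lg_input.csv", replace
```

Because the file name arrives through "using", moving the data to another directory only requires changing the call, not the program.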
The code currently has six key components. The main entry point is master_lg.do, in which the variables, I/O directories and filenames are defined. It takes an input Stata data file (*.dta), exports the variables under consideration to an ASCII text file, and then calls xxx_create_lgf.ado to create the LG model file (*.lgf). These two files are then used by LG. master_lg.do also generates a post-processing script (a do-file), which is used later to analyze the LG output.
This separate post-processing step is necessary for now because Stata and our scripts run on a Linux server, while LG only runs on Windows. We have to start LG manually on a separate PC, so we cannot fully automate the process. (If our NetOps can tweak some settings to make remote execution possible, life will be easier...) For now, we need to copy the LG input/output files back and forth between the Linux server and the Windows PC.
Once the output from LG is obtained, the post-processing script calls two more ado-files:
xxx_post_lg_summary.ado, which generates a comprehensible text report from the LG batch output (*.lst) file, and xxx_post_lg_score.ado, which loads a dataset (either the input data or some new data) and computes the prior/posterior probabilities of class membership and scores. These two ado-files in turn use a shell script (reformat_lg_output.sh, currently a one-line sed statement) and a Perl script (latent_gold_reformat_cst.pl). The Perl script was originally developed by someone else; I only made minor changes to adapt it to extract the estimated coefficient matrices. Frankly, I do not know Perl well enough to write it from scratch. Left to myself, I would have written it in C++ or as sed/awk scripts.
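For readers curious how a do-file can generate another do-file, the sketch below shows one possible way using Stata's "file" commands. It is only an illustration under stated assumptions: the paths, the name post_lg.do and the "using" signature given to xxx_post_lg_summary are hypothetical, and the real master_lg.do may work differently.

```stata
* Hypothetical sketch; paths and the ado-file's call signature are assumptions.
local outdir  "/tmp"
local lstfile "`outdir'/model.lst"
local postdo  "`outdir'/post_lg.do"

* Open a text file for writing and emit the post-processing commands.
tempname fh
file open `fh' using "`postdo'", write text replace
file write `fh' "* Generated automatically; run after copying the LG output back." _n
* `lstfile' is expanded now, so the generated script contains the literal path
* (assumed here to contain no spaces).
file write `fh' "xxx_post_lg_summary using `lstfile'" _n
file close `fh'

* Inspect the generated script.
type "`postdo'"
```

Expanding the macros at generation time means the post-processing script carries its own file locations and can be run later, after the manual LG step on Windows, without redefining them.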