July 2007 - Posts

Stata scripts to automate Latent Gold processing

Today I finished writing the second version of the Latent Gold (LG) processing code (a combination of do-files, ado-files, shell scripts and Perl scripts). Compared with the first version, which I submitted last week, the new version is more flexible and powerful.

Now that I better understand how the "syntax" statement works in Stata, I have adjusted the signatures of several Stata programs to avoid hard-coded file and variable names. This makes the code easier to reuse, and the end user now has full control over where the files live without digging into the code to modify paths manually.

Currently the code includes six key components. The main entry point is master_lg.do, in which the variables, I/O directories and filenames are defined. It takes an input Stata data file (*.dta), exports the variables under consideration to an ASCII text file, and then calls xxx_create_lgf.ado to create the LG model file (*.lgf). Those two files are then used by LG. The master_lg.do file also generates a post-processing script (a do-file), which can be used later to analyze the LG output.

BTW, this split is necessary for now, as both Stata and our scripts run on a Linux server, while LG only runs on Windows. So we have to start LG manually on another PC and are thus unable to fully automate the process. (If our NetOps can tweak some settings to make remote execution possible, then our life will be easier...) Anyway, for now we need to copy the LG input/output data files back and forth between the Linux server and a Windows PC.

Once the output from LG is obtained, the post-processing script calls two more ado-files: xxx_post_lg_summary.ado, which generates a readable text report from the LG batch output (*.lst) file, and xxx_post_lg_score.ado, which loads a dataset (could be the input data or some new data!) and computes the prior/posterior probabilities of class membership and the scores. These two ado-files in turn use a shell script (reformat_lg_output.sh, which currently is a one-line sed statement) and a Perl script (latent_gold_reformat_cst.pl). The Perl script was originally developed by someone else, and I only made minor changes to adapt it to extract the estimated coefficient matrices. Frankly, I do not know Perl well enough to write it from scratch; left to myself, I would have written it in C++ or as sed/awk scripts.
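Since the copy between the two machines is manual, it can help to check that the LG listing is really back on the Linux side before the post-processing do-file is run. Here is a minimal shell sketch of such a check. The function name lg_output_ready and the assumption that the listing sits in a single directory as MODEL.lst are hypothetical and are not part of the scripts described above.

```shell
#!/bin/bash
# Hypothetical helper (not one of the six components described above).
# Usage: lg_output_ready DIR MODEL
# Succeeds only if DIR/MODEL.lst exists and is non-empty.
lg_output_ready() {
    local dir=$1 model=$2
    local lst="$dir/$model.lst"
    if [ -s "$lst" ]; then
        echo "found $lst"
    else
        echo "missing or empty: $lst" >&2
        return 1
    fi
}
```

It could be called as lg_output_ready ./lg_out 0716b_rating_5 (the directory name is made up) before starting Stata.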


posted by wenyang with 1 Comments

Latent Gold notes

(Last updated: 7/23/2007)

We are using Latent Gold (LG) version 4.5, with Latent Gold Choice as an add-in. The official website only mentions LG v4.0, but newer versions still become available from time to time. (And BTW, they have put demo versions and documents on http://www.statisticalinnovations.com/products/)

The latest version partly supports batch mode. To use it, run the following command in a command-line window:
lg45.exe FILENAME.lgf

The LG installer automatically creates a Start menu shortcut to a command-line window in which the path to the LG directory is added to the environment (the %PATH% variable). The installer has a bug, however: it fails to detect where "cmd.exe" is located. It assumes it is under C:\Windows\, which works on most WinXP machines. Unfortunately, the default Windows directory in Win2000 is C:\WINNT\, so the shortcut will not work until we fix it manually.

LG's model files use the *.lgf extension and are actually ASCII text files. One can save a model file from the GUI version of LG (lg45win.exe). The structure of this model file is not documented ... yet. (For many software companies, documentation always lags behind software development, doesn't it?) So I manually saved different versions of model files with slightly different options and compared them to work out the grammar of the lgf file. Most settings are quite straightforward. I'm not exactly sure about "chnum", but it seems we can use the same list of independent variables as in "chvar". "outsect" is also a little complicated, but finding out all its possible values and what they mean should just be a matter of time if we try every combination. Here's an example of something useful to us:

data='./0716b_rating_5.txt';
model:'0716b_rating_5'  rating 5 /
  toler=1e-008  tolem=0.01  tolran=1e-005  bayes=1  bayess2=1  bayeslat=1
  bayespoi=1
  iterem=250  iternr=50  itersv=50
  iterboot=500
  nseed=0  nseedboot=0
  nrand=10
  usemiss=No
  sewald=yes  dummy=no
 outsect=0x1c37
  outclstd  outpred  out='./0716b_rating_5.out';
chdes = 1;
dependent DEP_VAR;
replicate CASE_ID;
chvar INDEP_VAR1 INDEP_VAR2 INDEP_VARn;
chnum
INDEP_VAR1 INDEP_VAR2 INDEP_VARn;
covariate COV_VAR1 COV_VAR2 COV_VARm;
attr DEP_VAR ordinal ;
attr COV_VAR1
ordinal ;
attr COV_VAR2 ordinal ;
attr COV_VARm ordinal ;

In the above example, we have a 5-class rating model (Line 2), which uses the data file specified in Line 1. Besides letters and digits, the model name can contain '-', '_', and perhaps some other characters, provided that the name is quoted (in either single or double quotes). Lines 3 to 12 specify options corresponding to those you will see in the "Technical", "Output", and "ClassPred" tabs of a choice model dialogue. I am not sure about Line 13 (chdes), but leaving it at 1 seems to be OK. Line 14 is the dependent variable; Line 15 is the case ID (the same as in the "Variables" tab; it identifies observations that belong to the same individual); and Lines 16 and 17 (chvar and chnum) list the independent variables, separated by spaces. It seems the GUI version breaks these two statements across multiple lines so that no line is longer than 80 characters, but my tests have shown that this is not required. "covariate" defines the set of covariates. The last few lines start with "attr"; they tell LG whether the dependent variable and each covariate is nominal or ordinal.
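Because the model file is plain text, it can be produced by a script instead of being saved from the GUI. The sketch below prints a model file with the same statements as the example above, using a shell here-document. The function name write_lgf and its argument order are hypothetical. The option block is copied verbatim from the example, and every variable is declared ordinal, which is an assumption that will not fit every model. Our real generator is the Stata ado-file, not this function.

```shell
#!/bin/bash
# Hypothetical helper, for illustration only.
# Usage: write_lgf NAME NCLASS DEP ID "INDEP VARS" "COVARIATES"
# Prints a rating model in the layout of the hand-saved example.
write_lgf() {
    local name=$1 nclass=$2 dep=$3 id=$4 indep=$5 covs=$6
    # Unquoted EOF: the shell expands $name etc. inside the template.
    cat <<EOF
data='./$name.txt';
model:'$name'  rating $nclass /
  toler=1e-008  tolem=0.01  tolran=1e-005  bayes=1  bayess2=1  bayeslat=1
  bayespoi=1
  iterem=250  iternr=50  itersv=50
  iterboot=500
  nseed=0  nseedboot=0
  nrand=10
  usemiss=No
  sewald=yes  dummy=no
  outsect=0x1c37
  outclstd  outpred  out='./$name.out';
chdes = 1;
dependent $dep;
replicate $id;
chvar $indep;
chnum $indep;
covariate $covs;
attr $dep ordinal ;
EOF
    # One "attr" line per covariate (all assumed ordinal in this sketch).
    local v
    for v in $covs; do
        echo "attr $v ordinal ;"
    done
}
```

For instance, write_lgf 0716b_rating_5 5 DEP_VAR CASE_ID "INDEP_VAR1 INDEP_VAR2" "COV_VAR1 COV_VAR2" > 0716b_rating_5.lgf would write a file with the same structure as the example.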

The default output of LG batch mode includes a text file named "*.lst", where "*" is the same name as that of the model file "*.lgf". This file contains most of the useful information, such as the estimated coefficients, some performance measures of the model (such as LL, AIC, BIC, BIC3), and the posterior probabilities of each class for each individual (if we choose to output them). Another output file is created if the model file contains an "out='xxxxx';" statement. It may include prior or posterior classification probabilities and individual coefficients, if such output is specified in the model file.
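If only a single performance measure is needed, a short awk filter may be enough. The sketch below assumes that the listing stores each statistic on its own line as a label, a tab, and a value. That layout is an assumption and should be checked against a real *.lst file. The function name lst_stat is hypothetical, and the label "AIC (based on LL)" is the one matched by the sed command in this post.

```shell
#!/bin/bash
# Hypothetical helper. Usage: lst_stat FILE LABEL
# Prints the second tab-separated field of every line whose first field
# is exactly LABEL.
# ASSUMPTION: statistics are stored as LABEL<TAB>VALUE in the listing.
lst_stat() {
    awk -F'\t' -v label="$2" '$1 == label { print $2 }' "$1"
}
```

A call such as lst_stat 0716b_rating_5.lst 'AIC (based on LL)' would then print one value per matching line in the listing.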


I use the following (one-line) sed statement to extract some useful information from the "*.lst" file. The output is redirected to a file, which I can then open in Excel, format a bit, and send to my boss for discussion.

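#!/bin/bash
# Requires GNU sed (\t and \+ in patterns). There is deliberately no -n: the last expression
# deletes every line outside the Parameters..Importance block, so anything else shows up only where an earlier expression prints it.
sed -e '1,5p' -e '6,/^[^\t]\+\t\t/{s/^[^\t]\+\t\t//;p}' -e '/ = /,/AIC (based on LL)/{/ = /{x;p;p;x};/Log-prior/d;/Log-posterior/d;/^Chi-squared Statistics/,/^\t\{6,\}/d;p}' -e '/^Profile/{n;n;p}' -e '/^Parameters/,/^Importance/!d;/^Parameters/{x;p;x};/^Importance/d' "$@"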
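The one-liner is easier to follow once its main idiom is isolated. A toy version, using made-up input rather than a real listing, is sketched below: lines are printed explicitly with p, a trailing "!d" deletes everything outside one block, and x;p;x prints the (empty) hold space to get a blank separator line. The function name keep_block is hypothetical.

```shell
#!/bin/bash
# Toy version of the idiom used in the one-liner.
# Reads stdin, or the files given as arguments.
keep_block() {
    # 1p         : echo line 1 explicitly, before it is deleted below
    # range + !d : delete every line outside Parameters..Importance
    # x;p;x      : print the empty hold space, i.e. a blank line,
    #              just before the "Parameters" header
    # last -e    : drop the closing "Importance" line itself
    sed -e '1p' \
        -e '/^Parameters/,/^Importance/!d' \
        -e '/^Parameters/{x;p;x;}' \
        -e '/^Importance/d' "$@"
}
```

Fed the six lines title, junk, Parameters, a 1, Importance, tail, it prints title, a blank line, Parameters and a 1.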


posted by wenyang with 0 Comments

Internship; Stata; Latent Gold

Early this month I became a "science intern" at a company known for its leading role in providing personalization solutions (such as recommendations of products, TV programs, movie DVDs, music, etc.). So far most of my work involves developing models for recommending movies, estimating these models, and finding the best one. Another task is to write small programs to automate most of the above processes so that we can use them for other modeling work.

Most of the analyses for model development are done in Stata. I had not used it before, but it is not difficult to learn. My colleagues gave me some online tutorials, which are quite useful for beginners like me. Stata also has an online help system similar to the one in Matlab. With these documents and some examples I am able to read others' code and write my own. BTW, the only tricky thing for a beginner is to understand that in Stata "variable" and "macro" do not mean what they mean in languages I am familiar with, such as C/C++. In Stata, "variables" are the fields or columns of a table (as in a relational database), while "macros" are what C/C++ would call variables: names used as aliases or handles to refer to something else.

To see whether consumers can be classified into different market segments, we also try to estimate cohort-specific coefficients. For this purpose we use Latent Gold, a software package for latent class and finite mixture modeling developed by Statistical Innovations Inc.



posted by wenyang with 0 Comments