September 2007 - Posts

Using the log-transform to avoid the underflow problem in computing posterior probabilities

Last month, during my science internship, I worked with latent-class models, and most of the time we used the posterior class-membership probabilities as the criterion for putting users into cohorts. At first, I implemented a Bayesian approach for computing the posteriors in Stata. It was pretty straightforward, and it seemed to work fine... until we ran into a huge dataset. I then spent roughly half an hour on the output message before tracing the error to an underflow problem in computing the posteriors. In particular, it is a problem with multiple choices (repeated observations) under local independence: the joint probability is a product of the probabilities of many seemingly independent events, and the product eventually becomes too small to be represented even by a double-precision floating-point number. Surprisingly, this problem occurs more often than I would have thought. To deal with the new dataset, I had to add some code to check for this problem, and I implemented a new algorithm that uses the log-transform when underflow/overflow is detected.
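
The idea can be sketched in a few lines. This is a minimal illustration in Python, not the Stata code from the internship; the function names and the toy numbers are made up for the example. It sums log-probabilities instead of multiplying probabilities, then subtracts the largest log-joint before exponentiating so that the normalization never sees a zero.

```python
import math


def naive_joint(prior, probs):
    """Joint probability as a plain product; underflows for long sequences."""
    p = prior
    for x in probs:
        p *= x
    return p


def posterior_from_logs(log_priors, log_likelihoods):
    """Posterior class-membership probabilities computed in log space.

    log_priors[k]      : log of the prior probability of class k
    log_likelihoods[k] : list of log-probabilities of each repeated
                         observation under class k (local independence)
    """
    # Sum of logs replaces the product of probabilities.
    log_joint = [lp + sum(ll) for lp, ll in zip(log_priors, log_likelihoods)]
    # Shift by the maximum so the largest term becomes exp(0) = 1.
    m = max(log_joint)
    weights = [math.exp(v - m) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Toy case: two classes, 1000 repeated observations, equal priors.
    probs_a = [0.2] * 1000
    probs_b = [0.2] * 999 + [0.4]
    print(naive_joint(0.5, probs_a))  # underflows to 0.0
    print(posterior_from_logs(
        [math.log(0.5), math.log(0.5)],
        [[math.log(p) for p in probs_a], [math.log(p) for p in probs_b]],
    ))
```

In the toy case the naive product is exactly zero for both classes, so the usual ratio would be 0/0, while the log-space version gives posteriors of roughly 1/3 and 2/3.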

While I did write quite a bit of documentation for my code, I had always wanted to summarize the work in a more technical way. So I wrote a short paper today.
posted by wenyang with 0 Comments

An incomplete list of not-that-frequently-used but handy tools in Linux

Lots of programs in Unix/Linux follow the philosophy of "do one thing, do it well", and they are often designed to work together to do complicated jobs. Therefore, if we know what tools are available, we can break a task down into a few steps and use the most suitable tool for each of them. Finding the right tool for the job at hand is usually well worth the time, if (that's a big if) you can easily find one and learn to use it quickly.

Some tools we use almost every day: ls, cd, cp, mv, rm, pwd, find, grep, alias, locate, history, ... I am familiar with their common usage, at least for the basic cases. For the more advanced cases, I know where to look for help quickly.

Of course, there are also powerful tools in my arsenal, such as Sed, Awk, and Emacs. I can do a lot of things with them. But there are still times when some of the following (small) tools come in handy. Unfortunately, since I don't use them as often, I tend to forget their usage and have to dig through the manual pages, which is less than ideal. So I'll try to give some typical examples of how they can be used.

cut - remove sections from each line of files
paste - merge lines of files
sort - sort lines of text files
tr - translate or delete characters
tac - concatenate and print files in reverse
wc - display a count of lines, words and characters in a file
uniq - report or filter out repeated lines in a file
......

I'll put the details in my wiki. The direct link is here.

posted by wenyang with 0 Comments

Stata scripts to automate Latent Gold processing, updated

After releasing the first version of the LG auto-processing scripts in late July, I kept making improvements to them and distributed five more major releases by the end of August. The latest version is v6.4.

The most convenient feature I added to the latest version is a new master do-file (master_lg_all_test_thepit.do), which can perform all tasks under Windows with minimal human intervention. I made use of the Windows version of Stata so that I could run LG from my Stata do-file and no longer needed to manually copy files back and forth.

There is one tricky thing, though, when the working directory is on a remote file server. To run LG from Stata, I wrote a batch file and used the shell command to execute it. The shell command invokes a command-line window (cmd.exe) to execute the command, but cmd.exe cannot start from a UNC path (such as \\some\path or //some/other/path). The shell command therefore fails to run the batch file if the current directory is a UNC path, even though such a path works fine within Stata. The solution is to change the current working directory to one on the local hard drive (something like C:\Temp) and run the batch file from there. Fortunately, we can specify full (UNC) paths in both the batch file and the LG model file (*.lgf), so that all I/O files can still be accessed via UNC paths.
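
The same workaround can be sketched outside Stata. The snippet below is only an illustration in Python, not the do-file code; the function names and the C:\Temp fallback are assumptions made for the example. It detects a UNC working directory and starts cmd.exe from a local directory instead, leaving the batch file itself to refer to its files by full path.

```python
import subprocess


def is_unc_path(path):
    r"""Return True for UNC-style paths such as \\server\share or //server/share."""
    return path.replace("/", "\\").startswith("\\\\")


def pick_working_dir(current_dir, local_fallback="C:\\Temp"):
    """Use a local directory whenever the current one is a UNC path."""
    return local_fallback if is_unc_path(current_dir) else current_dir


def run_batch(batch_file, current_dir, local_fallback="C:\\Temp"):
    """Run a batch file through cmd.exe, starting from a non-UNC directory.

    Windows only. batch_file may itself be a full UNC path; only the
    directory that cmd.exe starts in has to be local.
    """
    return subprocess.run(
        ["cmd.exe", "/c", batch_file],
        cwd=pick_working_dir(current_dir, local_fallback),
        check=True,
    )
```

Only the starting directory changes; every input and output file keeps its full UNC path, which is why the batch file and the model file need to spell those paths out.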


Other improvements over the upgrade history include:
  • Supporting Class-Independent effects and Continuous Factors in the LG model file.
  • Better support for "choice" models, in addition to the original "rating" one.
  • Handling model parameter output files (*.lst) properly when there is no intercept or more than two intercepts.
  • Adding several options to make the code flexible for various running purposes.
Updates about the SED script for processing LG output are available here.

posted by wenyang with 0 Comments