Tuesday, October 19, 2004
Everything Sucks in Large Amounts
What I want to be doing right now, instead of writing this post, is writing some C++ code. Well, let me rephrase that. I want to be writing some high-speed numerical code. I honestly gave Java and C# a really honest chance last night, but it quickly became apparent that there just aren't any decent matrix math libraries for them. That's how much I wanted to avoid C++, I was willing to deal with the performance of Java for this problem. If you're about to snicker at the fact that I would consider those solutions to be "high performance," shut up until you read the rest of this. Besides, even Java would still be a major improvement over Matlab.
There's some Matlab code we have to do a Geweke-Hajivassiliou-Keane (GHK) simulation to find the probability that an event that is distributed multivariate normally is within a given set of bounds. This is actually a pretty neat thing to do. Because the pdf of a normal distribution cannot be integrated analytically, calculating the cdf requires numerical integration. Normally, this is fine, but when you get a multivariate normal distribution (that is, a vector random variable that is jointly normally distributed, as opposed to a vector of independently normal random variables), after three dimensions it becomes impractical to do this integration. It is illegal, in economics or statistics literature, to mention this problem without also mentioning that it is called "the curse of dimensionality."
The solution, in practical terms, is to generate random variables that are distributed according to the distribution you are interested in, and then see how many of those fall in the rectangle you are interested in. As you perform this iterative process, you will converge on the probability you are interested in. The GHK algorithm is an improvement on the "crude" version I just described that uses some clever weighting to get a better estimate with fewer random draws.
So, since the Matlab code was taking 31 hours to run on the dataset of interest, and despite the fact that my optimization efforts cut that down by two thirds in Matlab, I thought I'd try to see if I could improve on that with a lower-level language.
I searched long and hard for some Java or C# matrix math libraries. I found some, of course, but nothing I felt comfortable with investing time in, because they looked too half-baked or bit-rotten, or whatever. So, I finally came to terms with the fact that I was going to have to use C++, because I knew there were many math libraries freely available for it.
The first thing I realized is that while there are some good libraries for this stuff, they're not really for the uses I was thinking of. Many of them are for creating gigantic sparse matrices, and that was just too heavy-weight. I'd never see a speed improvement for a much smaller matrix like I was working with. Eventually I invested some time in a highly lauded library called Blitz++. It promised good speed and some very complete matrix classes.
However, after I'd familiarized myself with the library and its very flexible matrix implementations, I found that it had no matrix multiply routine. This struck me as a rather strange omission, and I blew over an hour poking around because I just couldn't believe it wouldn't have it. Of course, it really didn't. They offered some very clever hack involving their tensor notation and the associated operators which in practice executes in O(n^4) time, no thank you. So, I wrote my own matrix multiply routine, no big deal, I've done it a million times.
However, what quickly became a problem was all the C++ template hackery that the Blitz++ fellows think is so clever. It isn't. It's painful. Compiling my little 150 line program takes more than half a minute, due to all that template crap. If there's a simple bug, it could very well generate 40 pages of meaningless error messages because every phase of the compiler is totally confused. Using Blitz++ drove me slowly insane, and now I don't really know what to use. There are some other math libraries I found, but none of them have the specific operations that I was hoping to get pre-written, as far as I can tell. The more complete ones seem to promise more template nonsense as well.
It makes you want to use Java or C#, is really what it does. I'm sure the performance would still have been a major improvement over the entirely interpreted (and notoriously slow) Matlab, and I might well have been done by now because I could have just written the damn code, instead of screwing around with libraries all day.
There's some Matlab code we have to do a Geweke-Hajivassiliou-Keane (GHK) simulation to find the probability that an event that is distributed multivariate normally is within a given set of bounds. This is actually a pretty neat thing to do. Because the pdf of a normal distribution cannot be integrated analytically, calculating the cdf requires numerical integration. Normally, this is fine, but when you get a multivariate normal distribution (that is, a vector random variable that is jointly normally distributed, as opposed to a vector of independently normal random variables), after three dimensions it becomes impractical to do this integration. It is illegal, in economics or statistics literature, to mention this problem without also mentioning that it is called "the curse of dimensionality."
The solution, in practical terms, is to generate random variables that are distributed according to the distribution you are interested in, and then see how many of those fall in the rectangle you are interested in. As you perform this iterative process, you will converge on the probability you are interested in. The GHK algorithm is an improvement on the "crude" version I just described that uses some clever weighting to get a better estimate with fewer random draws.
So, since the Matlab code was taking 31 hours to run on the dataset of interest, and despite the fact that my optimization efforts cut that down by two thirds in Matlab, I thought I'd try to see if I could improve on that with a lower-level language.
I searched long and hard for some Java or C# matrix math libraries. I found some, of course, but nothing I felt comfortable with investing time in, because they looked too half-baked or bit-rotten, or whatever. So, I finally came to terms with the fact that I was going to have to use C++, because I knew there were many math libraries freely available for it.
The first thing I realized is that while there are some good libraries for this stuff, they're not really for the uses I was thinking of. Many of them are for creating gigantic sparse matrices, and that was just too heavy-weight. I'd never see a speed improvement for a much smaller matrix like I was working with. Eventually I invested some time in a highly lauded library called Blitz++. It promised good speed and some very complete matrix classes.
However, after I'd familiarized myself with the library and its very flexible matrix implementations, I found that it had no matrix multiply routine. This struck me as a rather strange omission, and I blew over an hour poking around because I just couldn't believe it wouldn't have it. Of course, it really didn't. They offered some very clever hack involving their tensor notation and the associated operators which in practice executes in O(n^4) time, no thank you. So, I wrote my own matrix multiply routine, no big deal, I've done it a million times.
However, what quickly became a problem was all the C++ template hackery that the Blitz++ fellows think is so clever. It isn't. It's painful. Compiling my little 150 line program takes more than half a minute, due to all that template crap. If there's a simple bug, it could very well generate 40 pages of meaningless error messages because every phase of the compiler is totally confused. Using Blitz++ drove me slowly insane, and now I don't really know what to use. There are some other math libraries I found, but none of them have the specific operations that I was hoping to get pre-written, as far as I can tell. The more complete ones seem to promise more template nonsense as well.
It makes you want to use Java or C#, is really what it does. I'm sure the performance would still have been a major improvement over the entirely interpreted (and notoriously slow) Matlab, and I might well have been done by now because I could have just written the damn code, instead of screwing around with libraries all day.
Comments:
Post a Comment