Some of you might have noticed that this site was down all weekend. I didn't notice it until Saturday morning, at which point I was so beat from traveling last week that I didn't care. I tried to fix it for 5 minutes, then gave up figuring Keith (at KGB Internet) would be able to fix it. I sent him an e-mail asking for help - hoping he could kill a stray process or something. I didn't do any more investigating until Sunday - when it was still down. This site has been extremely stable for the last few months, so the fact that it all just stopped working really had (and still has) me puzzled.
I spent about an hour trying to get things to startup again, to no avail. The problem seemed to be related to the fact that the first time I'd try to hit Roller - it would peg the CPU and the number of database connections would start to skyrocket. Within seconds, the connection pool would become exhausted and Hibernate stack traces would begin to litter my logs. I spent hours doing this dance. At one point, I suspected a DOS attack - why else would the connections skyrocket for no reason?
While I was doing the site-fix dance, I created a read-only copy of the AppFuse Wiki at http://appfuse.org/wiki. When I tried to restart Tomcat on appfuse.org, I got stack traces from the JSPs. Upgrading to the latest stable release of JSPWiki (2.2.28) fixed the problem. However, when I'd start the "wiki" app, it would hang when I tried to request pages. Guessing that this was related to the data volume of pages (11.5 MB gzipped), I cleared out the wiki pages directory and started the wiki context with no data. That worked. I was then able to copy the pages back in and get everything back to normal.
Back to this site. I tried several things last night, including upgrading Roller to 1.2, upgrading Tomcat to 5.5.9 and upgrading the JDK to 5.0. Nothing worked, so I gave up and went to bed. This morning, I decided to try JDK 5 with Roller 1.2 and the instance of Tomcat (5.0.28) that's been working so well for the past few months. Still no dice. But I did manage to get the wiki to spit out the errors it did on appfuse.org. Remembering that this was caused by JDK 5, I went ahead and upgraded JSPWiki to 2.2.28 and did the empty directory dance to get the wiki started. Voila! - everything was fixed!
So that's my story. I still don't know the root cause, but I think the JDK 5 + JSPWiki 2.2.28 combination fixed the problem. I also suspect that something is different today than yesterday b/c the number of database connections didn't immediately spike like it did yesterday. I turned the search feature off as part of my site-fixing dance, so that's still off. I've been afraid to touch anything since I got it all working again. I'll try to turn it back on tonight. In the meantime, I need to figure out how to fix the startup problem with JSPWiki.
Update: I think I figured out the problem - particularly with JSPWiki. I was using JSPWiki's PageViewCountPlugin to track page hits. This file had grown to be just over 59 MB! Deleting the file allowed JSPWiki to start problem-free.