Sunday, March 24, 2013

High Kernel CPU Usage - Grrr!

My poor old desktop machine, Sleipnir, is much abused and overloaded. It's maxed out, with 3 GB of RAM (max usable for 32-bit XP), a 300 GB IDE C: drive, and 1.5 TB and 2 TB drives that my SageTV software uses for DVD images and recorded TV shows, respectively.

For a few weeks now, the poor old thing has been dragging her feet. Everything was slow; menus would take many seconds, even a minute, to appear, programs were slow to load, and once RAM was fully committed, any switching of programs that involved the swap file - and with Firefox's memory leaks, that usually didn't take long to occur - was painful.

I didn't think too much of it; it's well known that Windows machines degrade over time. I've always put it down to registry rot, coupled with Microsoft's unholy alliance with hardware manufacturers that gives them an incentive to drive users to replace their computers frequently.

But it got to be a major Pain In The Ass. My work was slowing down; Eclipse was dragging along and even simple edits were getting to be painful. Worse still, TV recordings were becoming corrupted. Sleipnir contains three TV tuners, and we rely on the SageTV software to automatically record TV shows so that we can watch them at a convenient time. Downstairs, our main TV has a Sage HD-300 extender which allows us to view recordings or live TV, and we count heavily on this to watch our favourite shows when our workload allows. In fact, the TV won't work without it as there is no external antenna and simple rabbit ears don't get a usable signal in that location - my upstairs office has much better reception.

However, now both recorded and "live" TV were jittering, dropping out and downright corrupted. At the least, there were occasional ear-shattering chirps; at worst, shows were just unwatchable. The pressure was on to either replace the computer or get the problem fixed.

So I did a little hunting around. Sleipnir is so heavily loaded that I routinely run the Task Manager to keep an eye on it, and it was already obvious that the CPU Usage display was showing most of the time spent in the kernel. At the same time, the hard drive activity light was solidly on. Hmmm. Disk activity involving lots of CPU? That shouldn't be happening. (You have to imagine me stroking my chin thoughtfully at this point.) Usually, disk I/O is handled by the DMA controller, which transfers sectors (or more) directly from the disk controller buffers into main memory with no CPU intervention. The CPU hasn't been involved since the good old days of ...

PIO! Programmed I/O - where the processor itself enters a loop to transfer data, word by word (it used to be byte by byte, in the old days) from the disk controller into main memory.
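
To get a feel for why that hurts, here is a toy Python sketch of a PIO-style read. It is not real driver code: the function name, the fake "data port" list and the 512-byte sector with 16-bit words are illustrative assumptions. The point is simply that the processor performs one read for every word transferred.

```python
import struct

SECTOR_BYTES = 512  # assumed classic sector size
WORD_BYTES = 2      # assumed 16-bit transfers, one word per port read


def pio_read_sector(data_port_words):
    """Toy PIO read: the 'CPU' pulls one word at a time into the buffer."""
    buffer = bytearray()
    cpu_reads = 0
    for word in data_port_words:
        buffer += struct.pack("<H", word)  # store the word little-endian
        cpu_reads += 1                     # one unit of CPU work per word
    return bytes(buffer), cpu_reads


# Hypothetical sector contents: just the numbers 0..255 as 16-bit words.
sector_words = list(range(SECTOR_BYTES // WORD_BYTES))
data, cpu_reads = pio_read_sector(sector_words)
print(len(data), cpu_reads)  # 512 256
```

With DMA, that loop never runs on the CPU at all; the controller moves the whole block and the processor is only told when the transfer has finished.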

Could it be? Opening Device Manager (from within the "My Computer" properties) and examining the "Primary IDE Channel" properties, "Advanced Settings" tab soon revealed that yes, indeed! - the "Current Transfer Mode" was set to "PIO" rather than the expected "Ultra DMA Mode 5". It turns out that if Windows experiences six or more CRC (Cyclic Redundancy Check) errors while reading a drive, it degrades the DMA mode setting, eventually getting to zero and then reverting to PIO mode. This won't actually help anything - the problem is with the disk drive, not the controller - but of course, the CPU is now having to work hard during disk transfers and it slows everything right down.
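
The step-down behaviour described above can be sketched as a small Python model. This is only a guess at the shape of the logic, not the actual Windows driver: the class name, the list of modes, and the idea that the error counter resets after each downgrade are all assumptions made for illustration.

```python
# Toy model only. The mode list, the threshold handling and the counter
# reset after each step are assumptions, not the real driver's logic.
MODES = [f"Ultra DMA Mode {n}" for n in range(5, -1, -1)] + ["PIO"]
ERROR_THRESHOLD = 6  # the "six CRC errors" from the text


class IdeChannel:
    def __init__(self):
        self.mode_index = 0  # start at the fastest mode
        self.errors = 0

    @property
    def mode(self):
        return MODES[self.mode_index]

    def record_crc_error(self):
        """Count an error; step down one mode each time the threshold is hit."""
        self.errors += 1
        at_slowest = self.mode_index == len(MODES) - 1
        if self.errors >= ERROR_THRESHOLD and not at_slowest:
            self.mode_index += 1
            self.errors = 0  # assumed: counting starts again after a step


channel = IdeChannel()
for _ in range(6):
    channel.record_crc_error()
print(channel.mode)  # Ultra DMA Mode 4
```

In this model a channel that keeps throwing errors walks all the way down to "PIO" and stays there, which matches the symptom: nothing ever steps the mode back up on its own.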

IDE Properties - if the "Current Transfer Mode" is "PIO" you're in trouble.

Simply setting the "Transfer Mode" to "DMA if available" won't reset things. Rather, you have to click on the "Driver" tab and - yes, this is correct - uninstall the driver. This is a considerable leap of faith, especially considering that this is Windows we're talking about here, people. In fact, you have to uninstall the driver on all IDE channels, and then reboot.

On rebooting, Windows will produce "New hardware discovered" messages and will reinstall the drivers. It did for me, and it should for you, too. If you haven't uninstalled the driver on all channels, then you'll probably find it's still running in PIO mode on the problematic channel. If you have uninstalled the driver on all channels, you might have to reboot yet again.

If this works for you, you should be back on the air with a decently-performing machine. It certainly worked, in my case. However, it's probably only a matter of time before the problem arises again - if there were six CRC errors on a drive, it may well be failing. In my case, I have a spare 320 GB IDE drive on the shelf - being a hardware hacker, I have spares for most things - and so I'll take care to back up anything vital and swap drives when I get time. All my work is stored on a server with a RAID array and offsite backup, or backed up to multiple machines and in the cloud anyway, and my iTunes library is also backed up to a pair of external hard drives rotated weekly to an off-site location. So I'm willing to sit and wait, in the interests of seeing how long it takes for the required six CRC errors to accumulate.

In the meantime, everything is so much snappier. The "All Programs" menu appears in less than a second rather than anything up to a minute, and I can watch TV while recording three programs simultaneously and running Eclipse, Thunderbird and Firefox.

Life is good again!