[ale] Errors & Celsius, WAS: Re: Spinrite, or BIOS, or something drops hdd error rate 5X

Ron Frazier atllinuxenthinfo at c3energy.com
Sat Jan 8 14:48:12 EST 2011


Paul,

I will admit that I don't know exactly what these numbers are
monitoring, but here are some observations.  On your sdb drive, both the
temperature and the airflow are in the same general range.  I'm running
a Spinrite diagnostic right now on my laptop, and the temperature it's
reporting on my drive is hovering around 44 Celsius or 111 Fahrenheit.
I think Spinrite screams and yells if it gets above 52 Celsius or so.
I'm not sure what the break point is as I've only seen the error once.

The two readings on your sda drive seem very different.  I note that the
airflow temperature is a good bit higher than on the other drive.  The
apparent drive temperature of 112 degrees Celsius is VERY hot.
According to this chart I found at:

http://en.wikipedia.org/wiki/Temperature_conversion_formulas

112 degrees Celsius is, well, off the chart, but it works out to about
234 degrees Fahrenheit.  If anything in your computer is that hot,
something is probably wrong.  As a comparison, the AMD CPU in my desktop
system will only tolerate 62 degrees Celsius and the Intel CPU in my
laptop will only tolerate 100 - 110 degrees Celsius.  (Yes, I found out
that every CPU is different.)
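
For anyone who wants to check the conversions in this message without a
chart, here is a minimal Python sketch of the standard formula
F = C * 9/5 + 32.  The function name is made up for illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


if __name__ == "__main__":
    # Temperatures mentioned in this message, in degrees Celsius.
    for c in (31, 44, 62, 112):
        print(f"{c} C = {celsius_to_fahrenheit(c):.1f} F")
```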

I would check all the fans in the unit that you can get to.  Make sure
that they're running.  Check all intake ports to make sure they're not
covered with dust.  I have to vacuum off the outside of my computer case
where the fans are once or twice per year.  If you can get to the
drives, carefully put your hand near them and see if you feel heat.  If
not, you can try to touch them.  The main disk drive in my desktop,
which I'm writing this message on, is cool to the touch.  I have forced
air running over it.  (Granted, the computer is mostly idle right now.)
The smart monitor function of the Ubuntu disk utility is reading a drive
temperature of 31 degrees Celsius or 88 degrees Fahrenheit.

I've read that substantial temperature increases can mean a bad bearing
or servo.  If the reading is real, and the cooling systems are working,
I'd back up my data right away and stay very suspicious of the drive and
consider replacing it.
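
One way to check whether a reading like this is real: by default,
smartd's "changed from X to Y" messages appear to report the normalized
attribute value, which is a vendor-scaled number and not necessarily
degrees.  The raw value printed by `smartctl -A /dev/sda` is usually the
better place to look for the actual temperature, although its meaning is
also vendor-specific.  The sketch below parses text in the usual
ten-column smartctl attribute table.  The function name is made up, and
the sample text is invented for illustration; it is NOT output from the
drives discussed here.

```python
def parse_smart_attributes(text):
    """Parse `smartctl -A` table text into {id: (name, normalized, raw)}.

    Assumes the usual column layout:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    attributes = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        name = fields[1]
        normalized = int(fields[3])
        raw = fields[9].strip()  # Kept as text; some drives append extra info.
        attributes[attr_id] = (name, normalized, raw)
    return attributes


# Hypothetical sample, NOT real output from the drives discussed here.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   062   045   040    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   112   098   000    Old_age   Always       -       38
"""

if __name__ == "__main__":
    for attr_id, (name, normalized, raw) in parse_smart_attributes(SAMPLE).items():
        print(f"{attr_id} {name}: normalized={normalized} raw={raw}")
```

If the raw value for attribute 194 turns out to be in the 30s or 40s,
the drive is probably fine and the 112 is just the normalized number.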

I had an interesting and frustrating experience with my Dell laptop a
while back.  I would leave it running a virus scan or something
overnight and come back to find it either off or rebooted.  I
eliminated the possibility of Windows rebooting for patches, although
that can happen.  I found something to monitor CPU temperature and found
that, during times of activity, the CPU was hovering right around 90
degrees Celsius, close to the upper limit for that system.  I looked at
the fan ports and found nothing wrong.

Not willing to accept that the computer just couldn't keep its cool, I
decided to take off the cover and look at the heat sink.  This computer
has what I consider to be a poor cooling design.  The intake fan is on
the bottom, right on your leg if you have the machine on your lap.  Air
is drawn in and blown over a heat pipe through a narrow channel to a
radiator at the rear which is literally about 2 inches wide.  The
radiator has metal louvers with very narrow slits between them.

After very carefully removing the heat sink / heat pipe assembly, I
found a big glob of dust completely covering 3/4 of the INSIDE surface
of the radiator.  This was invisible from the outside.  I removed the
dust and brushed off the louvers.  Then, I carefully put everything back
together, making sure the heat pipe plates properly mated with the CPU
and another chip (north bridge?) that they were attached to.  Now, I can
run the machine as hard as I want and the CPU temp rarely gets above 80
degrees Celsius, which is still 20 degrees from its maximum limit.

I have another story about overheating.  I recently updated my desktop
from an AMD dual core CPU chip to an AMD 4 core chip.  I installed the
stock cooler that it came with.  At the time, I was running the Prime95
program to help with the distributed search for new prime numbers.
This runs the CPU at full load.  The CPU temperature
immediately pegged at about 62 degrees Celsius and stayed there.  This
is the limit for this chip, and presumably, it was throttling to prevent
burning up.  The fan was running like a hurricane, but wasn't enough.

I did some research and went and got a Corsair H70 liquid cooling unit
with dual fans and a large radiator that looks like a miniature version
of what's in your car.  I've always shied away from liquid cooling for
computers since water and electricity don't mix, etc.  However, this is
a sealed unit and comes pre-filled with liquid.  There is a heat plate /
pump unit which attaches to the CPU and is powered from a fan port on
the motherboard.  That was a bit of a bear to attach mechanically.  You
have to set your BIOS to run that fan port at full speed, so the pump
always gets full power.  The fans attached to the radiator can run from
the CPU fan port at variable speeds.

Although it was a pain to install, and I virtually had to take the whole
computer apart and put it back together again, I totally love the
cooling unit (as long as it doesn't leak).  At idle, the CPU temp is
only slightly above the ambient temperature of the case inlet air.  At
full load, the CPU temperature rarely gets above 50 degrees Celsius,
which is still 12 degrees below its maximum limit.  I couldn't believe
the limit was as low as it was, and that the stock air cooler was
totally lame.

In any case, I'm now in love with liquid cooling for the CPU.  I'm not
as convinced that I need a liquid loop on all the other components.  My
case has 2 intake air fans and 3 exhaust fans, 4 if you count the one in
the power supply.  Everything inside seems to stay nice and happy.  My
one misgiving is that the liquid pump could fail and I might not know
it.  In that case, the CPU would rapidly overheat.  These days, I'm
always keeping one eye on the temperature.  The MSI motherboard I have
has an LED display which shows the CPU temperature when it's not running
POST, etc.  It normally stays around 32 degrees Celsius.  At one point,
I had lm-sensors running and showing the temperature on the screen, but
some Linux patches killed it and I haven't fixed it.  I think I have to
compile a kernel module or something.  For my other machines, I have a
widget on the Gnome panel or Windows panel that always shows my CPU
temperature and the clock speed for each core (Linux only).
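
On Linux, one way to keep an eye on temperatures from a script is to
read the kernel's hwmon files under /sys/class/hwmon, where files named
temp*_input report millidegrees Celsius.  The sketch below assumes a
suitable sensor driver is already loaded.  The function names and the
60-degree warning threshold are made up for illustration and should be
adjusted for the chip in question.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees_celsius} for every readable hwmon sensor."""
    temps = {}
    for sensor in sorted(Path(base).glob("hwmon*/temp*_input")):
        name_file = sensor.parent / "name"
        try:
            if name_file.exists():
                chip = name_file.read_text().strip()
            else:
                chip = sensor.parent.name
            # The kernel reports temperatures in millidegrees Celsius.
            value_c = int(sensor.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # Sensor not readable right now; skip it.
        temps[f"{chip}/{sensor.name}"] = value_c
    return temps


def over_limit(temps, limit_c):
    """Return only the sensors whose reading is at or above limit_c."""
    return {name: t for name, t in temps.items() if t >= limit_c}


if __name__ == "__main__":
    readings = read_hwmon_temps()
    for name, t in readings.items():
        print(f"{name}: {t:.1f} C")
    # Hypothetical threshold; pick one that matches your CPU's rated limit.
    for name, t in over_limit(readings, 60.0).items():
        print(f"WARNING: {name} is at {t:.1f} C")
```

Run from cron or a loop, something like this could catch a failed pump
sooner than glancing at a panel widget now and then.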

I no longer run the Prime95 program all the time because I figure the
odds of me finding a world record prime number and winning a prize are
remote.  In the case of the desktop 4 core machine, it adds a
nontrivial $7 / mo to my power bill, which amounts to about 2%.  OK,
it's not earth shaking either.  ( 0.1 kW * 24 hr * 30 days = 72 kWh ~
$7.20 )  On
my laptop, running the CPU at 100% runs the fan like crazy.  That fan is
not user replaceable, at least not easily.  So, I've decided not to
devote my CPU cycles to this scientific project, although the concept is
still neat.  I still use Prime95 for stress testing the machine on
occasion.
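
The power estimate above can be reproduced with a few lines of Python.
The 0.1 kW draw is from this message; the $0.10 per kWh rate is an
assumption inferred from 72 kWh coming out to about $7.20, not a quoted
utility rate.  The function names are made up for illustration.

```python
def monthly_energy_kwh(power_kw, hours_per_day=24, days=30):
    """Energy used in a month by a constant load, in kilowatt-hours."""
    return power_kw * hours_per_day * days


def monthly_cost(power_kw, rate_per_kwh, hours_per_day=24, days=30):
    """Cost of running a constant load for a month at a flat rate."""
    return monthly_energy_kwh(power_kw, hours_per_day, days) * rate_per_kwh


if __name__ == "__main__":
    extra_kw = 0.1   # Extra draw at full CPU load, from the message.
    rate = 0.10      # Assumed dollars per kWh (inferred, not quoted).
    print(f"{monthly_energy_kwh(extra_kw):.0f} kWh per month")
    print(f"${monthly_cost(extra_kw, rate):.2f} per month")
```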

Sincerely,

Ron

On Sat, 2011-01-08 at 08:08 -0500, Paul Cartwright wrote:
> On 01/07/2011 02:51 PM, Ron Frazier wrote:
> > Just thought I'd pass along some interesting results I'm getting while
> > running Spinrite (as discussed on prior thread "Which large capacity
> > drives are you having the best luck with?") on a new drive I just
> > bought.  The utility is doing a very intensive non destructive surface
> > analysis of the whole drive, using numerous read / write data patterns.
> I was just looking at my logs, and I'm not sure if it means anything,
> and I don't know the difference between Airflow_temperature &
> temperature celsius, but my MAIN drive temp seems to be twice that of my
> 2nd drive..
> there was no entry in the syslog for sda with raw_read_error_rate... nor
> the Hardware_ECC_Recovered.
> 
> Jan  8 07:59:54 paulandcilla smartd[4605]: Device: /dev/sda, SMART Usage
> Attribute: 190 Airflow_Temperature_Cel changed from 63 to 62
> Jan  8 07:59:54 paulandcilla smartd[4605]: Device: /dev/sda, SMART Usage
> Attribute: 194 Temperature_Celsius changed from 113 to 112
> Jan  8 07:59:54 paulandcilla smartd[4605]: Device: /dev/sdb, SMART
> Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 103 to 99
> Jan  8 07:59:54 paulandcilla smartd[4605]: Device: /dev/sdb, SMART Usage
> Attribute: 190 Airflow_Temperature_Cel changed from 56 to 55
> Jan  8 07:59:54 paulandcilla smartd[4605]: Device: /dev/sdb, SMART Usage
> Attribute: 194 Temperature_Celsius changed from 44 to 45
> Jan  8 07:59:54 paulandcilla smartd[4605]: Device: /dev/sdb, SMART Usage
> Attribute: 195 Hardware_ECC_Recovered changed from 59 to 60
> 
> 
> 

-- 

(PS - If you email me and don't get a quick response, you might want to
call on the phone.  I get about 300 emails per day from alternate energy
mailing lists and such.  I don't always see new messages very quickly.)

Ron Frazier

770-205-9422 (O)   Leave a message.
linuxdude AT c3energy.com


