[ale] fascinating data on temperature, including ATI / AMD Radeon gpu

Thu Apr 25 01:57:21 EDT 2013

On 4/25/2013 12:56 AM, Alex Carver wrote:
> On 4/24/2013 17:47, Ron Frazier (ALE) wrote:
>
>>
>> My opinion is that any solid state component in my system should be fine
>> if I stay at least 15 degrees below the maximum limits listed.
>> Mechanical devices (hdd's, optical drives, floppy drives) are a whole
>> other matter.
>>
>> In my opinion, with proper ventilation, the PC should be able to run
>> almost indefinitely at full load at Tmax - 15.  I don't believe I'm
>> shortening the life substantially.  Again, I could be wrong.
>
> Nope, you are causing damage and shortening the life (but you probably 
> will get useful life out of the chip).  Semiconductors do NOT like 
> heat, period.  The problem is that heat causes a slow degradation of 
> the P-N junctions and the metal contacts (diffusion).  The damage is 
> cumulative.  At low temperatures (below about 40 C) the diffusion is 
> very slow. Increasing the temperature accelerates the process along an 
> exponential curve. [1]  The likelihood of the chip up and dying in a 
> year if you run it hot is low but it's not zero.  However, run it for 
> a few years like that and its life will be shorter.
>
> The eventual outcomes of the thermal stress are non-functional P-N 
> junctions (device characteristics change) or short circuits (due to 
> metal migration).  Lots of steps have been taken to mitigate these 
> effects but the process can't be stopped entirely simply because 
> performance and economics prevent it.  The more performance we try to 
> push out of chip designs the more susceptible they become to these 
> types of damage.[2]  However, the expected useful operating life is so 
> short that the damage is simply accepted.  The manufacturer assumes 
> that after a few years the chip will end up in the trash because the 
> owner replaced it with the next greatest thing on the market at that 
> point.
>
>
>
> [1] Note that the temperatures you report are *external* temperatures 
> as reported near the package or heatsink, not on the silicon die 
> inside the package.  If your sensor is reporting 50 C it's very likely 
> that the on-die temperature is closer to 100 C.  What is most critical 
> for keeping your chip cool is to ensure maximum thermal transfer 
> (maximum conductance of heat) and a large temperature differential 
> (cold environment).  Lowering your internal case temperature by just 
> ten degrees (e.g. more airflow) can have a big effect on the die 
> temperature as more heat can flow out.  Also, ensuring that you have 
> good conduction between the package and the heatsink using an 
> appropriate (and *proper* amount of[3]) thermal grease (silver colloid 
> is best) will ensure heat does not become trapped in the package.
>
>
> [2]The Coppermine process that Intel developed several years ago was 
> one of the attempts.  Most chips prior to that point used aluminum as 
> the interconnect wiring.  However, if the temperature gets high 
> enough, the aluminum alloys with the silicon and "spikes" into it 
> causing shorts. With processes approaching nanometer scales[4], this 
> spiking was absolutely disastrous.  So Intel developed a process to 
> use copper instead of aluminum to wire the chip.  The idea was that 
> copper is a very good thermal conductor in addition to a good 
> electrical conductor (better than aluminum).  Lowering the electrical 
> resistance reduces heating (Joule heating, P = I^2 * R) and lowering 
> thermal resistance means the heat can be pulled away from the 
> junctions towards the outside of the package faster (a chip 
> typically has metal covering 40-60% of its total area).  However, 
> copper can alloy with silicon at room temperature which leads to its 
> own problems.  So the copper is isolated from the silicon with thin 
> layers of other metals like platinum, titanium, nickel, tantalum, etc. 
> (e.g. anything not terribly reactive).  The Coppermine process worked 
> well for managing heat but it's more complicated and expensive to 
> process compared to the standard aluminum deposition methods.  There's 
> a lot of research into cooling chips at nanometer scales but not much 
> more beyond the Coppermine process that will help without major 
> modifications to fab lines.  This is part of the reason why the 
> multi-core chips showed up.  Split the work across several cores with 
> an on-chip controller and the chips individually don't have to work so 
> hard which means they run cooler (and more efficiently since cooler 
> chips have larger signal margins and fewer errors).
>
> [3] There is such a thing as too much thermal grease.  A proper amount 
> fills the microscopic voids in the two macroscopically flat surfaces 
> without interfering with the physical proximity of the surfaces.  
> That's what permits maximum thermal transfer.  Too much grease puts a 
> gap between the surfaces and reduces thermal transfer efficiency.  
> Silver colloid is the best because the silver microparticles are 
> pliable and can help to fill the voids and improve contact between the 
> surfaces. The sticky pads that some companies put on their chips are 
> garbage. They're dirt cheap to make which is why the companies use 
> them (silver colloid is expensive) but they don't have very good 
> thermal conduction.  I rip out those pads whenever I find them and use 
> silver colloid. Again, just use a proper (tiny) amount and make sure 
> to not slather it on.  Not only will it reduce cooling efficiency but 
> it can also cause short circuits.
>
>
>
> [4] Chip scales have gotten so small and the device density so high 
> that the major manufacturers (Intel, AMD, Samsung, etc.) are having to 
> concern themselves with cosmic ray events.  Older chips might have one 
> cosmic ray event in a year or two.  Today's chips are so densely 
> packed with such tiny devices on them (approaching single digit 
> nanometers in many cases) that it's becoming more likely to see a 
> cosmic ray event once every week and sometimes once a day at higher 
> altitudes.  They've considered adding special circuits to monitor for 
> cosmic ray events and signal the CPU to repeat the last instruction if 
> an event is detected.
>

Hi Alex,

Wow.  Thanks for posting that great technical data.  Sorry if I misstated 
things.  I guess there's a lot going on under the chip's cover.

The temperature I'm measuring is what's reported by the AMD OverDrive 
utility as "temperature".  It's the same number SpeedFan reports as 
"core" temperature.  So, I'm assuming that's the on-die temperature and the 
same one that can only go to 71 deg C, using my Phenom II X6 as an example.

So, what you're telling me is that my CPUs, memory, etc., will just 
spontaneously fail, even if they're not zapped by power surges and such.

I'm sitting next to a vintage 2002 laptop with a Pentium 4 chip.  At the 
moment, it still works, but I don't use it that often or that hard.  
Anyway, I've definitely been known to keep computers for 10 years.

If you have any idea, about how long would a Phenom II x6 be expected to 
last if it's always running below 40 deg C versus if it's always running 
at Tmax - 15 or 56 deg C?
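
To frame that question, here is a minimal sketch of the Arrhenius
acceleration-factor model, which is one common way to describe the kind of
exponential temperature dependence Alex mentions.  The 0.7 eV activation
energy is purely an assumed illustrative value (real values depend on the
failure mechanism), the function name is hypothetical, and the sketch treats
the reported "core" temperature as the junction temperature, which may not
hold.

```python
import math

# Boltzmann constant in electron-volts per kelvin.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def acceleration_factor(t_cool_c, t_hot_c, ea_ev=0.7):
    """Relative speed-up of a thermally activated wear-out process.

    Arrhenius model: AF = exp((Ea / k) * (1/T_cool - 1/T_hot)),
    with temperatures in kelvin.  ea_ev is an ASSUMED activation
    energy chosen for illustration, not a measured value for any chip.
    """
    t_cool_k = t_cool_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_cool_k - 1.0 / t_hot_k)
    return math.exp(exponent)


if __name__ == "__main__":
    # Compare running at 40 deg C with running at Tmax - 15 = 56 deg C.
    af = acceleration_factor(40.0, 56.0)
    print(f"Wear-out runs about {af:.1f}x faster at 56 C than at 40 C")
```

With that assumed activation energy, the ratio comes out to roughly 3.5.
That is only a relative rate between the two temperatures, not an absolute
lifetime, and a different activation energy would change it considerably.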

I noticed you said they "considered" adding cosmic ray detection but it 
sounds like they haven't.

Sincerely,

Ron

-- 

(PS - If you email me and don't get a quick response, you might want to
call on the phone.  I get about 300 emails per day from alternate energy
mailing lists and such.  I don't always see new email messages very quickly.)

Ron Frazier
770-205-9422 (O)   Leave a message.
linuxdude AT techstarship.com
Litecoin: LZzAJu9rZEWzALxDhAHnWLRvybVAVgwTh3
Bitcoin: 15s3aLVsxm8EuQvT8gUDw3RWqvuY9hPGUU