Why does “reflowing” a GPU or similar electronic device restore its functionality?

I work at an arcade as a mechanic. I mostly know general electronics, so I don’t know much about the more advanced architectural side of computers. My boss told me to reflow a dead GPU to see if it would work, so, following the instructions in a YouTube video, I waved a heat gun over it for 8 minutes. I plugged it back in and, what do you know, it worked.

Why does this work, though? From my understanding, all this does is soften the solder for a few moments before letting it cool again.

5 Answers

Anonymous 0 Comments

Yes, exactly. Sometimes solder joints crack and break. Reheating them can fix this. However, heating up the entire board/card can cause other issues.

Anonymous 0 Comments

The idea is that it melts the solder connections so that any broken connections between components and the board can reconnect.

Anonymous 0 Comments

Well, have you ever felt the heat that comes from a wire carrying electricity? Or from any electronic device, for that matter?

This heat is the main reason electronics fail. After many cycles of powering on and off, the constant expansion and contraction of the metal creates fissures (often invisible to the naked eye, but more than capable of stopping electricity from flowing). Raising the temperature above the solder’s melting point (which is never reached at normal operating temperatures) allows the metal to flow and “repair” itself.

Extra: the damage from this constant expansion and contraction (in this case due to heating and cooling) is better known as fatigue.

Anonymous 0 Comments

You reflowed a GPU by following a YouTube video and it worked. Man, you were lucky. That is typically hard work, as everything, all the solder balls, needs to be perfectly aligned.

Anonymous 0 Comments

It DOES NOT.

This is one of the main myths circulating around the internet about how the RRoD (the Xbox 360 “Red Ring of Death”) happens and how to fix it. To some extent the same goes for the PS3 YLOD and a lot of laptop GPU failures, especially in 2010-ish MacBooks.

The big chips are built with circuitry etched onto a piece of silicon, then covered with a protective insulation layer, with holes passing through it to the micro pads that connect power, ground, and signal traces to the chip circuitry.

Then the chip is turned over, pads facing down, and placed on a small piece of PCB called the interposer, with micro solder bumps between the pads on the silicon and the interposer, connecting the chip circuitry to the interposer. Turning the chip over so its circuitry faces down is why this is called “flip chip” packaging.

Then the space between the silicon and the interposer is filled with a hard glue called underfill.

Then bigger solder balls are placed onto the pads under the interposer. These are the connection to the motherboard and are soldered at the final assembly factory.

The myth is that those bigger solder balls crack and cause the failure, and that reflowing (and sometimes reballing) the chip to make new solder connections from the interposer to the motherboard fixes the issue. This is WRONG.

The true problem is twofold:

One: the interposer material is not optimal. Its thermal expansion coefficient is too different from that of silicon, so the two components expand by different amounts when heated to join the micro solder bumps. When the chip cools down, they contract differently, leaving the chip permanently warped a little bit and creating mechanical stress on the bumps. When the chip runs, the temperature rises and warps the chip back, so with each power cycle the bumps get flexed.
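
To get a feel for the scale of this mismatch, here is a rough sketch using the standard linear thermal expansion relation dL = alpha * L * dT. The expansion coefficients, die size, and temperature swing below are illustrative assumptions of a typical order of magnitude, not measured values for any particular GPU package, and the function names are made up for this example.

```python
# Rough illustration of thermal expansion mismatch between a silicon die
# and an organic interposer. All numeric values are ASSUMED for illustration.

def linear_expansion_um(alpha_ppm_per_c: float, length_mm: float, delta_t_c: float) -> float:
    """Length change in micrometres from dL = alpha * L * dT."""
    delta_mm = alpha_ppm_per_c * 1e-6 * length_mm * delta_t_c
    return delta_mm * 1000.0  # convert mm to micrometres


def mismatch_um(alpha_a: float, alpha_b: float, length_mm: float, delta_t_c: float) -> float:
    """Difference in expansion between two materials over the same length."""
    return abs(
        linear_expansion_um(alpha_a, length_mm, delta_t_c)
        - linear_expansion_um(alpha_b, length_mm, delta_t_c)
    )


ALPHA_SILICON = 2.6      # ppm/°C, assumed typical value for silicon
ALPHA_INTERPOSER = 15.0  # ppm/°C, assumed value for an organic substrate
DIE_EDGE_MM = 15.0       # assumed die edge length
DELTA_T_C = 60.0         # assumed swing, e.g. 30 °C idle to 90 °C under load

if __name__ == "__main__":
    gap = mismatch_um(ALPHA_SILICON, ALPHA_INTERPOSER, DIE_EDGE_MM, DELTA_T_C)
    print(f"Expansion mismatch across the die: {gap:.2f} micrometres")
```

A few micrometres sounds tiny, but the micro bumps are themselves very small, and under these assumptions they see that shear again on every power cycle.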

Two: the underfill material isn’t right either. Its “glass transition” temperature is too low, only around 70 degrees Celsius, while those high-performance GPUs commonly run hotter than that, sometimes reaching more than 90 degrees. The solid, glass-like underfill turns into rubber-like mush, retaining less than 1% of its advertised strength and no longer holding the silicon in place, so the bumps take all the mechanical stress from the chip warping back and forth.
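
As a small sketch of why this matters over time, the snippet below counts how often a temperature trace crosses the roughly 70 degree glass transition mentioned above. The trace is hypothetical, the function names are made up for this example, and a real glass transition is a gradual softening rather than the sharp threshold assumed here.

```python
# Count how many times a die temperature trace pushes the underfill past
# its glass transition temperature. Simplified model: a hard threshold.

TG_C = 70.0  # approximate glass transition temperature cited in the answer


def underfill_state(temp_c: float, tg_c: float = TG_C) -> str:
    """Classify the underfill as 'glassy' (rigid) or 'rubbery' (softened)."""
    return "rubbery" if temp_c > tg_c else "glassy"


def count_softening_events(temps_c, tg_c: float = TG_C) -> int:
    """Count transitions from glassy to rubbery across a temperature trace."""
    events = 0
    was_soft = False
    for temp in temps_c:
        is_soft = temp > tg_c
        if is_soft and not was_soft:
            events += 1
        was_soft = is_soft
    return events


if __name__ == "__main__":
    # Hypothetical readings in °C: two gaming sessions with a cool-down between.
    trace = [30, 55, 85, 92, 60, 35, 88, 91, 40]
    print("Softening events:", count_softening_events(trace))
```

Each such event is a stretch of time in which, per the explanation above, the bumps rather than the underfill carry the warping stress.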

After a certain number of cycles, the bumps break. On the Xbox 360, it’s always the signal bumps connecting the GPU memory controller to the RAM interface that fail first, due to their placement on the package. This produces the common error codes, usually 0101 or 0102: the system cannot boot without RAM, so the system management controller halts the boot and throws those error codes.

What’s even more unfortunate is that heating the board to around 150 degrees Celsius will appear to fix the issue, because it softens the underfill and induces more warp changes, which may push the cracked bumps back together. Adjusting the X-clamp pressure may achieve the same effect, creating the illusion that the repair worked.

This apparent fix sent everybody in the wrong direction, even Microsoft when they tried to figure out what went wrong. They would receive RRoD consoles, test them, and find RAM failure error codes. Replacing the RAM required BGA rework, which heats up the entire board and resuscitates the GPU, so they thought that was the fix, only to have the console fail again later.

Continued usage will tear the unstable bumps open further, irreversibly ruining the GPU. Errors like E74 come from another connection failing inside the package, between the big GPU silicon die and the smaller eDRAM die, which proves 100% that it’s NOT the solder balls under the interposer failing, because the connection between those two dies never travels down to that level.

The wrong material choices were mainly due to the push for performance, which left the chip running extremely hot, and to the compliance requirement for lead-free solder bumps and other materials free of hazardous substances. Lead-free meant higher soldering temperatures during micro bump joining, building even more warpage stress into the package. The decision to rush the product to market meant NOBODY had adequate time to test the chips; they chose materials they had always used, not realizing they would break.

The same issue also happened to the PS3. They also didn’t know these were the wrong materials, but better package wiring design meant the signal bumps did not fail first; only redundant power/ground bumps did. A few of those failing will not cause apparent issues. Only when more bumps fail, and the remaining power bumps can no longer handle the full load and burn out, does the console actually fail. This pushed the onset of the problem much later and made it less ugly than on the 360.