AnswerCult

Question

154 views · January 1, 2024

Question · January 20, 2023

I work at an arcade as a mechanic. I know mostly general electronics, so I don’t know much about the more advanced architectural side of computers. My boss told me to reflow a dead GPU to see if it would work, so, following the instructions in a YouTube video, I waved a heat gun over it for 8 minutes. Plugged it back in and, what do you know, it worked.

Why does this work, though? From my understanding, all this does is soften the solder for a few moments before letting it cool again.


5 Answers


Answer 1 · 2023-01-20T20:47:56+00:00

It DOES NOT.

This is one of the main myths circulating around the internet about how the Xbox 360 RRoD (Red Ring of Death) happens and how to fix it. To some extent the same goes for the PS3 YLOD and a lot of laptop GPU failures, especially in 2010-ish MacBooks.

The big chips are built from circuitry etched onto a piece of silicon, then covered by a protective insulation layer with holes passing through to the micro pads that connect power, ground, and signal traces to the chip circuitry.

Then the chip is turned over, pads facing down, and put on a small piece of PCB called the interposer, with micro solder bumps placed between the pads on the silicon and the interposer, connecting the chip circuitry to the interposer. Turning the chip over so the circuitry faces down is why this is called “flip chip” packaging.

Then the space between the silicon and the interposer is filled with a hard glue called underfill.

Then bigger solder balls are placed onto the pads under the interposer. These are the connection to the motherboard, and they are soldered at the final assembly factory.

The myth is that those bigger solder balls crack and cause the failure, and that reflowing (and sometimes reballing) the chip, making new solder connections from interposer to motherboard, fixes the issue. This is WRONG.

The true problem is twofold:

One: the material for the interposer is not optimal. Its thermal expansion coefficient is too different from that of silicon, so the two components expand by different amounts when heated up to join the micro solder bumps. When the chip cools down, they contract differently, leaving the chip permanently warped a little bit and creating mechanical stress on the bumps. When the chip runs, the temperature rises and warps the chip back, so with each power cycle the bumps get flexed.
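To get a feel for the size of that mismatch, here is a rough back-of-the-envelope sketch. The numbers are assumptions for illustration only: about 2.6 ppm/K is a commonly quoted figure for silicon, the 16 ppm/K substrate value, the 10 mm distance, and the 60 K swing are made up, and the function name `expansion_mismatch_um` is hypothetical. It also ignores the underfill and the bumps resisting the movement, so treat it as an upper-bound style estimate, not a model of any real package.

```python
# Rough estimate of how far the die and the substrate try to move relative
# to each other when the temperature changes.
# All numbers are illustrative assumptions, not measurements of a real GPU.

SILICON_CTE_PPM_PER_K = 2.6     # commonly quoted value for silicon
SUBSTRATE_CTE_PPM_PER_K = 16.0  # assumed value for an organic substrate


def expansion_mismatch_um(distance_from_centre_mm, delta_t_k,
                          cte_a_ppm=SILICON_CTE_PPM_PER_K,
                          cte_b_ppm=SUBSTRATE_CTE_PPM_PER_K):
    """Unconstrained relative displacement, in micrometres, at a point
    distance_from_centre_mm from the die centre for a temperature change
    of delta_t_k kelvin."""
    mismatch_ppm = abs(cte_b_ppm - cte_a_ppm)
    # ppm * mm gives nanometres per kelvin; divide by 1000 for micrometres.
    return mismatch_ppm * distance_from_centre_mm * delta_t_k / 1000.0


if __name__ == "__main__":
    # Hypothetical case: a corner bump 10 mm from the centre, 60 K swing.
    print(f"{expansion_mismatch_um(10.0, 60.0):.2f} um")
```

With those assumed numbers the result is about 8 micrometres of relative movement per thermal swing, which is not small next to a micro bump.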

Two: the underfill material isn’t right either. It has too low a “glass transition” temperature, only around 70 degrees Celsius, while those high-performance GPUs commonly run hotter than that, sometimes reaching more than 90 degrees. The solid, glass-like underfill turns to rubber-like mush, retaining less than 1% of its advertised strength and no longer holding the silicon in place, so the bumps take all the mechanical stress from the chip warping back and forth.
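If you have a temperature log from a card, a quick check like the one below shows how much of the time it sits above that threshold. This is only a sketch: the 70 degree figure is the one quoted above, the sample readings are made up, and `fraction_above_tg` is a hypothetical helper, not part of any monitoring tool.

```python
# Fraction of temperature samples that sit above the underfill's
# glass transition temperature (Tg).

UNDERFILL_TG_C = 70.0  # approximate figure quoted in the answer


def fraction_above_tg(temps_c, tg_c=UNDERFILL_TG_C):
    """Return the share of samples (0.0 to 1.0) strictly above tg_c."""
    if not temps_c:
        return 0.0
    above = sum(1 for t in temps_c if t > tg_c)
    return above / len(temps_c)


if __name__ == "__main__":
    # Made-up readings in degrees Celsius: idle, light load, gaming, gaming.
    samples = [45.0, 65.0, 75.0, 92.0]
    print(f"{fraction_above_tg(samples):.0%} of samples above Tg")
```

Every sample above the threshold is time when, going by the explanation above, the underfill is not doing its job and the bumps carry the load.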

After a certain number of cycles, the bumps break. On the Xbox 360, it’s always the signal bumps connecting the GPU memory controller to the RAM interface that fail first, due to their placement on the package. This creates the common errors, usually 0101 or 0102: the system cannot boot without RAM, so the system management controller halts the boot and throws those error codes.

What’s even more unfortunate is that heating the board to around 150 degrees Celsius will appear to fix the issue, because it softens the underfill and induces more changes in warp, which may push the cracked bumps back together. Adjusting X-clamp pressure may achieve the same effect, creating the illusion that the repair worked.

This apparent fix sent everybody in the wrong direction, including Microsoft when they tried to figure out what went wrong. They would receive RRoD consoles and test them, finding RAM failure error codes. Replacing the RAM needs a BGA rework, which heats up the entire board and resuscitates the GPU, so they thought that was the fix, only to have the console fail again later.

Continued usage will further tear open the unstable bumps, irreversibly ruining the GPU. Errors like E74 come from another connection failing inside the package, between the big GPU silicon die and the smaller eDRAM die. That proves 100% that it’s NOT the solder balls under the interposer failing, because the connection between those two dies never travels down to that level.
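The E74 argument is really just a process of elimination, which can be written down as a tiny lookup. This is a simplified sketch of the signal paths as described in this answer, not a schematic of the real package, and the names (`JOINTS_ON_PATH`, `bga_fault_possible`) are made up for illustration.

```python
# Which levels of solder joint does each failing signal path cross?
# Simplified from the description above; not a real package schematic.

FLIP_CHIP_BUMPS = "flip-chip bumps"  # between the silicon and the interposer
BGA_BALLS = "BGA balls"              # between the interposer and the motherboard

# GPU memory controller to RAM on the motherboard: crosses both levels.
GPU_TO_RAM = (FLIP_CHIP_BUMPS, BGA_BALLS)
# GPU die to eDRAM die: stays on the interposer, never reaches the BGA balls.
GPU_TO_EDRAM = (FLIP_CHIP_BUMPS,)

JOINTS_ON_PATH = {
    "0101": GPU_TO_RAM,
    "0102": GPU_TO_RAM,
    "E74": GPU_TO_EDRAM,
}


def bga_fault_possible(error_code):
    """True if the failing path for this error code crosses the BGA balls."""
    return BGA_BALLS in JOINTS_ON_PATH[error_code]


if __name__ == "__main__":
    for code in JOINTS_ON_PATH:
        verdict = "possible" if bga_fault_possible(code) else "ruled out"
        print(f"{code}: BGA ball fault {verdict}")
```

For 0101 and 0102 the path alone can't tell you which level failed, which is part of why the myth stuck. For E74 the BGA balls aren't on the path at all.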

The wrong material choices were mainly due to the push for performance, which left the chip running extremely hot, and the compliance requirement for lead-free solder bumps and for other materials that also had to be free of hazardous substances. Lead-free meant higher soldering temperatures during micro bump joining, building even more warpage stress into the package. The decision to rush the product to market meant NOBODY had adequate time to test the chips; they chose some of the materials they had always used, not realizing they would break.

The same issue also happened to the PS3. They also didn’t know these were the wrong materials, but better package wiring design meant the signal bumps never failed first; only redundant power/ground bumps did. A few of them failing will not cause apparent issues. Only once more bumps fail, and the remaining power bumps can’t handle the full load and burn out, does the console actually fail. That pushed the onset of the problem much later, making it less ugly than on the 360.


Why does “reflowing” a gpu or similar electronic return its functionality?
