It’s essentially counters and clocks.
Counters are exactly what they sound like. For example, every time a packet gets transmitted through a network card, the corresponding counter is incremented by one. To know a rate, you have to know an amount of change and the time frame over which that change happened (just as we express speed in miles per hour or kilometers per hour).
These counters are usually (but not always) maintained by the system kernel to prevent tampering. The kernel allows higher-level programs to read these counters. Monitoring software will periodically read these counters, and also record the exact time when the reading was taken. With multiple readings, you then have multiple data points with which to establish a rate.
For example, let's imagine we are monitoring network traffic and reading the counters once every second. If a reading at 9:00:01 shows the counter at 1005 packets sent, and another reading at 9:00:02 shows 1015, then we have established that the average rate over that period was 10 packets per second.
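Here's a rough sketch of that in Python, assuming a Linux machine where the kernel exposes the counter under /sys (the interface name "eth0" is just an example; yours may differ):

```python
import time

# Path where the Linux kernel exposes the "packets transmitted"
# counter for a network card. "eth0" is an assumption.
COUNTER_PATH = "/sys/class/net/eth0/statistics/tx_packets"

def read_counter():
    with open(COUNTER_PATH) as f:
        return int(f.read())

# Take two readings, recording the exact time of each one.
count_1, time_1 = read_counter(), time.monotonic()
time.sleep(1)  # wait roughly one second between readings
count_2, time_2 = read_counter(), time.monotonic()

# Rate = amount of change / time over which the change happened.
rate = (count_2 - count_1) / (time_2 - time_1)
print(f"Average rate: {rate:.1f} packets per second")
```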
Almost everything that produces metrics is built on counters. For example, figuring out how busy a CPU is comes down to figuring out how much time it spends working and how much time it sits idle. That's a surprisingly tough problem, because the CPU doesn't inherently know how much time passes between one calculation and the next. To figure it out, it uses an external time source: it reads the time, does a set number of calculations, then reads the time again to see how much has passed. From that, it knows how many calculations it can do in a given amount of time. Once the CPU has established that rate, everything else is counters. There are a few ways to build these external time sources, but one of the most common is a crystal that oscillates at an exact, reliable, known frequency when an electric current is applied to it.
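To make that concrete, here's a toy version of the same calibration idea in Python: count how many operations fit into a measured slice of wall-clock time, and call that the rate. (Real CPUs do this in hardware and firmware, not in Python, so treat this purely as an illustration.)

```python
import time

# Do a fixed number of trivial "calculations" and time them against
# an external clock (here, the OS's monotonic clock, which ultimately
# ticks off a hardware oscillator).
N = 5_000_000

start = time.monotonic()
total = 0
for i in range(N):
    total += i  # stand-in for one "calculation"
elapsed = time.monotonic() - start

rate = N / elapsed
print(f"Roughly {rate:,.0f} calculations per second")
```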
When the CPU works on calculations for, say, a Firefox tab, it increments a counter after each calculation it does. Those counts are then compared against the rate established earlier to figure out how much time was spent on those calculations. That way, for a given time period, it can determine what percentage of the available CPU calculation time went to that Firefox tab. When a monitoring tool says a process is using 50% of the CPU, what it really means is that the process consumed 50% of the CPU's time, since the number of calculations a CPU can do over a set period is the limited resource we're computing with. To put some numbers to this, here's an example:
Let's say a CPU powers on and, using the rate-establishing method described above, figures out that it can do 5,000,000 calculations per second (which is actually really slow by today's standards). After it boots and a Firefox tab is loaded, a piece of monitoring software grabs the current count of how many calculations the CPU has spent on that tab once every second. Between the first and second data points, it sees an increase of 1,000,000 on the counter. Since 1,000,000 is 20% of the 5,000,000 calculations the CPU can do in one second, the monitoring software can report that the Firefox tab is consuming 20% of the CPU's time.
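Written out as code, using the rate and one-second sampling interval from the example above:

```python
# Values from the example above.
calcs_per_second = 5_000_000  # rate established at power-on
counter_delta = 1_000_000     # increase in the tab's counter
interval = 1.0                # seconds between the two readings

# Share of the CPU's available calculation time the tab consumed.
cpu_percent = counter_delta / (calcs_per_second * interval) * 100
print(f"Firefox tab used {cpu_percent:.0f}% of the CPU")  # -> 20%
```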
Astute readers will notice the sheer number of assumptions that go into these equations. For example, we assume the CPU hasn't changed the number of calculations it can do since it powered on, and that the clock signal it receives oscillates at the same frequency all the time without drifting. These things can and do sometimes go wrong and cause problems. There have been lots of technical advancements aimed at solving them; I definitely don't know all of them.
One particular issue I had the displeasure of troubleshooting at a workplace many years ago was figuring out why some computers would crash, then crash the very same way after being reset, but would stop crashing after being fully powered off and on again. These were Ubuntu Linux servers, and we recorded their kernel logs at the time of crash to gather diagnostic data. What we realized was that the CPU would sometimes establish its calculation rate incorrectly, and it would remember that bad rate until it was powered off completely.
I might have gotten a few things wrong here that some Comp Sci friends will help me out with. I don't claim to know much; I can only share what I've learned through a few years of experience running computers.
Hope this helps!