eli5: How do you aggregate percentiles?


Say you have 1 million websites. Here is a percentile chart of website CO2 emissions:

How would you calculate the aggregated emissions of 1 million websites from this data?


One quick way to estimate would be to just take 1 million times the median amount, 0.69 grams. This gives us 690,000 g. That assumes that the lowest value and highest value are similarly distant from the middle value, but it gets a reasonable estimate quickly. It works better the more symmetric the distribution is (since 2.7 is a lot further from 0.69 than 0.15 is, we know this estimate is low).

In this case a better way would be to split the percentiles into ranges. The first bucket would be everything from the left tail to the midpoint between the 10th and 25th percentiles, the next bucket goes from that midpoint to the midpoint between the next two percentiles, and so on. Each bucket is then counted at the value of the percentile it contains. That’s not perfect: you would basically be making estimate bars from a smooth curve, kind of like using the blue bars to estimate the total space under the black line in this [chart](https://thydzik.com/images/scatter-chart-as-histogram-with-normal-curve-corrected.png), but it’s better than our median value.

175,000 times the 10th percentile amount, the lowest value on the chart (this covers the minimum value up to the 17.5th percentile, the midpoint between the 10th and 25th percentiles)
plus 200,000 times the 25th percentile amount (17.5th to 37.5th)
plus 250,000 times the 50th percentile amount (37.5th to 62.5th)
plus 200,000 times the 75th percentile amount (62.5th to 82.5th)
plus 175,000 times the 90th percentile amount (82.5th to the max)
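
If you want to check the arithmetic, here is a rough Python sketch of the same bucket idea. The 0.15, 0.69 and 2.7 values are the ones mentioned above (taking 0.15 and 2.7 as the 10th and 90th percentiles); the 0.35 and 1.4 for the 25th and 75th percentiles are made-up placeholders, because those chart values are not quoted here. The function name `bucket_estimate` is mine, not from any library.

```python
def bucket_estimate(percentiles, n_sites):
    """Estimate a total from {percentile: value} using midpoint buckets."""
    points = sorted(percentiles.items())
    ranks = [rank for rank, _ in points]
    # Bucket edges: 0, the midpoints between neighbouring percentiles, 100.
    edges = [0.0] + [(a + b) / 2 for a, b in zip(ranks, ranks[1:])] + [100.0]
    total = 0.0
    for (_, value), low, high in zip(points, edges, edges[1:]):
        share = (high - low) / 100  # fraction of all sites in this bucket
        total += share * n_sites * value
    return total


# 0.15, 0.69 and 2.7 come from the discussion above.
# 0.35 and 1.4 are HYPOTHETICAL placeholders, not values read off the chart.
chart = {10: 0.15, 25: 0.35, 50: 0.69, 75: 1.4, 90: 2.7}
n_sites = 1_000_000

median_estimate = n_sites * chart[50]
bucket_total = bucket_estimate(chart, n_sites)

print(f"median estimate: {median_estimate:,.0f} g")  # median estimate: 690,000 g
print(f"bucket estimate: {bucket_total:,.0f} g")     # bucket estimate: 1,021,250 g
```

With the real 25th and 75th percentile values plugged in, the bucket total will differ from the number printed here.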

Adding all that up gives a bit over 1,000,000 g, a good bit higher than our median estimate. Given that distribution, it’s likely that having more data at the top end would allow a better estimate (sort of like how having Jeff Bezos in wealth data meaningfully impacts the average for the whole population).
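
To see why the top end matters so much, here is a tiny Python illustration with made-up numbers (they are not website data): adding one extreme value barely moves the median but drags the average, and so the total, way up.

```python
from statistics import mean, median

# Made-up values that all sit close to the middle.
typical = [0.5, 0.6, 0.7, 0.8, 0.9]
# The same values plus one extreme value at the top end.
with_outlier = typical + [50.0]

print("typical:     ", median(typical), mean(typical))
print("with outlier:", median(with_outlier), mean(with_outlier))
```

The median only shifts a little, while the mean ends up more than ten times larger, which is why a percentile chart that stops at the 90th percentile can hide a lot of the total.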