data-oriented and data-driven programming.


I’ve got a good idea of object-oriented, but explanations of the other two just confuse me.


2 Answers

Anonymous 0 Comments

Data-oriented and data-driven programming are two pretty distinct concepts, but not exclusive of each other.

I’ll explain using a game as an example (because that’s what I’m familiar with).

Data-driven means that the functionality of the program is driven by what data is fed into it.

Here’s an example. One way to make the menus for your game is to just write code to build them: code that places all the buttons, and then code that decides what each button does. A data-driven way to do it would be to have the menus designed in an editor (or even some HTML-like text format) and loaded into the game for the code to interpret. Rather than buttons being hardcoded to do particular things, you assign actions to buttons in the editor.
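To make that concrete, here’s a minimal sketch of the idea. The format, the `Button` fields, and action names like `start_game` are all made up for illustration; the point is that the code only knows how to *interpret* actions, while the data decides which button triggers which one.

```cpp
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// A button as loaded from data: a label and the name of an action.
struct Button {
    std::string label;
    std::string action;  // looked up in the action table at runtime
};

// Parse a toy "label=action" text format (a stand-in for a real
// menu file produced by an editor).
std::vector<Button> parse_menu(const std::string& text) {
    std::vector<Button> buttons;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        auto eq = line.find('=');
        if (eq == std::string::npos) continue;
        buttons.push_back({line.substr(0, eq), line.substr(eq + 1)});
    }
    return buttons;
}

// Pressing a button just looks its action up by name; nothing is
// hardcoded to a particular button.
int press(const Button& b,
          const std::map<std::string, std::function<int()>>& actions) {
    auto it = actions.find(b.action);
    return it != actions.end() ? it->second() : -1;
}
```

Changing the menu layout or rebinding a button to a different action is now a data change, not a code change.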

Data-oriented programming is a more technical concept. The basic idea is to organize your code around how data is laid out in memory so that it runs faster. You need to know a bit about how RAM and CPU caches work to understand why you would do this. Reading from RAM is slow compared to reading from the CPU cache. When you access some data, the CPU automatically copies it from RAM into the cache in fixed-size chunks (cache lines), and it assumes the next bit of data you need will come directly after the bit you just used. When that assumption is wrong you get a “cache miss”, meaning the CPU has to go off and read from RAM again. So you want to avoid that, and as much as possible make sure the next bit of data you need sits right after the bit you’ve just used.

A simple example of data-oriented thinking is to use contiguous arrays instead of arrays of pointers. Say you have a list of things that you want to update. The typical object-oriented way to do that is to create separate objects and keep an array of pointers to them. That’s not very cache efficient, because each object could live in a completely different place in memory. A data-oriented approach is to use an array where the objects sit next to each other in memory. That means when you go through the list to update each one, the next object is likely already in the cache.
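Here’s a small sketch of the two layouts side by side; the `Thing` type and its update logic are invented for the example.

```cpp
#include <memory>
#include <vector>

// A toy object with some state to update.
struct Thing { float x = 0, dx = 1; };

// Object-oriented style: an array of pointers. Each Thing can live
// anywhere on the heap, so iterating hops around memory and risks a
// cache miss per element.
void update_scattered(std::vector<std::unique_ptr<Thing>>& things) {
    for (auto& t : things) t->x += t->dx;
}

// Data-oriented style: one contiguous array. Each Thing sits directly
// after the previous one, so the prefetcher keeps the cache warm.
void update_contiguous(std::vector<Thing>& things) {
    for (auto& t : things) t.x += t.dx;
}
```

Both functions compute the same thing; only the memory layout differs, and that layout is what the cache (and prefetcher) care about.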

Sorry if the data-oriented explanation is a bit difficult to understand. It’s a fairly advanced concept.

Anonymous 0 Comments

Structures model data; classes model (often stateful) behavior. OOP is not about smart data. If you have a bunch of getters and setters (accessors and mutators), you don’t have any encapsulation: changing a member changes its accessors and mutators, which is a change to the interface. That forces a recompilation of every source file dependent on that header, and a change to all the code that depends on the former interface, which may cascade into further changes. This isn’t OOP.

The OOP you learn in school isn’t OOP at all – those little single page programs you write are designed to illustrate one single concept at a time, they’re not meant to teach you anything about good or bad programming. And people since the 90s have extrapolated that OOP means putting all your data into classes and your instances into containers, all the while you have “smart data”.

Your OOP classes tend to model reality. If you have a car, it’s got a year, make, model, color, trim, wheel and tire sizes, etc… I ask you, does it actually make sense to put all of this unlike data together in one place?

What you are left with is a singular container of some odd size and odd alignment. If you’re not careful, you’ll have padding, both between members and between instances, if you keep them in a contiguous container. Every instance is an island of one. And look at your cache lines when you’re iterating over a range of cars: they’re populated with all this data that you don’t actually want, most of the time. Smart data is horribly cache inefficient. Let’s say you want to tally the car colors across all your instances; well, you’re dragging along all that make, model, year, and tire-size data you don’t want and aren’t interested in!

Instead, how about if you thought about the DATA first and only. What would a car look like? Well, here’s a good start:

```cpp
#include <string>
#include <vector>

struct car {
    std::vector<std::string> make, model, trim;
    std::vector<int> year, rgb_color, tire_size, wheel_size;
};
```

And so what is a single instance of a car? An index, `i`. A car instance is any one index across all these “parallel arrays”. So now if you want to iterate over all your cars and tally up the colors, you have all your color data in isolation: your cache lines contain nothing but car colors. And you can write functional-style code that doesn’t care WHAT a car “looks like”; you can abstract that concept away, because the algorithm is pure and only cares about color, be it from a single car, a container of cars, or some other structure that has color data in it.
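A sketch of that tally against the parallel-array `car` struct above (the `tally_colors` helper name is made up here):

```cpp
#include <map>
#include <string>
#include <vector>

// The parallel-array ("structure of arrays") car from above.
struct car {
    std::vector<std::string> make, model, trim;
    std::vector<int> year, rgb_color, tire_size, wheel_size;
};

// A pure algorithm: it doesn't know what a "car" is, only that it was
// handed a range of color values. While it runs, cache lines hold
// nothing but colors.
std::map<int, int> tally_colors(const std::vector<int>& colors) {
    std::map<int, int> counts;
    for (int c : colors) ++counts[c];
    return counts;
}
```

Because `tally_colors` only takes a `std::vector<int>`, the same function works for one fleet, a filtered subset, or any other structure that happens to have color data.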

It does force you to really reconsider what you think you know about OOP. The data comes first. So where does that leave classes? They end up looking a whole lot smaller. Often you’ll overload `operator()`, and they’ll only have a member or two that require no getter or setter. All that matters is the parameter passed in and the return value. They’re small, they’re cheap, and if you need to purge or restart, you’ll just make another instance. You’ll still use classes for polymorphism, say when you want a sorter of car colors but want to abstract over the algorithm. All of a sudden it makes sense how compact a class is as some unit of encapsulation.
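A hedged sketch of what such a small class might look like: a function object that sorts car *indices* by color, knowing nothing else about a car. The names follow the parallel-array layout above but are otherwise invented.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// A tiny class: one member, one overloaded operator(), no getters or
// setters. All that matters is the parameters in and the bool out.
struct by_color {
    const std::vector<int>& rgb_color;  // the only state it needs
    bool operator()(std::size_t a, std::size_t b) const {
        return rgb_color[a] < rgb_color[b];
    }
};

// Sort indices, not the (many) parallel arrays themselves; an index IS
// a car instance in this layout.
std::vector<std::size_t> sorted_by_color(const std::vector<int>& colors) {
    std::vector<std::size_t> idx(colors.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(), by_color{colors});
    return idx;
}
```

Swapping in a different ordering is just another tiny comparator class; the sorting algorithm itself stays abstracted behind `std::sort`.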

And consider that your target processor is very likely a batch processor at heart. Putting one algorithm and one dataset into the instruction and data caches, respectively, is where the performance comes from. Your processor will prefetch the next cache line of pure data, and ideally you’ll have already sorted your data set so there’s little to no branching, or at least the data is in favor of branch prediction. Sorting your data before the heavy processing stage is another DOD principle; at the very least, the only data in the set should be the data you actually need, so discriminate early. When you do this, your algorithms can be TERRIBLE on paper, but perhaps unbeatably fast in practice. I tell you, using DOD techniques: I was in high-speed trading for ~7 years, and in that time I wrote a linear search that was untouchably fast, WAY faster than binary search.
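One small illustration of the “dumb but fast” idea (this is a generic sketch, not the trading code described above): once the data you care about sits in one tight, sorted, contiguous array, even a plain linear pass can be extremely fast, because every cache line is pure payload and there are no data-dependent branches to mispredict.

```cpp
#include <cstddef>
#include <vector>

// Count how many values fall below a key by scanning linearly.
// O(n) on paper versus O(log n) for binary search, but the loop body
// is a single compare-and-add: no branches to mispredict, and the
// prefetcher streams the contiguous data straight into the cache.
std::size_t count_below(const std::vector<int>& sorted_vals, int key) {
    std::size_t n = 0;
    for (int v : sorted_vals)
        n += (v < key);  // bool promotes to 0 or 1; no branch taken
    return n;
}
```

For small-to-moderate arrays this kind of scan often beats the pointer-chasing, branch-heavy access pattern of a binary search, which is the spirit of the claim above.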