How are computers motivated in Q Learning?


I get that it uses a number going up and down depending on the behavior, but why does the computer/AI try to make the number go up?

In: Technology

The goal is to maximize the reward. It’s explicitly told that higher is better.

Nearly all machine learning is centered around maximizing or minimizing some target.

Computers don’t want, or care about, anything until we give them a program that tells them what to want/care about. (I’m anthropomorphizing computers here, obviously, but you get the point.)

In machine learning, we give the computer a “value function” or “reward function” that tells it how good a job it’s doing, and then we tell it that its goal in life is to maximize that function. Once we tell it that, it goes after it with everything it has. That’s simply how programming works: if you write code that has the machine come up with some options, work out the reward for each option, and then choose the one with the highest reward, then (assuming the computer is working correctly) that’s exactly what it will do.
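To make that concrete, here’s a toy sketch (names and the reward function are made up for illustration): the “motivation” is nothing more than scoring each option with a function we wrote and picking the highest score.

```python
# Hypothetical sketch: the agent's whole "motivation" is to evaluate
# each option with a reward function we supply, then pick the best one.

def reward(option):
    # Toy reward function we (the programmers) define:
    # higher is better, and options closer to 10 score higher.
    return -abs(option - 10)

def choose(options):
    # Evaluate every option, return the one with the highest reward.
    return max(options, key=reward)

print(choose([3, 8, 13, 25]))  # prints 8, the option closest to 10
```

Change the reward function and the same code “wants” something completely different, which is the point: the goal lives entirely in the function we hand it.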

In some sense the number goes up because that’s all the AI knows to do.

When training the AI, many slightly different versions are created, and the versions that perform worse are removed (because we choose to remove them). This leads to the surviving versions being better at optimising for the reward, because if they weren’t, they wouldn’t have survived.

So it’s not really motivated the way people are; it just goes through steps depending on the input, and the reward function is a measure of how close it gets to the desired output.

There’s an evolution-like algorithm where a program keeps only the numbers that produced better results. In later rounds, the numbers are increased or decreased starting not from some random value but from these “generations”, and then the elimination of the worst results happens again… Of course this is not the only algorithm; it’s just an example.
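The keep-the-best-and-mutate loop described above can be sketched like this (the fitness function, population size, and mutation size are all made-up toy choices, not a real training setup):

```python
import random

# Hypothetical sketch of the evolution-like loop: score candidates,
# eliminate the worse half, refill with slightly mutated survivors.

def fitness(x):
    # Toy reward: higher is better, peaks at x = 42.
    return -abs(x - 42)

def evolve(generations=50, pop_size=20, seed=0):
    rng = random.Random(seed)
    # Start from random guesses.
    population = [rng.uniform(0, 100) for _ in range(pop_size)]
    for _ in range(generations):
        # Eliminate the worse-performing half...
        survivors = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        # ...and refill the population with slightly changed copies.
        children = [s + rng.gauss(0, 1) for s in survivors]
        population = survivors + children
    return max(population, key=fitness)

print(evolve())  # ends up near 42, the peak of the toy reward
```

Nothing here “wants” 42; the survivors are just whatever the elimination step didn’t delete.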

In Q-learning, the Q value is the total value of a state-action pair. For example, say there’s a fork in the sidewalk. If I go left, I’m immediately rewarded with $5. If I go right, I immediately lose $10; however, if I continue down the right path another 100 ft, I’m rewarded with an additional $100. This means the Q value for going left is $5, and the Q value for going right is $90. Even though I lose money at the beginning, the Q value is still higher going right, because the total reward is $100 minus $10. We calculate the Q value as the immediate reward plus all future rewards.
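The fork example works out like this in code (a minimal sketch: real Q-learning usually discounts future rewards, but this example sums them undiscounted, as the text does):

```python
# Q value of a path = immediate reward plus all future rewards on it.

def q_value(rewards):
    # Undiscounted total return for one sequence of rewards.
    return sum(rewards)

q_left = q_value([5])          # gain $5, path ends
q_right = q_value([-10, 100])  # lose $10 now, gain $100 later

print(q_left, q_right)  # 5 90
print("right" if q_right > q_left else "left")  # right
```

This is why the agent goes right even though the first step looks bad: the comparison is between totals, not first rewards.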