AI Atlas
Intermediate · ~2 min read · #reinforcement-learning #rl #reward

Reinforcement Learning

RL · Learning by reward

An agent learns a behavior policy that maximizes long-term cumulative reward by trying actions in an environment and receiving feedback.

[Diagram: the reinforcement learning loop. The agent's policy π(s) → a sends an action to the environment, which returns a reward r and a new state s′. Trial-and-error; maximize total reward over time.]
Definition

In reinforcement learning there is an agent, an environment, and a reward signal. The agent takes an action in the environment, the environment's state changes, and the agent receives a reward. The goal isn't a single good choice — it's learning a policy that maximizes total reward over a long horizon.
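
In code, that loop is just a few lines. A minimal sketch using Gymnasium's CartPole environment, with a random policy standing in for a learned one:

import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)
total_reward, done = 0.0, False

while not done:
    action = env.action_space.sample()  # random stand-in for a learned policy π(s) → a
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward              # the cumulative reward the agent tries to maximize
    done = terminated or truncated

env.close()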

The crucial difference: nobody hands the agent the right answer. It learns by trying, sometimes making bad calls, sometimes discovering counter-intuitive moves that pay off later. This exploration vs exploitation trade-off defines RL's character.

Combined with deep neural networks, RL becomes deep RL, the technique behind AlphaGo, OpenAI's Dota agents, simulation-trained driving systems, and robotic arms. Even RLHF, the human-feedback fine-tuning of large language models, is a flavor of RL.

Analogy

Like a baby learning to walk. The first step ends in a fall; propping against a chair gets a couple steps further. Nobody hands the baby a rulebook on muscle angles. Falling is a small punishment, walking is a small reward; over months, with that signal alone, the baby finds balance. RL works on exactly that signal.

Real-world example

A logistics company has 10,000 couriers and 5,000 active orders to distribute. The problem is too dynamic for hand-written rules: traffic, weather, courier fatigue, restaurant pace, customer patience all shift by the second.

An RL agent plays through thousands of simulated days. It assigns "courier 47 to order 312" and the environment returns a number: customer happy, route short → +8 reward. Wrong assignment, courier stuck in traffic, customer cancels → −5. After hundreds of thousands of simulations the agent has intuitions no engineer could code by hand. Deployed to real ops, it outperforms the rule-based system by a wide margin.
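
Those numbers come from a reward function someone had to design. A toy sketch of what a courier-assignment reward might look like; the features and weights here are invented for illustration, with only the +8 and −5 taken from the scenario above:

def assignment_reward(delivered_on_time: bool, cancelled: bool,
                      minutes_late: float, route_km: float) -> float:
    """Toy reward for one courier-order assignment (invented weights)."""
    if cancelled:
        return -5.0                  # customer cancels: strong punishment
    reward = 8.0 if delivered_on_time else 0.0
    reward -= 0.5 * minutes_late     # small penalty per minute of delay
    reward -= 0.1 * route_km         # prefer shorter routes
    return reward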

Code examples
Q-learning · classic RL loop (Python)
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n

# Q-table: state × action → expected total reward
Q = np.zeros((n_states, n_actions))

alpha = 0.1     # learning rate
gamma = 0.95    # discount factor: how much future rewards count
epsilon = 0.1   # exploration rate

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Explore vs exploit
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Bellman update
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
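
After training, the learned policy is just a greedy read of the Q-table. A short evaluation rollout on the same environment:

state, _ = env.reset()
done, total = False, 0.0
while not done:
    action = int(np.argmax(Q[state]))   # pure exploitation, no epsilon
    state, reward, terminated, truncated, _ = env.step(action)
    total += reward
    done = terminated or truncated
print(f"greedy episode return: {total}")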
When to use
  • Action sequences with long-term consequences (games, robotics, ops)
  • No labels, but a clear good/bad signal can be defined
  • The environment can be simulated — RL thrives on data volume
  • Classic optimization is intractable: dynamic, multi-variable
When not to use
  • One-off decisions — supervised learning is enough
  • No simulator exists and real-world trials are expensive (e.g., medical decisions)
  • You can't define a reward function
  • Interpretability is critical — RL policies are often opaque
Common pitfalls

Reward hacking

The agent maximizes the reward you wrote down, not your real intent. Reward 'lift the box' and a robot may learn to lift and drop it indefinitely, collecting the lift reward over and over. Reward design is often the hardest engineering problem in RL.
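
A toy illustration of the gap between reward and intent; the environment and thresholds here are made up:

# Naive reward: pays every time the box rises, so lift-and-drop farms reward forever
def naive_reward(height: float, prev_height: float) -> float:
    return 1.0 if height > prev_height else 0.0

# Closer to intent: pay only for holding the box above a target for a while
def held_reward(height: float, seconds_held: float,
                target: float = 1.0, min_hold: float = 3.0) -> float:
    return 1.0 if height >= target and seconds_held >= min_hold else 0.0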

Exploration vs exploitation imbalance

Too much exploration → slow learning. Too much exploitation → stuck in local optima. Epsilon-greedy action selection, decaying-epsilon schedules, and entropy bonuses balance the two.
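
A common middle ground is a decaying-epsilon schedule; a minimal sketch, with the rates chosen for illustration:

def epsilon_for(episode: int, start: float = 1.0,
                floor: float = 0.05, decay: float = 0.999) -> float:
    """Explore heavily early on, then shift toward exploitation."""
    return max(floor, start * decay ** episode)

# e.g. in the Q-learning loop above: epsilon = epsilon_for(episode)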

Sample inefficiency

RL needs millions of attempts. Atari agents routinely consume hundreds of millions of frames before they play well, a data appetite the real world can't feed. Hence the importance of simulation and sim-to-real transfer.
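
When a simulator exists, one mitigation is to run many copies of it in parallel. A minimal sketch using Gymnasium's vectorized API:

import gymnasium as gym

# Eight lock-step copies of the environment, one batch of experience per step
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(8)]
)
states, _ = envs.reset(seed=0)
actions = envs.action_space.sample()     # one action per copy
states, rewards, terminated, truncated, _ = envs.step(actions)
envs.close()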