mlx.optimizers.Muon

class Muon(learning_rate: float | Callable[[array], array], momentum: float = 0.95, weight_decay: float = 0.01, nesterov: bool = True, ns_steps: int = 5)

The Muon optimizer.

The Muon (MomentUm Orthogonalized by Newton-Schulz) optimizer follows the original implementation: Muon: An optimizer for hidden layers in neural networks

Note

  • Muon may be sub-optimal for the embedding layer, the final fully connected layer, or any 0D/1D parameters. Those should be optimized by a different method (e.g., AdamW).

  • For 4D convolutional filters, Muon works by flattening their last dimensions, treating each filter as a 2D matrix.
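The orthogonalization that ns_steps controls can be sketched in plain NumPy. This is an illustrative sketch of the quintic Newton-Schulz iteration (with the coefficients used in the original Muon implementation), not MLX's internal code:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G to U V^T (its singular values driven toward 1).

    Illustrative sketch of the quintic Newton-Schulz iteration with the
    coefficients from the original Muon implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.astype(np.float64)
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the "wide" orientation
        X = X.T
    # Frobenius-normalize so all singular values start in [0, 1]
    X = X / (np.linalg.norm(X) + eps)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

G = np.random.default_rng(0).standard_normal((4, 6))
O = newton_schulz_orthogonalize(G, steps=5)
# After a few steps the singular values of O cluster around 1.
```

Note that the iteration does not converge exactly to U V^T; a handful of steps is enough because each singular value only needs to land near 1, not exactly on it.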

Parameters:
  • learning_rate (float or callable) -- The learning rate.

  • momentum (float, optional) -- The momentum strength. Default: 0.95

  • weight_decay (float, optional) -- The weight decay (L2 penalty). Default: 0.01

  • nesterov (bool, optional) -- Enables Nesterov momentum. Recommended for better performance. Default: True

  • ns_steps (int, optional) -- Number of Newton-Schulz iteration steps for orthogonalization. Default: 5

Methods

__init__(learning_rate[, momentum, ...])

apply_single(gradient, parameter, state)

Performs the Muon parameter update.

init_single(parameter, state)

Initialize optimizer state.