Mirror of https://github.com/LouisShark/chatgpt_system_prompt.git (synced 2025-07-06 14:50:31 -04:00)

MLX Guru - files

This commit is contained in: parent bd0099ad1f, commit f96f19d404
8 changed files with 1579 additions and 0 deletions

23  prompts/gpts/knowledge/MLX Guru/functions.txt  Normal file
@@ -0,0 +1,23 @@

.. _nn_functions:

.. currentmodule:: mlx.nn

Functions
---------

Layers without parameters (e.g. activation functions) are also provided as
simple functions.

.. autosummary::
   :toctree: _autosummary_functions
   :template: nn-module-template.rst

   gelu
   gelu_approx
   gelu_fast_approx
   mish
   prelu
   relu
   selu
   silu
   step

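For instance, a minimal sketch of calling an activation either as a module or as
one of these functions (the input values are only illustrative):

.. code-block:: python

   import mlx.core as mx
   import mlx.nn as nn

   x = mx.array([-1.0, 0.0, 2.0])

   # Module form, e.g. for use inside a model definition
   relu_layer = nn.ReLU()
   y_module = relu_layer(x)

   # Equivalent function form, no module object required
   y_fn = nn.relu(x)
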
45  prompts/gpts/knowledge/MLX Guru/init.txt  Normal file
@@ -0,0 +1,45 @@

.. _init:

.. currentmodule:: mlx.nn.init

Initializers
------------

The ``mlx.nn.init`` package contains commonly used initializers for neural
network parameters. Initializers return a function which can be applied to any
input :obj:`mlx.core.array` to produce an initialized output.

For example:

.. code:: python

   import mlx.core as mx
   import mlx.nn as nn

   init_fn = nn.init.uniform()

   # Produces a [2, 2] uniform matrix
   param = init_fn(mx.zeros((2, 2)))

To re-initialize all the parameters in an :obj:`mlx.nn.Module` from, say, a
uniform distribution, you can do:

.. code:: python

   import mlx.nn as nn
   model = nn.Sequential(nn.Linear(5, 10), nn.ReLU(), nn.Linear(10, 5))
   init_fn = nn.init.uniform(low=-0.1, high=0.1)
   model.apply(init_fn)

.. autosummary::
   :toctree: _autosummary

   constant
   normal
   uniform
   identity
   glorot_normal
   glorot_uniform
   he_normal
   he_uniform

37  prompts/gpts/knowledge/MLX Guru/layers.txt  Normal file
@@ -0,0 +1,37 @@

.. _layers:

.. currentmodule:: mlx.nn

Layers
------

.. autosummary::
   :toctree: _autosummary
   :template: nn-module-template.rst

   ALiBi
   BatchNorm
   Conv1d
   Conv2d
   Dropout
   Dropout2d
   Dropout3d
   Embedding
   GELU
   GroupNorm
   InstanceNorm
   LayerNorm
   Linear
   Mish
   MultiHeadAttention
   PReLU
   QuantizedLinear
   RMSNorm
   ReLU
   RoPE
   SELU
   Sequential
   SiLU
   SinusoidalPositionalEncoding
   Step
   Transformer

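As a minimal sketch, a few of the layers above can be composed with
:class:`Sequential` (the sizes here are only illustrative):

.. code-block:: python

   import mlx.core as mx
   import mlx.nn as nn

   model = nn.Sequential(
       nn.Linear(16, 32),
       nn.ReLU(),
       nn.Dropout(0.1),
       nn.Linear(32, 4),
   )

   x = mx.random.uniform(shape=(8, 16))
   y = model(x)  # shape (8, 4)
   mx.eval(y)
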
24  prompts/gpts/knowledge/MLX Guru/losses.txt  Normal file
@@ -0,0 +1,24 @@

.. _losses:

.. currentmodule:: mlx.nn.losses

Loss Functions
--------------

.. autosummary::
   :toctree: _autosummary_functions
   :template: nn-module-template.rst

   binary_cross_entropy
   cosine_similarity_loss
   cross_entropy
   gaussian_nll_loss
   hinge_loss
   huber_loss
   kl_div_loss
   l1_loss
   log_cosh_loss
   mse_loss
   nll_loss
   smooth_l1_loss
   triplet_loss

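A minimal sketch of calling two of the losses above (the inputs and the explicit
``reduction`` argument are only illustrative):

.. code-block:: python

   import mlx.core as mx
   import mlx.nn as nn

   logits = mx.array([[2.0, -1.0, 0.5]])
   targets = mx.array([0])

   # Cross entropy over class logits
   ce = nn.losses.cross_entropy(logits, targets, reduction="mean")

   # Mean squared error between predictions and targets
   mse = nn.losses.mse_loss(mx.array([1.0, 2.0]), mx.array([1.5, 1.5]), reduction="mean")
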
36  prompts/gpts/knowledge/MLX Guru/module.txt  Normal file
@@ -0,0 +1,36 @@

Module
======

.. currentmodule:: mlx.nn

.. autoclass:: Module

.. rubric:: Attributes

.. autosummary::
   :toctree: _autosummary

   Module.training

.. rubric:: Methods

.. autosummary::
   :toctree: _autosummary

   Module.apply
   Module.apply_to_modules
   Module.children
   Module.eval
   Module.filter_and_map
   Module.freeze
   Module.leaf_modules
   Module.load_weights
   Module.modules
   Module.named_modules
   Module.parameters
   Module.save_weights
   Module.train
   Module.trainable_parameters
   Module.unfreeze
   Module.update
   Module.update_modules

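A minimal sketch of a few of these methods in use (the file name is only
illustrative):

.. code-block:: python

   import mlx.nn as nn

   model = nn.Linear(4, 2)

   # Freeze all parameters, e.g. before finetuning only part of a model
   model.freeze()
   print(model.trainable_parameters())  # typically empty while frozen
   model.unfreeze()

   # Persist and restore the parameters
   model.save_weights("linear.npz")
   model.load_weights("linear.npz")
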
183  prompts/gpts/knowledge/MLX Guru/nn.txt  Normal file
@@ -0,0 +1,183 @@

.. _nn:

.. currentmodule:: mlx.nn

Neural Networks
===============

Writing arbitrarily complex neural networks in MLX can be done using only
:class:`mlx.core.array` and :meth:`mlx.core.value_and_grad`. However, this requires the
user to write again and again the same simple neural network operations as well
as handle all the parameter state and initialization manually and explicitly.

The module :mod:`mlx.nn` solves this problem by providing an intuitive way of
composing neural network layers, initializing their parameters, freezing them
for finetuning and more.

Quick Start with Neural Networks
---------------------------------

.. code-block:: python

   import mlx.core as mx
   import mlx.nn as nn

   class MLP(nn.Module):
       def __init__(self, in_dims: int, out_dims: int):
           super().__init__()

           self.layers = [
               nn.Linear(in_dims, 128),
               nn.Linear(128, 128),
               nn.Linear(128, out_dims),
           ]

       def __call__(self, x):
           for i, l in enumerate(self.layers):
               x = mx.maximum(x, 0) if i > 0 else x
               x = l(x)
           return x

   # The model is created with all its parameters but nothing is initialized
   # yet because MLX is lazily evaluated
   mlp = MLP(2, 10)

   # We can access its parameters by calling mlp.parameters()
   params = mlp.parameters()
   print(params["layers"][0]["weight"].shape)

   # Printing a parameter will cause it to be evaluated and thus initialized
   print(params["layers"][0])

   # We can also force evaluate all parameters to initialize the model
   mx.eval(mlp.parameters())

   # A simple loss function.
   # NOTE: It doesn't matter how it uses the mlp model. It currently captures
   # it from the local scope. It could be a positional argument or a
   # keyword argument.
   def l2_loss(x, y):
       y_hat = mlp(x)
       return (y_hat - y).square().mean()

   # Calling `nn.value_and_grad` instead of `mx.value_and_grad` returns the
   # gradient with respect to `mlp.trainable_parameters()`
   loss_and_grad = nn.value_and_grad(mlp, l2_loss)

.. _module_class:

The Module Class
----------------

The workhorse of any neural network library is the :class:`Module` class. In
MLX the :class:`Module` class is a container of :class:`mlx.core.array` or
:class:`Module` instances. Its main function is to provide a way to
recursively **access** and **update** its parameters and those of its
submodules.

Parameters
^^^^^^^^^^

A parameter of a module is any public member of type :class:`mlx.core.array` (its
name should not start with ``_``). It can be arbitrarily nested in other
:class:`Module` instances or lists and dictionaries.

:meth:`Module.parameters` can be used to extract a nested dictionary with all
the parameters of a module and its submodules.

A :class:`Module` can also keep track of "frozen" parameters. See the
:meth:`Module.freeze` method for more details. When using :meth:`mlx.nn.value_and_grad`,
the gradients returned will be with respect to these trainable parameters.

Updating the Parameters
^^^^^^^^^^^^^^^^^^^^^^^

MLX modules allow accessing and updating individual parameters. However, most
times we need to update large subsets of a module's parameters. This action is
performed by :meth:`Module.update`.

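For example, a minimal sketch of pushing a whole parameter tree back into a
module with :meth:`Module.update` (the scaling is only illustrative):

.. code-block:: python

   import mlx.nn as nn
   from mlx.utils import tree_map

   model = nn.Linear(4, 4)

   # Build a structurally identical parameter tree and update the module with it
   new_params = tree_map(lambda p: 0.5 * p, model.parameters())
   model.update(new_params)
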
Inspecting Modules
^^^^^^^^^^^^^^^^^^

The simplest way to see the model architecture is to print it. Following along with
the above example, you can print the ``MLP`` with:

.. code-block:: python

   print(mlp)

This will display:

.. code-block:: shell

   MLP(
     (layers.0): Linear(input_dims=2, output_dims=128, bias=True)
     (layers.1): Linear(input_dims=128, output_dims=128, bias=True)
     (layers.2): Linear(input_dims=128, output_dims=10, bias=True)
   )

To get more detailed information on the arrays in a :class:`Module` you can use
:func:`mlx.utils.tree_map` on the parameters. For example, to see the shapes of
all the parameters in a :class:`Module` do:

.. code-block:: python

   from mlx.utils import tree_map
   shapes = tree_map(lambda p: p.shape, mlp.parameters())

As another example, you can count the number of parameters in a :class:`Module`
with:

.. code-block:: python

   from mlx.utils import tree_flatten
   num_params = sum(v.size for _, v in tree_flatten(mlp.parameters()))

Value and Grad
--------------

Using a :class:`Module` does not preclude using MLX's high order function
transformations (:meth:`mlx.core.value_and_grad`, :meth:`mlx.core.grad`, etc.). However,
these function transformations assume pure functions, namely the parameters
should be passed as an argument to the function being transformed.

There is an easy pattern to achieve that with MLX modules

.. code-block:: python

   model = ...

   def f(params, other_inputs):
       model.update(params)  # <---- Necessary to make the model use the passed parameters
       return model(other_inputs)

   f(model.trainable_parameters(), mx.zeros((10,)))

However, :meth:`mlx.nn.value_and_grad` provides precisely this pattern and only
computes the gradients with respect to the trainable parameters of the model.

In detail:

- it wraps the passed function with a function that calls :meth:`Module.update`
  to make sure the model is using the provided parameters.
- it calls :meth:`mlx.core.value_and_grad` to transform the function into a function
  that also computes the gradients with respect to the passed parameters.
- it wraps the returned function with a function that passes the trainable
  parameters as the first argument to the function returned by
  :meth:`mlx.core.value_and_grad`

.. autosummary::
   :toctree: _autosummary

   value_and_grad

.. toctree::

   nn/module
   nn/layers
   nn/functions
   nn/losses
   nn/init

587  prompts/gpts/knowledge/MLX Guru/python_api.txt  Normal file
@@ -0,0 +1,587 @@

.. _array:

Array
=====

.. currentmodule:: mlx.core

.. autosummary::
   :toctree: _autosummary

   array
   array.astype
   array.item
   array.tolist
   array.dtype
   array.ndim
   array.shape
   array.size
   Dtype
   array.abs
   array.all
   array.any
   array.argmax
   array.argmin
   array.cos
   array.dtype
   array.exp
   array.log
   array.log1p
   array.logsumexp
   array.max
   array.mean
   array.min
   array.prod
   array.reciprocal
   array.reshape
   array.round
   array.rsqrt
   array.sin
   array.split
   array.sqrt
   array.square
   array.sum
   array.transpose
   array.T
   array.var

.. _data_types:

:orphan:

Data Types
==========

.. currentmodule:: mlx.core

The default floating point type is ``float32`` and the default integer type is
``int32``. The table below shows supported values for :obj:`Dtype`.

.. list-table:: Supported Data Types
   :widths: 5 3 20
   :header-rows: 1

   * - Type
     - Bytes
     - Description
   * - ``bool_``
     - 1
     - Boolean (``True``, ``False``) data type
   * - ``uint8``
     - 1
     - 8-bit unsigned integer
   * - ``uint16``
     - 2
     - 16-bit unsigned integer
   * - ``uint32``
     - 4
     - 32-bit unsigned integer
   * - ``uint64``
     - 8
     - 64-bit unsigned integer
   * - ``int8``
     - 1
     - 8-bit signed integer
   * - ``int16``
     - 2
     - 16-bit signed integer
   * - ``int32``
     - 4
     - 32-bit signed integer
   * - ``int64``
     - 8
     - 64-bit signed integer
   * - ``float16``
     - 2
     - 16-bit float, only available with `ARM C language extensions <https://developer.arm.com/documentation/101028/0012/3--C-language-extensions?lang=en>`_
   * - ``float32``
     - 4
     - 32-bit float

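For example, a short sketch of the default types and an explicit cast:

.. code-block:: python

   import mlx.core as mx

   x = mx.array([1, 2, 3])        # int32 by default
   y = mx.array([1.0, 2.0, 3.0])  # float32 by default

   half = y.astype(mx.float16)    # explicit cast to a 16-bit float
   print(x.dtype, y.dtype, half.dtype)
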
.. _devices_and_streams:

Devices and Streams
===================

.. currentmodule:: mlx.core

.. autosummary::
   :toctree: _autosummary

   Device
   default_device
   set_default_device
   Stream
   default_stream
   new_stream
   set_default_stream

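A minimal sketch of the device API (the printed default device depends on your
machine):

.. code-block:: python

   import mlx.core as mx

   print(mx.default_device())

   # Run one operation on the CPU without changing the default device;
   # the ``stream`` argument accepts a stream or a device
   a = mx.random.uniform(shape=(64, 64))
   b = mx.add(a, a, stream=mx.cpu)

   # Or change the default device globally
   mx.set_default_device(mx.cpu)
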
.. _fft:

FFT
===

.. currentmodule:: mlx.core.fft

.. autosummary::
   :toctree: _autosummary

   fft
   ifft
   fft2
   ifft2
   fftn
   ifftn
   rfft
   irfft
   rfft2
   irfft2
   rfftn
   irfftn

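For instance, a small sketch of a real-input transform and its inverse:

.. code-block:: python

   import mlx.core as mx

   signal = mx.random.uniform(shape=(128,))

   spectrum = mx.fft.rfft(signal)     # complex coefficients
   recovered = mx.fft.irfft(spectrum)
   mx.eval(recovered)
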
.. _linalg:

Linear Algebra
==============

.. currentmodule:: mlx.core.linalg

.. autosummary::
   :toctree: _autosummary

   norm

.. _nn:

.. currentmodule:: mlx.nn

Neural Networks
===============

Writing arbitrarily complex neural networks in MLX can be done using only
:class:`mlx.core.array` and :meth:`mlx.core.value_and_grad`. However, this requires the
user to write again and again the same simple neural network operations as well
as handle all the parameter state and initialization manually and explicitly.

The module :mod:`mlx.nn` solves this problem by providing an intuitive way of
composing neural network layers, initializing their parameters, freezing them
for finetuning and more.

Quick Start with Neural Networks
---------------------------------

.. code-block:: python

   import mlx.core as mx
   import mlx.nn as nn

   class MLP(nn.Module):
       def __init__(self, in_dims: int, out_dims: int):
           super().__init__()

           self.layers = [
               nn.Linear(in_dims, 128),
               nn.Linear(128, 128),
               nn.Linear(128, out_dims),
           ]

       def __call__(self, x):
           for i, l in enumerate(self.layers):
               x = mx.maximum(x, 0) if i > 0 else x
               x = l(x)
           return x

   # The model is created with all its parameters but nothing is initialized
   # yet because MLX is lazily evaluated
   mlp = MLP(2, 10)

   # We can access its parameters by calling mlp.parameters()
   params = mlp.parameters()
   print(params["layers"][0]["weight"].shape)

   # Printing a parameter will cause it to be evaluated and thus initialized
   print(params["layers"][0])

   # We can also force evaluate all parameters to initialize the model
   mx.eval(mlp.parameters())

   # A simple loss function.
   # NOTE: It doesn't matter how it uses the mlp model. It currently captures
   # it from the local scope. It could be a positional argument or a
   # keyword argument.
   def l2_loss(x, y):
       y_hat = mlp(x)
       return (y_hat - y).square().mean()

   # Calling `nn.value_and_grad` instead of `mx.value_and_grad` returns the
   # gradient with respect to `mlp.trainable_parameters()`
   loss_and_grad = nn.value_and_grad(mlp, l2_loss)

.. _module_class:

The Module Class
----------------

The workhorse of any neural network library is the :class:`Module` class. In
MLX the :class:`Module` class is a container of :class:`mlx.core.array` or
:class:`Module` instances. Its main function is to provide a way to
recursively **access** and **update** its parameters and those of its
submodules.

Parameters
^^^^^^^^^^

A parameter of a module is any public member of type :class:`mlx.core.array` (its
name should not start with ``_``). It can be arbitrarily nested in other
:class:`Module` instances or lists and dictionaries.

:meth:`Module.parameters` can be used to extract a nested dictionary with all
the parameters of a module and its submodules.

A :class:`Module` can also keep track of "frozen" parameters. See the
:meth:`Module.freeze` method for more details. When using :meth:`mlx.nn.value_and_grad`,
the gradients returned will be with respect to these trainable parameters.

Updating the Parameters
^^^^^^^^^^^^^^^^^^^^^^^

MLX modules allow accessing and updating individual parameters. However, most
times we need to update large subsets of a module's parameters. This action is
performed by :meth:`Module.update`.

Inspecting Modules
^^^^^^^^^^^^^^^^^^

The simplest way to see the model architecture is to print it. Following along with
the above example, you can print the ``MLP`` with:

.. code-block:: python

   print(mlp)

This will display:

.. code-block:: shell

   MLP(
     (layers.0): Linear(input_dims=2, output_dims=128, bias=True)
     (layers.1): Linear(input_dims=128, output_dims=128, bias=True)
     (layers.2): Linear(input_dims=128, output_dims=10, bias=True)
   )

To get more detailed information on the arrays in a :class:`Module` you can use
:func:`mlx.utils.tree_map` on the parameters. For example, to see the shapes of
all the parameters in a :class:`Module` do:

.. code-block:: python

   from mlx.utils import tree_map
   shapes = tree_map(lambda p: p.shape, mlp.parameters())

As another example, you can count the number of parameters in a :class:`Module`
with:

.. code-block:: python

   from mlx.utils import tree_flatten
   num_params = sum(v.size for _, v in tree_flatten(mlp.parameters()))

Value and Grad
--------------

Using a :class:`Module` does not preclude using MLX's high order function
transformations (:meth:`mlx.core.value_and_grad`, :meth:`mlx.core.grad`, etc.). However,
these function transformations assume pure functions, namely the parameters
should be passed as an argument to the function being transformed.

There is an easy pattern to achieve that with MLX modules

.. code-block:: python

   model = ...

   def f(params, other_inputs):
       model.update(params)  # <---- Necessary to make the model use the passed parameters
       return model(other_inputs)

   f(model.trainable_parameters(), mx.zeros((10,)))

However, :meth:`mlx.nn.value_and_grad` provides precisely this pattern and only
computes the gradients with respect to the trainable parameters of the model.

In detail:

- it wraps the passed function with a function that calls :meth:`Module.update`
  to make sure the model is using the provided parameters.
- it calls :meth:`mlx.core.value_and_grad` to transform the function into a function
  that also computes the gradients with respect to the passed parameters.
- it wraps the returned function with a function that passes the trainable
  parameters as the first argument to the function returned by
  :meth:`mlx.core.value_and_grad`

.. autosummary::
   :toctree: _autosummary

   value_and_grad

.. toctree::

   nn/module
   nn/layers
   nn/functions
   nn/losses
   nn/init

.. _ops:

Operations
==========

.. currentmodule:: mlx.core

.. autosummary::
   :toctree: _autosummary

   abs
   add
   all
   allclose
   any
   arange
   arccos
   arccosh
   arcsin
   arcsinh
   arctan
   arctanh
   argmax
   argmin
   argpartition
   argsort
   array_equal
   broadcast_to
   ceil
   clip
   concatenate
   convolve
   conv1d
   conv2d
   cos
   cosh
   dequantize
   divide
   divmod
   equal
   erf
   erfinv
   exp
   expand_dims
   eye
   flatten
   floor
   floor_divide
   full
   greater
   greater_equal
   identity
   inner
   isnan
   isposinf
   isneginf
   isinf
   less
   less_equal
   linspace
   load
   log
   log2
   log10
   log1p
   logaddexp
   logical_not
   logical_and
   logical_or
   logsumexp
   matmul
   max
   maximum
   mean
   min
   minimum
   moveaxis
   multiply
   negative
   ones
   ones_like
   outer
   partition
   pad
   prod
   quantize
   quantized_matmul
   reciprocal
   repeat
   reshape
   round
   rsqrt
   save
   savez
   savez_compressed
   save_gguf
   save_safetensors
   sigmoid
   sign
   sin
   sinh
   softmax
   sort
   split
   sqrt
   square
   squeeze
   stack
   stop_gradient
   subtract
   sum
   swapaxes
   take
   take_along_axis
   tan
   tanh
   tensordot
   transpose
   tri
   tril
   triu
   var
   where
   zeros
   zeros_like

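A minimal sketch combining a few of the operations above (the values are only
illustrative):

.. code-block:: python

   import mlx.core as mx

   a = mx.arange(6).reshape(2, 3)

   total = a.sum()                                # reduce over all elements
   clipped = mx.clip(a, 1, 4)                     # elementwise clamp
   masked = mx.where(a > 2, a, mx.zeros_like(a))  # select based on a condition
   mx.eval(total, clipped, masked)
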
.. _optimizers:

Optimizers
==========

The optimizers in MLX can be used both with :mod:`mlx.nn` and with pure
:mod:`mlx.core` functions. A typical example involves calling
:meth:`Optimizer.update` to update a model's parameters based on the loss
gradients and subsequently calling :func:`mlx.core.eval` to evaluate both the
model's parameters and the **optimizer state**.

.. code-block:: python

   # Create a model
   model = MLP(num_layers, train_images.shape[-1], hidden_dim, num_classes)
   mx.eval(model.parameters())

   # Create the gradient function and the optimizer
   loss_and_grad_fn = nn.value_and_grad(model, loss_fn)
   optimizer = optim.SGD(learning_rate=learning_rate)

   for e in range(num_epochs):
       for X, y in batch_iterate(batch_size, train_images, train_labels):
           loss, grads = loss_and_grad_fn(model, X, y)

           # Update the model with the gradients. So far no computation has happened.
           optimizer.update(model, grads)

           # Compute the new parameters but also the optimizer state.
           mx.eval(model.parameters(), optimizer.state)

.. currentmodule:: mlx.optimizers

.. autosummary::
   :toctree: _autosummary
   :template: optimizers-template.rst

   OptimizerState
   Optimizer
   SGD
   RMSprop
   Adagrad
   Adafactor
   AdaDelta
   Adam
   AdamW
   Adamax
   Lion

.. _random:

Random
======

Random sampling functions in MLX use an implicit global PRNG state by default.
However, all functions take an optional ``key`` keyword argument for when more
fine-grained control or explicit state management is needed.

For example, you can generate random numbers with:

.. code-block:: python

   for _ in range(3):
       print(mx.random.uniform())

which will print a sequence of unique pseudo random numbers. Alternatively you
can explicitly set the key:

.. code-block:: python

   key = mx.random.key(0)
   for _ in range(3):
       print(mx.random.uniform(key=key))

which will yield the same pseudo random number at each iteration.

Following `JAX's PRNG design <https://jax.readthedocs.io/en/latest/jep/263-prng.html>`_
we use a splittable version of Threefry, which is a counter-based PRNG.

.. currentmodule:: mlx.core.random

.. autosummary::
   :toctree: _autosummary

   bernoulli
   categorical
   gumbel
   key
   normal
   randint
   seed
   split
   truncated_normal
   uniform

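For example, a minimal sketch of deriving independent subkeys with :func:`split`:

.. code-block:: python

   import mlx.core as mx

   key = mx.random.key(42)
   keys = mx.random.split(key, 2)  # two independent subkeys

   x = mx.random.normal(shape=(3,), key=keys[0])
   y = mx.random.normal(shape=(3,), key=keys[1])
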
.. _transforms:

Transforms
==========

.. currentmodule:: mlx.core

.. autosummary::
   :toctree: _autosummary

   eval
   grad
   value_and_grad
   jvp
   vjp
   vmap
   simplify

.. _utils:

Tree Utils
==========

In MLX we consider a python tree to be an arbitrarily nested collection of
dictionaries, lists and tuples without cycles. Functions in this module that
return python trees will be using the default python ``dict``, ``list`` and
``tuple`` but they can usually process objects that inherit from any of these.

.. note::
   Dictionaries should have keys that are valid python identifiers.

.. currentmodule:: mlx.utils

.. autosummary::
   :toctree: _autosummary

   tree_flatten
   tree_unflatten
   tree_map

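A minimal sketch of the tree utilities on a small parameter-like tree:

.. code-block:: python

   import mlx.core as mx
   from mlx.utils import tree_flatten, tree_map, tree_unflatten

   tree = {"w": mx.zeros((2, 2)), "layers": [{"b": mx.ones((3,))}]}

   flat = tree_flatten(tree)        # list of (dotted key, leaf) pairs, e.g. ("layers.0.b", ...)
   rebuilt = tree_unflatten(flat)   # back to the nested structure

   shapes = tree_map(lambda x: x.shape, tree)
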
644  prompts/gpts/knowledge/MLX Guru/usage.txt  Normal file
@@ -0,0 +1,644 @@

.. _function_transforms:

Function Transforms
===================

.. currentmodule:: mlx.core

MLX uses composable function transformations for automatic differentiation and
vectorization. The key idea behind composable function transformations is that
every transformation returns a function which can be further transformed.

Here is a simple example:

.. code-block:: shell

   >>> dfdx = mx.grad(mx.sin)
   >>> dfdx(mx.array(mx.pi))
   array(-1, dtype=float32)
   >>> mx.cos(mx.array(mx.pi))
   array(-1, dtype=float32)

The output of :func:`grad` on :func:`sin` is simply another function. In this
case it is the gradient of the sine function which is exactly the cosine
function. To get the second derivative you can do:

.. code-block:: shell

   >>> d2fdx2 = mx.grad(mx.grad(mx.sin))
   >>> d2fdx2(mx.array(mx.pi / 2))
   array(-1, dtype=float32)
   >>> mx.sin(mx.array(mx.pi / 2))
   array(1, dtype=float32)

Using :func:`grad` on the output of :func:`grad` is always ok. You keep
getting higher order derivatives.

Any of the MLX function transformations can be composed in any order to any
depth. To see the complete list of function transformations check out the
:ref:`API documentation <transforms>`. See the following sections for more
information on :ref:`automatic differentiation <auto diff>` and
:ref:`automatic vectorization <vmap>`.

Automatic Differentiation
-------------------------

.. _auto diff:

Automatic differentiation in MLX works on functions rather than on implicit
graphs.

.. note::

   If you are coming to MLX from PyTorch, you no longer need functions like
   ``backward``, ``zero_grad``, and ``detach``, or properties like
   ``requires_grad``.

The most basic example is taking the gradient of a scalar-valued function as we
saw above. You can use the :func:`grad` and :func:`value_and_grad` functions to
compute gradients of more complex functions. By default these functions compute
the gradient with respect to the first argument:

.. code-block:: python

   def loss_fn(w, x, y):
       return mx.mean(mx.square(w * x - y))

   w = mx.array(1.0)
   x = mx.array([0.5, -0.5])
   y = mx.array([1.5, -1.5])

   # Computes the gradient of loss_fn with respect to w:
   grad_fn = mx.grad(loss_fn)
   dloss_dw = grad_fn(w, x, y)
   # Prints array(-1, dtype=float32)
   print(dloss_dw)

   # To get the gradient with respect to x we can do:
   grad_fn = mx.grad(loss_fn, argnums=1)
   dloss_dx = grad_fn(w, x, y)
   # Prints array([-1, 1], dtype=float32)
   print(dloss_dx)

One way to get the loss and gradient is to call ``loss_fn`` followed by
``grad_fn``, but this can result in a lot of redundant work. Instead, you
should use :func:`value_and_grad`. Continuing the above example:

.. code-block:: python

   # Computes the gradient of loss_fn with respect to w:
   loss_and_grad_fn = mx.value_and_grad(loss_fn)
   loss, dloss_dw = loss_and_grad_fn(w, x, y)

   # Prints array(1, dtype=float32)
   print(loss)

   # Prints array(-1, dtype=float32)
   print(dloss_dw)

You can also take the gradient with respect to arbitrarily nested Python
containers of arrays (specifically any of :obj:`list`, :obj:`tuple`, or
:obj:`dict`).

Suppose we wanted a weight and a bias parameter in the above example. A nice
way to do that is the following:

.. code-block:: python

   def loss_fn(params, x, y):
       w, b = params["weight"], params["bias"]
       h = w * x + b
       return mx.mean(mx.square(h - y))

   params = {"weight": mx.array(1.0), "bias": mx.array(0.0)}
   x = mx.array([0.5, -0.5])
   y = mx.array([1.5, -1.5])

   # Computes the gradient of loss_fn with respect to both the
   # weight and bias:
   grad_fn = mx.grad(loss_fn)
   grads = grad_fn(params, x, y)

   # Prints
   # {'weight': array(-1, dtype=float32), 'bias': array(0, dtype=float32)}
   print(grads)

Notice the tree structure of the parameters is preserved in the gradients.

In some cases you may want to stop gradients from propagating through a
part of the function. You can use :func:`stop_gradient` for that.

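For example, a minimal sketch in which one term is treated as a constant:

.. code-block:: python

   import mlx.core as mx

   def fun(x):
       # No gradient flows through the stop_gradient term
       return mx.square(x).sum() + mx.stop_gradient(x).sum()

   dfdx = mx.grad(fun)(mx.array([1.0, 2.0]))
   print(dfdx)  # gradient of the squared term only, i.e. 2 * x
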
Automatic Vectorization
-----------------------

.. _vmap:

Use :func:`vmap` to automate vectorizing complex functions. Here we'll go
through a basic and contrived example for the sake of clarity, but :func:`vmap`
can be quite powerful for more complex functions which are difficult to optimize
by hand.

.. warning::

   Some operations are not yet supported with :func:`vmap`. If you encounter an error
   like: ``ValueError: Primitive's vmap not implemented.`` file an `issue
   <https://github.com/ml-explore/mlx/issues>`_ and include your function.
   We will prioritize including it.

A naive way to add the elements from two sets of vectors is with a loop:

.. code-block:: python

   xs = mx.random.uniform(shape=(4096, 100))
   ys = mx.random.uniform(shape=(100, 4096))

   def naive_add(xs, ys):
       return [xs[i] + ys[:, i] for i in range(xs.shape[1])]

Instead you can use :func:`vmap` to automatically vectorize the addition:

.. code-block:: python

   # Vectorize over the second dimension of x and the
   # first dimension of y
   vmap_add = mx.vmap(lambda x, y: x + y, in_axes=(1, 0))

The ``in_axes`` parameter can be used to specify which dimensions of the
corresponding input to vectorize over. Similarly, use ``out_axes`` to specify
where the vectorized axes should be in the outputs.

Let's time these two different versions:

.. code-block:: python

   import timeit

   print(timeit.timeit(lambda: mx.eval(naive_add(xs, ys)), number=100))
   print(timeit.timeit(lambda: mx.eval(vmap_add(xs, ys)), number=100))

On an M1 Max the naive version takes in total ``0.390`` seconds whereas the
vectorized version takes only ``0.025`` seconds, more than ten times faster.

Of course, this operation is quite contrived. A better approach is to simply do
``xs + ys.T``, but for more complex functions :func:`vmap` can be quite handy.

.. _indexing:

Indexing Arrays
===============

.. currentmodule:: mlx.core

For the most part, indexing an MLX :obj:`array` works the same as indexing a
NumPy :obj:`numpy.ndarray`. See the `NumPy documentation
<https://numpy.org/doc/stable/user/basics.indexing.html>`_ for more details on
how that works.

For example, you can use regular integers and slices (:obj:`slice`) to index arrays:

.. code-block:: shell

   >>> arr = mx.arange(10)
   >>> arr[3]
   array(3, dtype=int32)
   >>> arr[-2]  # negative indexing works
   array(8, dtype=int32)
   >>> arr[2:8:2]  # start, stop, stride
   array([2, 4, 6], dtype=int32)

For multi-dimensional arrays, the ``...`` or :obj:`Ellipsis` syntax works as in NumPy:

.. code-block:: shell

   >>> arr = mx.arange(8).reshape(2, 2, 2)
   >>> arr[:, :, 0]
   array([[0, 2],
          [4, 6]], dtype=int32)
   >>> arr[..., 0]
   array([[0, 2],
          [4, 6]], dtype=int32)

You can index with ``None`` to create a new axis:

.. code-block:: shell

   >>> arr = mx.arange(8)
   >>> arr.shape
   [8]
   >>> arr[None].shape
   [1, 8]

You can also use an :obj:`array` to index another :obj:`array`:

.. code-block:: shell

   >>> arr = mx.arange(10)
   >>> idx = mx.array([5, 7])
   >>> arr[idx]
   array([5, 7], dtype=int32)

Mixing and matching integers, :obj:`slice`, ``...``, and :obj:`array` indices
works just as in NumPy.

Other functions which may be useful for indexing arrays are :func:`take` and
:func:`take_along_axis`.

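A minimal sketch of both functions (the index values are only illustrative):

.. code-block:: python

   import mlx.core as mx

   a = mx.array([[1, 2, 3], [4, 5, 6]])

   # Gather whole rows by index
   rows = mx.take(a, mx.array([1, 0]), axis=0)

   # Gather one entry per row using per-row column indices
   idx = mx.array([[2], [0]])
   picked = mx.take_along_axis(a, idx, axis=1)
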
Differences from NumPy
----------------------

.. note::

   MLX indexing is different from NumPy indexing in two important ways:

   * Indexing does not perform bounds checking. Indexing out of bounds is
     undefined behavior.
   * Boolean mask based indexing is not yet supported.

The reason for the lack of bounds checking is that exceptions cannot propagate
from the GPU. Performing bounds checking for array indices before launching the
kernel would be extremely inefficient.

Indexing with boolean masks is something that MLX may support in the future. In
general, MLX has limited support for operations for which outputs
*shapes* are dependent on input *data*. Other examples of these types of
operations which MLX does not yet support include :func:`numpy.nonzero` and the
single input version of :func:`numpy.where`.

In Place Updates
----------------

In place updates to indexed arrays are possible in MLX. For example:

.. code-block:: shell

   >>> a = mx.array([1, 2, 3])
   >>> a[2] = 0
   >>> a
   array([1, 2, 0], dtype=int32)

Just as in NumPy, in place updates will be reflected in all references to the
same array:

.. code-block:: shell

   >>> a = mx.array([1, 2, 3])
   >>> b = a
   >>> b[2] = 0
   >>> b
   array([1, 2, 0], dtype=int32)
   >>> a
   array([1, 2, 0], dtype=int32)

Transformations of functions which use in-place updates are allowed and work as
expected. For example:

.. code-block:: python

   def fun(x, idx):
       x[idx] = 2.0
       return x.sum()

   dfdx = mx.grad(fun)(mx.array([1.0, 2.0, 3.0]), mx.array([1]))
   print(dfdx)  # Prints: array([1, 0, 1], dtype=float32)

In the above, ``dfdx`` will have the correct gradient, namely zeros at ``idx``
and ones elsewhere.

.. _lazy eval:

Lazy Evaluation
===============

.. currentmodule:: mlx.core

Why Lazy Evaluation
-------------------

When you perform operations in MLX, no computation actually happens. Instead a
compute graph is recorded. The actual computation only happens if an
:func:`eval` is performed.

MLX uses lazy evaluation because it has some nice features, some of which we
describe below.

Transforming Compute Graphs
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Lazy evaluation lets us record a compute graph without actually doing any
computations. This is useful for function transformations like :func:`grad` and
:func:`vmap` and graph optimizations like :func:`simplify`.

Currently, MLX does not compile and rerun compute graphs. They are all
generated dynamically. However, lazy evaluation makes it much easier to
integrate compilation for future performance enhancements.

Only Compute What You Use
^^^^^^^^^^^^^^^^^^^^^^^^^

In MLX you do not need to worry as much about computing outputs that are never
used. For example:

.. code-block:: python

   def fun(x):
       a = fun1(x)
       b = expensive_fun(a)
       return a, b

   y, _ = fun(x)

Here, we never actually compute the output of ``expensive_fun``. Use this
pattern with care though, as the graph of ``expensive_fun`` is still built, and
that has some cost associated to it.

Similarly, lazy evaluation can be beneficial for saving memory while keeping
code simple. Say you have a very large model ``Model`` derived from
:obj:`mlx.nn.Module`. You can instantiate this model with ``model = Model()``.
Typically, this will initialize all of the weights as ``float32``, but the
initialization does not actually compute anything until you perform an
:func:`eval`. If you update the model with ``float16`` weights, your maximum
consumed memory will be half that required if eager computation was used
instead.

This pattern is simple to do in MLX thanks to lazy computation:

.. code-block:: python

   model = Model()  # no memory used yet
   model.load_weights("weights_fp16.safetensors")

When to Evaluate
----------------

A common question is when to use :func:`eval`. The trade-off is between
letting graphs get too large and not batching enough useful work.

For example:

.. code-block:: python

   for _ in range(100):
       a = a + b
       mx.eval(a)
       b = b * 2
       mx.eval(b)

This is a bad idea because there is some fixed overhead with each graph
evaluation. On the other hand, there is some slight overhead which grows with
the compute graph size, so extremely large graphs (while computationally
correct) can be costly.

Luckily, a wide range of compute graph sizes work pretty well with MLX:
anything from a few tens of operations to many thousands of operations per
evaluation should be okay.

Most numerical computations have an iterative outer loop (e.g. the iteration in
stochastic gradient descent). A natural and usually efficient place to use
:func:`eval` is at each iteration of this outer loop.

Here is a concrete example:

.. code-block:: python

   for batch in dataset:

       # Nothing has been evaluated yet
       loss, grad = value_and_grad_fn(model, batch)

       # Still nothing has been evaluated
       optimizer.update(model, grad)

       # Evaluate the loss and the new parameters which will
       # run the full gradient computation and optimizer update
       mx.eval(loss, model.parameters())

An important behavior to be aware of is when the graph will be implicitly
evaluated. Anytime you ``print`` an array, convert it to an
:obj:`numpy.ndarray`, or otherwise access its memory via :obj:`memoryview`,
the graph will be evaluated. Saving arrays via :func:`save` (or any other MLX
saving functions) will also evaluate the array.

Calling :func:`array.item` on a scalar array will also evaluate it. In the
example above, printing the loss (``print(loss)``) or adding the loss scalar to
a list (``losses.append(loss.item())``) would cause a graph evaluation. If
these lines are before ``mx.eval(loss, model.parameters())`` then this
will be a partial evaluation, computing only the forward pass.

Also, calling :func:`eval` on an array or set of arrays multiple times is
perfectly fine. This is effectively a no-op.

.. warning::

   Using scalar arrays for control-flow will cause an evaluation.

Here is an example:

.. code-block:: python

   def fun(x):
       h, y = first_layer(x)
       if y > 0:  # An evaluation is done here!
           z = second_layer_a(h)
       else:
           z = second_layer_b(h)
       return z

Using arrays for control flow should be done with care. The above example works
and can even be used with gradient transformations. However, this can be very
inefficient if evaluations are done too frequently.

.. _numpy:

Conversion to NumPy and Other Frameworks
========================================

MLX array implements the `Python Buffer Protocol <https://docs.python.org/3/c-api/buffer.html>`_.
Let's convert an array to NumPy and back.

.. code-block:: python

   import mlx.core as mx
   import numpy as np

   a = mx.arange(3)
   b = np.array(a)  # copy of a
   c = mx.array(b)  # copy of b

.. note::

   Since NumPy does not support ``bfloat16`` arrays, you will need to convert to ``float16`` or ``float32`` first:
   ``np.array(a.astype(mx.float32))``.
   Otherwise, you will receive an error like: ``Item size 2 for PEP 3118 buffer format string does not match the dtype V item size 0.``

By default, NumPy copies data to a new array. This can be prevented by creating an array view:

.. code-block:: python

   a = mx.arange(3)
   a_view = np.array(a, copy=False)
   print(a_view.flags.owndata)  # False
   a_view[0] = 1
   print(a[0].item())  # 1

A NumPy array view is a normal NumPy array, except that it does not own its memory.
This means writing to the view is reflected in the original array.

While this is quite powerful to prevent copying arrays, it should be noted that external changes to the memory of arrays cannot be reflected in gradients.

Let's demonstrate this in an example:

.. code-block:: python

   def f(x):
       x_view = np.array(x, copy=False)
       x_view[:] *= x_view  # modify memory without telling mx
       return x.sum()

   x = mx.array([3.0])
   y, df = mx.value_and_grad(f)(x)
   print("f(x) = x² =", y.item())  # 9.0
   print("f'(x) = 2x !=", df.item())  # 1.0

The function ``f`` indirectly modifies the array ``x`` through a memory view.
However, this modification is not reflected in the gradient, as seen in the last line outputting ``1.0``,
representing the gradient of the sum operation alone.
The squaring of ``x`` occurs externally to MLX, meaning that no gradient is incorporated.
It's important to note that a similar issue arises during array conversion and copying.
For instance, a function defined as ``mx.array(np.array(x)**2).sum()`` would also result in an incorrect gradient,
even though no in-place operations on MLX memory are executed.

PyTorch
-------

.. warning::

   PyTorch support for :obj:`memoryview` is experimental and can break for
   multi-dimensional arrays. Casting to NumPy first is advised for now.

PyTorch supports the buffer protocol, but it requires an explicit :obj:`memoryview`.

.. code-block:: python

   import mlx.core as mx
   import torch

   a = mx.arange(3)
   b = torch.tensor(memoryview(a))
   c = mx.array(b.numpy())

Conversion from PyTorch tensors back to arrays must be done via intermediate NumPy arrays with ``numpy()``.

JAX
---

JAX fully supports the buffer protocol.

.. code-block:: python

   import mlx.core as mx
   import jax.numpy as jnp

   a = mx.arange(3)
   b = jnp.array(a)
   c = mx.array(b)

TensorFlow
----------

TensorFlow supports the buffer protocol, but it requires an explicit :obj:`memoryview`.

.. code-block:: python

   import mlx.core as mx
   import tensorflow as tf

   a = mx.arange(3)
   b = tf.constant(memoryview(a))
   c = mx.array(b)

.. _saving_and_loading:

Saving and Loading Arrays
=========================

.. currentmodule:: mlx.core

MLX supports multiple array serialization formats.

.. list-table:: Serialization Formats
   :widths: 20 8 25 25
   :header-rows: 1

   * - Format
     - Extension
     - Function
     - Notes
   * - NumPy
     - ``.npy``
     - :func:`save`
     - Single arrays only
   * - NumPy archive
     - ``.npz``
     - :func:`savez` and :func:`savez_compressed`
     - Multiple arrays
   * - Safetensors
     - ``.safetensors``
     - :func:`save_safetensors`
     - Multiple arrays
   * - GGUF
     - ``.gguf``
     - :func:`save_gguf`
     - Multiple arrays

The :func:`load` function will load any of the supported serialization
formats. It determines the format from the extension. The output of
:func:`load` depends on the format.

Here's an example of saving a single array to a file:

.. code-block:: shell

   >>> a = mx.array([1.0])
   >>> mx.save("array", a)

The array ``a`` will be saved in the file ``array.npy`` (notice the extension
is automatically added). Including the extension is optional; if it is missing
it will be added. You can load the array with:

.. code-block:: shell

   >>> mx.load("array.npy")
   array([1], dtype=float32)

Here's an example of saving several arrays to a single file:

.. code-block:: shell

   >>> a = mx.array([1.0])
   >>> b = mx.array([2.0])
   >>> mx.savez("arrays", a, b=b)

For compatibility with :func:`numpy.savez` the MLX :func:`savez` takes arrays
as arguments. If the keywords are missing, then default names will be
provided. This can be loaded with:

.. code-block:: shell

   >>> mx.load("arrays.npz")
   {'b': array([2], dtype=float32), 'arr_0': array([1], dtype=float32)}

In this case :func:`load` returns a dictionary of names to arrays.

The functions :func:`save_safetensors` and :func:`save_gguf` are similar to
:func:`savez`, but they take as input a :obj:`dict` of string names to arrays:

.. code-block:: shell

   >>> a = mx.array([1.0])
   >>> b = mx.array([2.0])
   >>> mx.save_safetensors("arrays", {"a": a, "b": b})
