•  We usually use the “default” values for the hyperparameters β1, β2 and ε in Adam (β1 = 0.9, β2 = 0.999, ε = 10⁻⁸).
•  The learning rate hyperparameter α in Adam usually needs to be tuned.
•  Adam is typically used with mini-batch gradient descent; it does not require full-batch gradient computations.
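The points above can be sketched in a minimal NumPy implementation of the Adam update rule (the function name `adam_step` and the quadratic toy objective are illustrative choices, not from the notes); note that β1, β2, and ε keep their standard defaults while α is passed in explicitly, since it usually needs tuning:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. beta1, beta2, eps use the standard defaults;
    alpha typically needs to be tuned per problem."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction for initialization at zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(theta) = theta^2, with alpha tuned up from the default
theta, m, v = np.array([5.0]), 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * theta                             # gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.1)
```

In practice each `grad` would come from a mini-batch of training examples; the update rule itself is unchanged.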