So if a single hyper-parameter controlling the learning rate is the usual way to tune gradient descent, how can this possibly be improved? The gradient estimate varies from batch to batch, so any approximation of future gradient changes is itself distributed according to the batch. Averaging over batches therefore gives a more stable basis from which to infer an accelerated projection of the future descent.
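As a rough illustration of the averaging idea, the sketch below smooths noisy per-batch gradients with an exponential moving average and steps along the smoothed direction. The toy quadratic loss, the Gaussian noise model, the names `noisy_gradient`, `ema_gradient_descent` and `beta`, and the choice of an exponential average are all assumptions made for illustration; the text does not say which kind of average is meant.

```python
import random


def noisy_gradient(w, rng, noise=1.0):
    # Toy loss 0.5 * w**2 has gradient w; the Gaussian term stands in
    # for batch-to-batch variation (an assumed noise model).
    return w + rng.gauss(0.0, noise)


def ema_gradient_descent(w0, steps=200, lr=0.1, beta=0.9, seed=0):
    """Descend along an exponential moving average of noisy gradients."""
    rng = random.Random(seed)
    w, avg = w0, 0.0
    raw, smoothed = [], []
    for t in range(1, steps + 1):
        g = noisy_gradient(w, rng)
        avg = beta * avg + (1.0 - beta) * g
        avg_hat = avg / (1.0 - beta ** t)  # correct the bias from avg starting at 0
        w -= lr * avg_hat
        raw.append(g)
        smoothed.append(avg_hat)
    return w, raw, smoothed


def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


w_final, raw, smoothed = ema_gradient_descent(w0=5.0)
# Compare the spread of raw and averaged gradients once the transient has died out.
print(variance(raw[-100:]), variance(smoothed[-100:]))
```

The averaged gradient fluctuates far less than the raw per-batch gradient, which is what makes it a plausible input for any extrapolation of the descent path.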
The biggest problem to consider is bounded oscillation. This arises when the accelerated projection magnifies the learning delta so much that the updates become locally asymptotically non-convergent (the reverse symmetry of summation acceleration, where the divergent terms are considered as “merging toward” the first-term limit). In some instances, but not all, this would still converge as a metaseries. It then becomes essential to scale the approximations by inverse power weighting to obtain convergence for highly entropic, unstable weights. It may also indicate that weight decomposition could be an effective strategy: split a neuron into the stable (time-aligned) and the unstable (time-inverted) partitions of a signal.
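To make the effect of inverse power weighting concrete, here is a minimal sketch on a toy sequence of update deltas that flip sign at every step. The helper names `partial_sums` and `inverse_power_weighted`, the exponent `p`, and the choice to weight the k-th term by 1/k^p are assumptions for illustration, not a scheme taken from the text.

```python
import math


def partial_sums(terms):
    total, out = 0.0, []
    for t in terms:
        total += t
        out.append(total)
    return out


def inverse_power_weighted(terms, p=1.0):
    # Scale the k-th term (k starting at 1) by 1 / k**p.
    return [t / (k ** p) for k, t in enumerate(terms, start=1)]


# Toy update deltas that flip sign every step: +1, -1, +1, ...
deltas = [(-1.0) ** k for k in range(1000)]

unweighted = partial_sums(deltas)                        # bounces between 1 and 0 forever
weighted = partial_sums(inverse_power_weighted(deltas))  # alternating harmonic series

limit = math.log(2)  # known limit of 1 - 1/2 + 1/3 - ...
print(unweighted[-2:], weighted[-1], limit)
```

With `p = 1` the bounded oscillation turns into the alternating harmonic series, which converges to ln 2. Larger `p` damps the later terms more aggressively, at the cost of discounting late information.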
Assuming the unstable partition has a repellor (the opposite of an attractor in chaos theory), a model of it could be used to invert the accelerated projection toward the repellor. If the accelerated series is approximated by an integral, would the unstable inverse acceleration perhaps be a reversal of the limits of integration? Or a sign reversal of the limits?
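For reference, the two candidate operations are not equivalent. Writing the approximating integral with a generic integrand f and limits a, b (symbols introduced here only for illustration), swapping the limits merely negates the value, whereas negating the limits also evaluates the integrand at time-reversed arguments:

```latex
% Reversal of the limits of integration: only the overall sign changes.
\int_{b}^{a} f(t)\,dt = -\int_{a}^{b} f(t)\,dt

% Sign reversal of the limits: substitute t = -u, so dt = -du.
\int_{-a}^{-b} f(t)\,dt = -\int_{a}^{b} f(-u)\,du
```

Only the second form involves f(-u), so it is arguably the closer match to a time-inverted partition, though which one is appropriate remains an open question here.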
In a sense, this splits the network into a composition of multiple networks, with the partitions determined by the number of critical negative signs (or, more precisely, the number of quantities that could have negative signs). In this case there is just one sign, for time, which acts like a hyper-parameter convergence property. After decomposition, the algorithm can then be optimized specifically for each partition.
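One hypothetical way to picture per-partition optimization is sketched below. It assumes, purely for illustration, that a weight counts as "stable" when its gradient keeps the same sign across two consecutive steps and "unstable" when the sign flips, and it gives each partition its own step size. This criterion is swapped in here and loosely resembles sign-based schemes such as Rprop; the names `partition_by_sign`, `partitioned_step`, `lr_stable` and `lr_unstable` are not from the text.

```python
def partition_by_sign(prev_grads, grads):
    """Split weight indices by whether the gradient sign persisted."""
    stable, unstable = [], []
    for i, (g0, g1) in enumerate(zip(prev_grads, grads)):
        # A zero product (either gradient is zero) is treated as unstable.
        (stable if g0 * g1 > 0 else unstable).append(i)
    return stable, unstable


def partitioned_step(weights, prev_grads, grads, lr_stable=0.1, lr_unstable=0.01):
    """Apply a separate learning rate to each partition (assumed rates)."""
    stable, unstable = partition_by_sign(prev_grads, grads)
    new_weights = list(weights)
    for i in stable:
        new_weights[i] -= lr_stable * grads[i]
    for i in unstable:
        new_weights[i] -= lr_unstable * grads[i]
    return new_weights, stable, unstable


new_w, stable, unstable = partitioned_step(
    weights=[1.0, 1.0],
    prev_grads=[0.5, -0.5],
    grads=[0.4, 0.6],
)
print(new_w, stable, unstable)
```

Here the first weight takes the larger step because its gradient sign persisted, while the second, whose sign flipped, is updated cautiously. A fuller treatment would replace the fixed rates with whatever optimizer suits each partition.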