Gradients and Descents

Consider a backpropagation pass that has just been applied to a network under learning. The various weights will have changed by various amounts. If a weight changes little it can be considered good. If a weight changes a lot it can be considered an essential definer weight. Take the maximal definer weight (the one with the greatest change) and change it a further per cent in its defined direction. Feed the network forward and backpropagate again. Many of the good weights will move back closer to where they were before the definer pass and can be considered excellent. Others will deviate further and be considered merely ok.
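
A minimal sketch of this pass in NumPy, under the assumption that `delta1` holds the per-weight change from the first backpropagation and `delta2` the change after the nudge; the threshold `small` and the `percent` step are illustrative values, not part of the original.

```python
import numpy as np

# Tally codes used later in the text: excellent(0), good(1), ok(2), definer(3).
EXCELLENT, GOOD, OK, DEFINER = 0, 1, 2, 3

def nudge_maximal_definer(weights, delta1, percent=0.01):
    """Change the maximal definer weight (the one with the greatest change)
    a further per cent in its defined direction."""
    idx = int(np.argmax(np.abs(delta1)))
    w = weights.copy()
    w[idx] += percent * np.sign(delta1[idx]) * np.abs(w[idx])
    return w, idx

def classify(delta1, delta2, small=1e-4):
    """After the second backpropagation: small first change -> good,
    large -> definer; good weights that reverse back toward their old
    value become excellent, good weights that deviate further become ok."""
    tally = np.where(np.abs(delta1) < small, GOOD, DEFINER)
    good = tally == GOOD
    moved_back = np.sign(delta2) == -np.sign(delta1)   # reversing the first change
    tally[good & moved_back] = EXCELLENT
    tally[good & ~moved_back] = OK
    return tally
```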

The signed tally of definer (3), excellent (0), good (1) and ok (2) can be stored as a programming variable in each neuron. The per cent applied to a definer, or more explicitly the definer’s history-deviation product used as a weighting on the per cent in the definer’s direction, makes a training map which is not necessary for using the net after training is finished. It does, however, enable further processing such as “excellent definer” detection. What does that mean?

In a continual learning system, it indicates a new rationale requirement for the problem, as an unexpected change has developed in an excellently performing neuron. The tally itself could also be considered an auxiliary output of any neuron, but what would be a suitable backpropagation for it? Why would it even need one? Is it not just another round of input to the network (perhaps not applied to the first layer, but then inputs don’t always have to be)?

Defining the concept of definer epilepsy, where the definer oscillates due to weight gradient magnification, implies that the tally needs to be a signed quantity and that weight normalization toward zero should also be present. It requires, though this has not been proven to be the only sufficient condition, that per cent growth away from zero be weighted slightly less than per cent reduction toward zero. This can be factored into an asymmetry stability meta-parameter.
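
A sketch of that asymmetry, assuming the per cent step is multiplicative on the weight magnitude; the `asymmetry` factor below is an illustrative meta-parameter, not a derived value.

```python
import numpy as np

def asymmetric_percent_step(w, direction, percent=0.01, asymmetry=0.95):
    """Apply a signed per cent step to weights w, weighting growth away
    from zero slightly less than reduction toward zero, so an oscillating
    definer decays instead of magnifying (definer epilepsy)."""
    step = percent * np.abs(w) * np.sign(direction)
    growing = np.sign(step) == np.sign(w)          # step moves |w| away from zero
    factor = np.where(growing, asymmetry, 1.0)     # grow slightly less than shrink
    return w + factor * step
```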

A net of this form can have memory. The oscillation of definer neurons can represent state information. They can also define the modality of the net’s knowledge in application readiness while keeping the excellent all-purpose neurons stable. The next step is physical and affine coder estimators.

Limit Sums

The convergence sequence on a weighting can be considered isomorphic to a limit sum series acceleration. The net can be “thrown” forward to an estimate of an infinity of training cycles on the examples. Effectiveness can be evaluated, and data estimated over a “window” on the sum as an inner product on weightings, with bounds control mechanisms yet to be confirmed. PID control systems suggest, as a first estimate, that differentials and integrals to reduce error and increase convergence speed are appropriate factors to measure.
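
A first-estimate sketch of that control loop, assuming the controlled quantity is the per cent applied to the definer and the error signal is the training loss; the gains are placeholders.

```python
class PercentPID:
    """PID control on the training error: proportional, integral and
    differential terms modulate the per cent applied to the definer,
    aiming to reduce error and increase convergence speed."""
    def __init__(self, kp=0.1, ki=0.01, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        self.integral += error
        diff = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * diff
```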

Dynamics on the per cent definers, so to speak. And it came to pass that adaptivity increased and performance metrics were good, but then irrelevant as newer, better, more relevant ones took hold from the duties of the net. Gundup and Ciders incorporated had a little hindsight problem to solve.

Fractal Affine Representation

Going back to 1991, Michael Barnsley developed a fractal image compression system (the Iterated Systems FIF file format). The process was considered computationally intensive in time for very good compression. Experiments with the FIASCO compression system, an open-source derivative, indicate that the best performance lies at low quality (about 50%): it is very fast, but not exact. If the compressed image is subtracted from the input image and the residual further compressed a number of times, performance improves dramatically.
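
A sketch of that residual loop. The `fractal_compress` and `fractal_decompress` callables are hypothetical stand-ins for a FIASCO-like codec at low quality; they are not real library functions.

```python
import numpy as np

def residual_fractal_compress(image, fractal_compress, fractal_decompress,
                              passes=3, quality=50):
    """Compress an image as a low-quality fractal code plus a stack of
    residual codes: each pass compresses what the previous passes missed."""
    residual = image.astype(np.float64)
    codes = []
    reconstruction = np.zeros_like(residual)
    for _ in range(passes):
        code = fractal_compress(residual, quality=quality)   # hypothetical codec call
        approx = fractal_decompress(code)                    # hypothetical codec call
        codes.append(code)
        reconstruction += approx
        residual = residual - approx      # the next pass works on what remains
    return codes, reconstruction
```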

Dissociating secondaries and tertiaries from the primary affine set allows disjunct affine sets to be constructed for equivalent compression performance, where even a zip compression can remove further information redundancy. The affine sets can be used as input to a network, and in some sense the net can develop a sort of affine invariance over the processed fractals. The data reduction of the affine compression is also likely to lead to better utilization of the net than a convolutional CNN.

The Four Colour Disjunction Theorem

Consider an extended ensemble. The first layer could be considered a fully connected distributor. The last layer could be considered to unify the output by being fully connected. Intermediate layers can be either fully connected or colour-limited connected, where only neurons of a colour connect to neurons of the same colour in the next layer. This provides disjunction of weights between layers and removes a competition on the gradient between colours.
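
A sketch of a colour-limited connected layer as a colour mask over a weight matrix, assuming neurons are assigned colours round-robin; in a training framework the same mask would also be applied to the gradient.

```python
import numpy as np

def colour_mask(n_in, n_out, colours=4):
    """Allow connections only between neurons of the same colour."""
    cin = np.arange(n_in) % colours
    cout = np.arange(n_out) % colours
    return (cout[:, None] == cin[None, :]).astype(np.float64)

class ColourLimitedLayer:
    """A linear layer whose weights are disjunct per colour, so there is
    no gradient competition between colours."""
    def __init__(self, n_in, n_out, colours=4):
        self.mask = colour_mask(n_in, n_out, colours)
        self.w = np.random.randn(n_out, n_in) * 0.01 * self.mask

    def forward(self, x):
        return (self.w * self.mask) @ x   # masked weights keep the colours separate
```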

Four is really just a way of seeing the colour partition and it does not have to be four. Is an ensemble of two nets of half size better, for the same time and space complexity of computation, with a resulting lower accuracy in any one colour channel but in total a higher discriminatory performance from the disjunction of the feature detection?

The leaking of cross-information can also be reduced if the feature sets are considered disjunct. Each feature under low to no detection would not bleed into features under medium to high activation. Is the concept of grouped quench useful?

Query Key Transformer Reduction

From a switching idea in telecommunications, an N*N array can be reduced (remaining mostly functional, due to sparsity) to an N*L array pair and an L*L array. Any cross-product essentially becomes, from its routing of an in to an out, a set of three sequential routings, with the first and last being the compression and expansion multiplex to the smaller switch. Cross-talk grows to some extent, but this “bleed” of attention is a small consideration given the variance spread of having three routing weights multiplying up to the one effective weight, and computation is less because L is a smaller number than N.
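
A sketch of the reduction with placeholder sizes, assuming the N*N routing array is replaced by an N-to-L compression multiplex, an L*L switch and an L-to-N expansion multiplex.

```python
import numpy as np

N, L = 512, 32   # L much smaller than N

compress = np.random.randn(L, N) / np.sqrt(N)       # compression multiplex
small_switch = np.random.randn(L, L) / np.sqrt(L)   # the smaller switch
expand = np.random.randn(N, L) / np.sqrt(L)         # expansion multiplex

def reduced_routing(x):
    """Three sequential routings standing in for one N*N array."""
    return expand @ (small_switch @ (compress @ x))

# The implied effective N*N array is the product of the three stages,
# with 2*N*L + L*L parameters instead of N*N.
effective = expand @ small_switch @ compress
y = reduced_routing(np.random.randn(N))
```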

The Giant Neuron Hypothesis

Considering that the output stage of a neuronal model is a level-sliced integrator of sorts, the construction of RNN cells would seem obvious. The hypothesis asks whether it is logical to consider the layers previous to an “integration” layer effectively an input stage, where the whole network is a gigantic neuron and integration is performed on various nonlinear functions. Each integration channel can be considered independent, but could also have post layers for further joining of integral terms. The integration time can be considered another input set, one per integrator functional. To maintain tensor shape, as two inputs per integrator are supplied, the first differential would be good to include as well, especially where feedback can be applied.
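
A sketch of such an integration layer, assuming each independent channel receives a value, its first differential and a per-channel integration time; the leaky accumulation below is one illustrative choice, not the hypothesis itself.

```python
import numpy as np

class IntegrationLayer:
    """Independent integration channels over the outputs of the 'input
    stage' layers; each channel accumulates its input and first
    differential with a per-channel integration time."""
    def __init__(self, channels):
        self.state = np.zeros(channels)

    def forward(self, x, dx, integration_time):
        # leaky integration: the decay is set by the per-channel time constant
        decay = np.exp(-1.0 / np.maximum(integration_time, 1e-6))
        self.state = decay * self.state + (1.0 - decay) * (x + dx)
        return self.state
```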

This leads to the idea of the silicon connectome. Integration becomes the nonlinearity of choice in time: a softmax divided by the variable, as goes with (e^x - 1)/x. A groovemax, if you will. The extra uni-neuron integration layer offers the extra time feature of future estimation at an endpoint integral of the network’s evolved choice. The complexity lies in backpropagating the limit sum through fixed constants and differentiable functions, for a zero-adjustable layer insert with scaled estimation of earlier weight adjustments on previous samples in the time series under integration, toward an ideal propagatable.
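
The groovemax as read from the text is (e^x - 1)/x, which has a removable singularity at x = 0 with limit value 1; a small sketch:

```python
import numpy as np

def groovemax(x):
    """(e^x - 1) / x, taking the limit value 1 at x = 0."""
    x = np.asarray(x, dtype=np.float64)
    safe = np.where(x == 0.0, 1.0, x)              # avoid dividing by zero
    return np.where(x == 0.0, 1.0, np.expm1(x) / safe)
```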

This network idea is not necessarily recursive; it may just be an applied network with a global time delta since the last evaluation, for continuation of the processing of time series information. The actual recursive use of networks with GRU and LSTM cells might benefit from this kind of global integration processing, but can GRU and LSTM be improved? Bistable cells say yes, giving a kind of registered sequential logic on the combinationals. Considering that a Moore state machine layout might reduce more cleanly to efficiency, a kind of register layer pair for production and consumption to bracket the net is under consideration.

The producer layer is easily made differentiable by being a weighted sum junction between the input and the feedback from the consumer layer. The consumer layer is more complex when differentiability is considered. The consumer register could really be replaced by a zeroth-differential prediction of the future sample given past samples. This has the interesting property of pseudo-presenting the output of the network as a consumptive of the input, which allows use of the output in the backpropagation as input to modify weights when learning the feedback. The consumer must be a passthrough from its input to its output, while storage of samples for predictive differential generation is allowed.
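
A sketch of the register pair, assuming the producer is a weighted junction of the input and the consumer feedback, and the consumer is a passthrough that stores samples for a zeroth-differential (hold-last-value) prediction; the `mix` weighting is an illustrative assumption.

```python
import numpy as np

class ProducerConsumerRegisters:
    """Register layer pair bracketing a net: the producer mixes the input
    with feedback from the consumer; the consumer passes its input through
    while storing samples for predictive generation."""
    def __init__(self, size, mix=0.5):
        self.mix = mix
        self.history = [np.zeros(size)]

    def produce(self, x):
        feedback = self.predict()                        # zeroth-differential prediction
        return self.mix * x + (1.0 - self.mix) * feedback

    def consume(self, y):
        self.history.append(y)                           # stored for prediction
        return y                                         # passthrough

    def predict(self):
        return self.history[-1]                          # hold the last sample
```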

So it’s really some kind of propagational Mealy state machine. An MNN, if you’d kindly see. State of the art, art of the state. Regenerative registration is a thing of the future.