Gradients and Descents

Consider a backpropagation pass that has just been applied to a network under training. Different weights will have changed by different amounts. A weight that changes little can be considered good. A weight that changes a lot can be considered an essential definer weight. Take the maximal definer weight (the one with the greatest change) and push it a further per cent in its defined direction. Feed the network forward and backpropagate again. Many of the good weights will move back closer to where they were before the definer pass and can be considered excellent. Others will deviate further and can be considered merely ok.

A signed tally over the classes definer(3)/excellent(0)/good(1)/ok(2) can be stored as a program variable in each neuron. The per cent nudge applied to a definer, or more explicitly the product of the definer’s history of deviations used as a weighting on that per cent in the definer’s direction, forms a training map which is not needed for using the net after training is finished. It does, however, enable further processing such as “excellent definer” detection. What does that mean?
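
As a minimal sketch of this bookkeeping (in Python, with backprop_delta standing in for a hypothetical single backpropagation pass over one neuron's weight vector, and with the intermediate good(1) label collapsed into the second-pass split):

    import numpy as np

    def classify_and_tally(w, backprop_delta, nudge=0.01):
        d1 = backprop_delta(w)                        # deltas from the first backprop pass
        definer = int(np.argmax(np.abs(d1)))          # maximal definer weight
        w1 = w + d1
        # push the definer a further per cent in its defined direction
        w1[definer] += nudge * np.sign(d1[definer]) * abs(w1[definer])
        d2 = backprop_delta(w1)                       # feed forward and backpropagate again
        # weights heading back toward their old values become excellent(0),
        # those deviating further are merely ok(2); the definer itself scores 3
        moved_back = np.sign(d2) != np.sign(d1)
        scores = np.where(moved_back, 0, 2)
        scores[definer] = 3
        # signed here by the direction of the first-pass change (one possible convention)
        tally = int(np.sum(np.sign(d1) * scores))
        return scores, tally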

In a continual learning system, it indicates a new rationale requirement for the problem, as an unexpected change has developed in an excellently performing neuron. The tally itself could also be considered an auxiliary output of any neuron, but what would be a suitable backpropagation for it? Why would it even need one? Is it not just another round of input to the network (perhaps not applied to the first layer, but then inputs don’t always have to be)?

Defining the concept of definer epilepsy, where a definer oscillates due to weight gradient magnification, implies that the tally needs to be a signed quantity and also that weight normalization toward zero should be present. This requires (though it has not been proven to be the only sufficient condition) that a per cent growth away from zero should be weighted slightly less than a per cent reduction toward zero. This can be factored into an asymmetry stability meta-parameter.
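
A tiny sketch of that asymmetry, with asym as the assumed stability meta-parameter (anything below 1 damps growth away from zero relative to shrinkage toward it):

    def apply_percent_step(w, percent, asym=0.95):
        # proposed signed step as a fraction of the current magnitude
        step = percent * abs(w)
        grows_magnitude = abs(w + step) > abs(w)
        # growth away from zero is weighted slightly less than reduction toward zero
        return w + (asym * step if grows_magnitude else step)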

A net of this form can have memory. The oscillation of definer neurons can represent state information. They can also define the modality of the net’s knowledge in application readiness while keeping the excellent all-purpose neurons stable. The next step is physical and affine coder estimators.

Limit Sums

The convergence sequence of a weight can be considered isomorphic to series acceleration of a limit sum. The net can be “thrown” forward to an estimate of an infinite number of training cycles on the examples. Effectiveness can be evaluated, and data estimated on a “window” over the sum as an inner product on weightings, with bounds control mechanisms yet to be confirmed. PID control systems suggest, as a first estimate, that differentials and integrals of the error are appropriate factors to measure for reducing error and increasing convergence speed.
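
As one hedged reading of the PID suggestion, a controller over the training error could modulate the per cent applied to the definers; the gains here are placeholders, not recommendations:

    class PercentPID:
        def __init__(self, kp=0.1, ki=0.01, kd=0.05):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.integral = 0.0
            self.prev_error = None

        def update(self, error):
            # proportional + integral + differential of the error drive the per cent
            self.integral += error
            deriv = 0.0 if self.prev_error is None else error - self.prev_error
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * deriv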

Dynamics on the per cent definers, so to speak. And it came to pass that adaptivity increased and performance metrics were good, but then became irrelevant as newer, better, more relevant ones took hold from the duties of the net. Gundup and Ciders Incorporated had a little hindsight problem to solve.

Fractal Affine Representation

Going back to 1991 and Michael Barnsley developing a fractal image compression system (the Iterated Systems FIF file format): the process was considered computationally time-intensive for very good compression. Experiments with the FIASCO compression system, an open-source derivative, indicate the best trade-off lies at low quality (about 50%), which is very fast but not exact. If the compressed image is subtracted from the input image and the residual further compressed a number of times, performance improves dramatically.
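
A sketch of that residual loop follows; lossy_encode and lossy_decode are crude quantisation stand-ins for a fast, inexact fractal codec and are not the real FIASCO API:

    import numpy as np

    def lossy_encode(image, step):
        return np.round(image / step)          # stand-in: coarse, fast, not exact

    def lossy_decode(code, step):
        return code * step

    def residual_compress(image, rounds=3, step=16.0):
        layers, residual = [], image.astype(np.float64)
        for _ in range(rounds):
            code = lossy_encode(residual, step)
            layers.append((code, step))
            residual = residual - lossy_decode(code, step)   # keep what the codec missed
            step /= 2.0                                      # later rounds refine the residual
        return layers

    def residual_decompress(layers):
        # summing the decoded layers reconstructs the image progressively
        return sum(lossy_decode(code, step) for code, step in layers)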

Dissociating secondaries and tertiaries from the primary affine set allows disjunct affine sets to be constructed for equivalent compression performance, where even a zip compression can remove further information redundancy. The affine sets can be used as input to a network, and in some sense the net can develop a sort of affine invariance in the processed fractals. The data reduction of the affine compression is also likely to lead to better utilization of the net than a convolutional CNN.

The Four Colour Disjunction Theorem

Consider an extended ensemble. The first layer could be considered a fully connected distributor. The last layer could be considered to unify the output by being fully connected. Intermediate layers can be either fully connected or colour-limited connected, where only neurons of a colour connect to neurons of the same colour in the next layer. This provides disjunction of weights between layers and removes a competition on the gradient between colours.
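
A minimal sketch of a colour-limited intermediate layer, assuming a simple round-robin colour assignment and a mask applied to an ordinary weight matrix:

    import numpy as np

    def colour_mask(n_in, n_out, colours=4):
        in_col = np.arange(n_in) % colours          # colour assignment of inputs
        out_col = np.arange(n_out) % colours        # colour assignment of outputs
        # only same-colour neurons connect between layers
        return (in_col[None, :] == out_col[:, None]).astype(np.float64)

    def colour_limited_forward(x, W, mask):
        # gradients through masked-out entries stay zero, so colour partitions
        # do not compete on the gradient between layers
        return np.tanh((W * mask) @ x)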

Four is really just a way of seeing the colour partition and does not have to be four. Is an ensemble of two nets of half size better for the same time and space complexity of computation, with a lower accuracy in any one colour channel but higher total discriminatory performance through the disjunction of the feature detection?

The leaking of cross information can also be reduced if the feature sets are considered disjunct. A feature under low to no detection would not bleed into features under medium to high activation. Is the concept of grouped quench useful?

Query Key Transformer Reduction

From a switching idea in telecommunications, an N*N array can be reduced (mostly functionally, due to sparsity) to a pair of N*L arrays and an L*L array. Any cross-product, from its routing of an input to an output, essentially becomes a set of three sequential routings, with the first and last being the compression and expansion multiplexes to and from the smaller switch. Crosstalk grows to some extent, but this “bleed” of attention is a small consideration given the variance spread of having three routing weights multiplied into the one effective weight, and computation is less because L is a smaller number than N.
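
A rough sketch of the reduction, with arbitrary N and L and random matrices standing in for learned routing weights:

    import numpy as np

    N, L = 512, 64                                   # L much smaller than N
    compress = np.random.randn(L, N) / np.sqrt(N)    # N -> L multiplex
    core     = np.random.randn(L, L) / np.sqrt(L)    # the smaller L*L switch
    expand   = np.random.randn(N, L) / np.sqrt(L)    # L -> N multiplex

    def routed(x):
        # three sequential routings whose product stands in for one N*N weight
        return expand @ (core @ (compress @ x))

    # parameter count: 2*N*L + L*L versus N*N for the full array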

The Giant Neuron Hypothesis

Considering that the output stage of a neuronal model is a level-sliced integrator of sorts, the construction of RNN cells would seem obvious. The hypothesis asks whether it is logical to consider the layers before an “integration” layer effectively an input stage, where the whole network is one gigantic neuron and integration is performed over various nonlinear functions. Each integration channel can be considered independent, but could also have post layers for further joining of integral terms. The integration time can be considered another input set, one per integrator functional. To maintain tensor shape when two inputs per integrator are supplied, the first differential would also be good to include, especially where feedback can be applied.

This leads to the idea of the silicon connectome. Then as now, integration was the nonlinearity of choice in time (a softmax divided by its variable, along the lines of (e^x - 1)/x; a groovemax, if you will). The extra unineuron integration layer of the net offers the extra time feature of future estimation at an endpoint integral of network-evolved choice. The complexity lies in backpropagation of the limit sum through fixed constants and differentiable functions, for a zero-adjustable layer insert with scaled estimation of earlier weight adjustments on previous samples in the time series under integration, for an ideal propagatable.
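
Taking the groovemax as (e^x - 1)/x, a small sketch with the removable singularity at zero handled explicitly:

    import numpy as np

    def groovemax(x):
        # (e^x - 1) / x, with the limit value 1 substituted at x = 0
        x = np.asarray(x, dtype=np.float64)
        safe = np.where(x == 0.0, 1.0, x)      # avoid 0/0 at the origin
        return np.where(x == 0.0, 1.0, np.expm1(x) / safe)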

This network idea is not necessarily recursive; it may just be an applied network with a global time delta since the last evaluation, for continuing the processing of time series information. The actual recursive use of networks with GRU and LSTM cells might benefit from this kind of global integration processing, but can GRU and LSTM be improved? Bistable cells say yes, providing a kind of registered sequential logic on the combinationals. Considering that a Moore state machine layout might be more reductionist toward efficiency, a kind of register layer pair for production and consumption to bracket the net is under consideration.

The producer layer is easily made differentiable by being a weighted sum junction between the input and the feedback from the consumer layer. The consumer layer is more complex when differentiability is considered. The consumer register could really be replaced by a zeroth-differential prediction of the future sample given past samples. This has the interesting property of pseudo-presenting the output of the network as a consumptive of the input, which allows the output to be used in the backpropagation as input when learning the feedback weights. The consumer must be passthrough from its input to its output, while storage of samples for predictive differential generation is allowed.
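
A hedged sketch of the producer/consumer register pair, with the consumer kept as a passthrough that stores the last sample as its zeroth-differential prediction:

    import numpy as np

    class ProducerConsumer:
        def __init__(self, size, mix=0.5):
            self.mix = mix                       # weighting between input and feedback
            self.stored = np.zeros(size)         # consumer's stored sample

        def produce(self, x):
            # differentiable weighted sum of input and fed-back consumer state
            return self.mix * x + (1.0 - self.mix) * self.stored

        def consume(self, y):
            self.stored = y                      # store for predictive differential generation
            return y                             # passthrough, input to output

        def predict_next(self):
            return self.stored                   # zeroth-differential prediction of the next sample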

So it’s really some kind of propagational Mealy state machine. An MNN, if you’d kindly see. State of the art, art of the state. Regenerative registration is a thing of the future.

Block Tree Topological Proof of Work

Given that a blockchain has a limited entry rate onto the chain due to the block uniqueness constraint, a more logical mass blocking system would use a tree graph to place many leaf blocks on the tree at once. This can be done by folding the leading edge of the tree back onto random previous blocks to achieve a number of virtual pointer rings, setting each joined pair of blocks as a new node in an Euler number mapping, with a competition on genus and closure of the tree head leaf list to match block use demand.

The coin, as it were, is the genus topology, with weighted construction ownership of node value. The data decides part of the selection of the tree leaf node loop-back pointers; the randomness allows a spread of topological properties in the proof-of-work space.
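
A toy illustration of the fold, not a protocol: each new leaf carries its parent link plus a few back pointers to randomly chosen earlier blocks, forming the virtual pointer rings mentioned above.

    import hashlib, random

    def make_leaf(parent_hash, payload, prior_hashes, folds=2):
        # fold the leading edge back onto a few randomly chosen earlier blocks
        back = random.sample(prior_hashes, min(folds, len(prior_hashes)))
        header = parent_hash + b"".join(back) + payload
        return {
            "hash": hashlib.sha256(header).digest(),
            "parent": parent_hash,
            "folds": back,            # loop-back pointers onto previous blocks
            "payload": payload,
        }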

LZW (Perhaps with Dictionary Acceleration) Dictionaries in O(m) Memory

Referring to a previous hybrid BWT/LZW compression method I have devised, the LZW dictionary can be stored in chain-linked fixed-size structure arrays, one character (the symbol end) per entry, back-linking to the first character through a chain. This makes symbol indexing by number efficient, and with the slight addition of two extra pointers, a set of B-trees separated by symbol length can be built, also loaded inside parallel arrays, for fast incremental checking of the existence of a symbol. A 16-bucket move-to-front hash table could also be used instead of a B-tree, depending on the trade-off between the memory of a 2-pointer B-tree and a 1-pointer MTF collision hash chain.
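
A sketch of that layout in Python, with plain lists standing in for the fixed-size parallel arrays (the B-tree and MTF hash acceleration are left out):

    class LZWDict:
        def __init__(self):
            self.last_char = []      # symbol-end character per entry
            self.back_link = []      # index of the prefix symbol (-1 for roots)
            for c in range(256):     # single-character root symbols
                self.add(-1, chr(c))

        def add(self, prefix_index, char):
            self.last_char.append(char)
            self.back_link.append(prefix_index)
            return len(self.last_char) - 1      # the new symbol's number

        def text(self, index):
            # walk the back-link chain from the symbol end to the first character
            chars = []
            while index != -1:
                chars.append(self.last_char[index])
                index = self.back_link[index]
            return "".join(reversed(chars))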

On the nature of the BWT size and its efficiency: using the same LZW dictionary across multiple BWT blocks with the same suffix start character is effective, with a minor edge effect that rapidly reduces in percentage as the block size increases. An interleave reordering such that the suffix start character is the primary group-by of linearity assists the scan for searchability. Since a search can be rephrased as a join on various character pairings, the rarest character pair can be scanned first, “joined” toward the end of the searched-for string, and then joined toward the beginning in a reverse search, to then pull all the matches sequentially.
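
A hedged sketch of the rarest-pair join over an ordinary in-memory text, standing in for the BWT/LZW machinery, to show the scan-then-extend shape of the search:

    from collections import defaultdict

    def build_pair_index(text):
        index = defaultdict(list)
        for i in range(len(text) - 1):
            index[text[i:i + 2]].append(i)       # positions of each adjacent character pair
        return index

    def find(text, index, query):
        if len(query) < 2:
            return [i for i in range(len(text)) if text.startswith(query, i)]
        pairs = [query[i:i + 2] for i in range(len(query) - 1)]
        k = min(range(len(pairs)), key=lambda i: len(index[pairs[i]]))   # rarest pair first
        hits = []
        for pos in index[pairs[k]]:
            start = pos - k
            if start >= 0 and text[start:start + len(query)] == query:  # join toward both ends
                hits.append(start)
        return sorted(hits)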

Finding the suffixes in the LZW structure to produce symbol codes is relatively easy; finding the associated set of prefixes and infixes is a little more complex. A mostly constant search string can be effectively compiled and searched. A suitable secondary index extension mapping symbol sequences to “atomic” character sequences can be constructed to assist in the transform of characters into tuples of symbol dictionary index codes. This is in effect a second-level table, which can also be compressed for atom-specific search optimization without loading the LZW dictionary for the find.

The BWT implies an all-matches-sequential nature, and a second level of BWT with the dictionary index codes as the alphabet could definitely reduce the scan time needed to find each LZW symbol index sequence. Perhaps a unified B-tree, as well as the length-specific B-trees within the LZW dictionary, would be useful for greater-than and less-than constraints.

As the index can become a self index, there may be a need to represent a row number alongside the entry. Multi-column indexes, or primary index keys, would then likely best be represented as pointer tuples, with some minor speed/size data duplication in context.

An extends-chain pointer and a first-of-extends pointer are not required, as the next-length B-tree will partly index all extenders. A root pointer to the extenders and a secondary B-tree on each entry would speed up finding all suffix or contained-in possibilities. Of course, it would be best to place these three extra pointers in a parallel structure, so the data is not an interleaved array of structs but a struct of arrays, when dynamic compilation of atomics is required.

Find performance will be slower than an uncompressed B-tree, but the compression is useful to save storage space. The fact that memory is used more effectively when compression is applied can sometimes lead to improved find performance for short matches with a high volume of matches. An inverted index can use the position index of the LZW symbol containing the preceding text to reduce the size of the pointers, and the BWT locality effect can reduce the number of pointers. This is more standard, and combined with the above techniques for sub-phrases or super-phrases should give excellent find performance. For full record recovery, the found LZW symbols only provide decoding in context, and the full BWT block has to be decoded. A special reserved LZW symbol could precede a back pointer to the beginning of the BWT block, working as a header for the post-placed character count table and BWT order count.

So finding a particular LZW symbol in a block can be iterated over, but the difficulty in speed comes when an AND condition arrives on the same inverse index. Can the squared time performance be reduced? Reducing the number and size of the pointers helps in some ways, but it does not reduce the essential scan-and-match nature of the time-squared process. Ordering the matching so the “find” term with the lowest count goes first makes the iteration smaller on average, as it will be the least found and hence the least joined. Limiting the join set to LZW symbols seems like it will bloom many invalid matches to be filtered, and simplistically it does. But the lowering of the domain size allows some further techniques to be applied.

The first fact is that the LZW symbols are in a BWT block subgroup based on the following characters. Not that helpful, but it does allow a fast filter and fewer pointers before a full inverse BWT has to be done. The second fact is that the letter-pair frequency effectively replaces the count as the join-order priority of the AND. This is further based on the BWT block subgroup size and the LZW symbol character counts, for calculation of a pre-match density of a symbol; this can be effectively estimated via statistics and does not need a fetch of the actual subgroup size. In collecting multiple “find” items, correlations can also be made on the information content of each, and a correlated but rarer “find” may be substituted or added in. Any common or uncorrelated “find” items should be ignored. ORDER BY does tend to ruin some optimizations.
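
A small sketch of the join-ordering point, with plain sorted posting lists standing in for the inverse index:

    def and_join(posting_lists):
        # least found, least joined: start the AND from the rarest term
        ordered = sorted(posting_lists, key=len)
        result = set(ordered[0])
        for postings in ordered[1:]:
            result &= set(postings)
            if not result:
                break                           # early out once the intersection is empty
        return sorted(result)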

A “find” item combination cache should be maintained, with both frequency of use and the execution time to rebuild the result used in the eviction strategy. This is in a real sense a truncated AND index. ORDER BY could be replaced by some other method such as an order float, where guaranteed order is not preserved but some semblance of polarity is maintained. This may also be very useful to reduce sort time and prevent excessive activity, and hence time spent, when LIMIT clauses are used. The float itself should perhaps be record-linked, with an MTF kind of thing in the inverse index.
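
A sketch of such a cache, with eviction scored by frequency of use multiplied by the measured rebuild time:

    import time

    class FindCache:
        def __init__(self, capacity=64):
            self.capacity = capacity
            self.entries = {}                  # key -> [result, uses, build_time]

        def get(self, key, rebuild):
            if key in self.entries:
                self.entries[key][1] += 1      # count the reuse
                return self.entries[key][0]
            start = time.perf_counter()
            result = rebuild()
            cost = time.perf_counter() - start
            if len(self.entries) >= self.capacity:
                # evict the entry that is cheapest to lose: rarely used and quick to rebuild
                victim = min(self.entries,
                             key=lambda k: self.entries[k][1] * self.entries[k][2])
                del self.entries[victim]
            self.entries[key] = [result, 1, cost]
            return result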