Referring to a previous hybrid BWT/LZW compression method I have devised, the dictionary of the LZW can be stored in chain linked fixed size structure arrays one character (the symbol end) back linking to the first character through a chain. This makes efficient symbol indexing based on number, and with the slight addition of two extra pointers, a set of B-trees can be built separated by symbol length to also be loaded in inside parallel arrays for fast incremental finding of the existence of a symbol. A 16 bucket move to front hash table could also be used instead of a B-tree, depending on the trade off between memory of a 2 pointer B-tree, or a 1 pointer MTF collision hash chain.
On the nature of the BWT size, and the efficiency. Using the same LZW dictionary across multiple BWT blocks with the same suffix start character is effective with a minor edge effect, rapidly reducing in percentage as the block size increases. An interleave reordering such that the suffix start character is the primary group by of linearity, assists in the scan for serachability. The fact that a search can be rephrased as a join on various character pairings, the minimal character pair can be scanned up first, and “joined” to the end of the searched for string, and then joined to the beginning in a reverse search, to then pull all the matches sequentially.
Finding the suffixes in the LZW structure is relatively easy to produce symbol codes, to find the associated set of prefixes and infixes is a little more complex. A mostly constant search string can be effectively compiled and searched. A suitable secondary index extension mapping symbol sequences to “atomic” character sequences can be constructed to assist in the transform of characters to symbol dictionary index code tuples. This is a second level table in effect, which can be also compressed for atom specific search optimization without the LZW dictionary loading without find.
The fact the BWT infers an all matches sequential nature, and a second level of BWT with the dictionary index codes as the alphabet could defiantly reduce the needed scan time for finding each LZW symbol index sequence. Perhaps a unified B-tree as well as the length specific B-tree within the LZW dictionary would be useful for greater and less than constraints.
As the index can become a self index, there maybe a need to represent a row number along side the entry. Multi column indexes, or primary index keys would then best be likely represented as pointer tuples, with some minor speed size data duplication in context.
An extends chain pointer and a first of extends is not required, as the next length B-tree will part index all extenders. A root pointer to the extenders and a secondary B-tree on each entry would speed finding all suffix or contained in possibilities. Of course it would be best to place these 3 extra pointers in a parallel structure so not to be data interleaved array of struct, but struct of array, when dynamic compilation of atomics is required.
The find performance will be slower than an uncompressed B-tree, but the compression is useful to save storage space. The fact that the memory is used more effectively when compression is used, can sometimes lead to improved find performance for short matches, with a high volume of matches. An inverted index can use the position index of the LZW symbol containing the preceding to reduce the size of the pointers, and the BWT locality effect can reduce the number of pointers. This is more standard, and combined with the above techniques for sub phases or super phrases should give excellent find performance. For full record recovery, the found LZW symbols only provide decoding in context, and the full BWT block has to be decoded. A special reserved LZW symbol could precede a back pointer to the beginning of the BWT block, and work as a header of the post placed char count table and BWT order count.
So finding a particular LZW symbol in a block, can be iterated over, but the difficulty in speed is when the and condition comes in on the same inverse index. The squared time performance can be reduced? Reducing the number and size of the pointers in some ways help, but it does not reduce the essential scan and match nature of the time squared process. Ordering the matching to the “find” with least number on the count makes the iteration smaller on average, as it will be the least found, and hence least joined. The limiting of the join set to LZW symbols seems like it will bloom many invalid matches to be filtered, and in essence simplistically it does. But the lowering of the domain size allows application of some more techniques.
The first fact is the LZW symbols are in a BWT block subgroup based on the following characters. Not that helpful but does allow a fast filter, and less pointers before a full inverse BWT has to be done. The second fact is that the letter pair frequency effectively replaces the count as the join order priority of the and. It is further based on the BWT block subgroup size and the LZW symbol character counts for calculation of a pre match density of a symbol, this can be effectively estimated via statistics, and does not need a fetch of the actual subgroup size. In collecting multiple “find” items correlations can also be made on the information content of each, and a correlated but rarer “find” may be possible to substitute, or add in. Any common or un correlated “find” items should be ignored. Order by does tend to ruin some optimizations.
A “find” item combination cache should be maintained based on frequency of use and execution time to rebuild result both used in the eviction strategy. This in a real sense is a truncated “and” index. Replacing order by by some other method of such as order float, such that guaranteed order is not preserved, but some semblance of polarity is run. This may also be very useful to reduce sort time, and prevent excessive activity and hence time spent when limit clauses are used. The float itself should perhaps be record linked, with an MTF kind of thing in the inverse index.