General Update

An update on the current progress of projects and general things here at KRT. I’ve set about checking out TypeScript for using in projects. It looks good, has some hidden pitfalls on finding .m.ts files for underscore for example, but in general looks good. I’m running it over some JS to get more of a feel. The audio VST project is moving slowly, at oscillators at the moment, with filters being done. I am looking into cache coherence algorithms and strategies to ease hardware design at the moment too. The 68k2 document mentioned in previous post is expanding with some of these ideas in having a “stall on value match” register, with a “touch since changed” bit in each cache line.

All good.

The Processor Design Document in Progress

TypeScript

Well I eventually managed to get a file using _.reduce() to compile without errors now. I’ll test it as soon as I’ve adapted in QUnit 2.0.1 so I can write my tests to the build as a pop up window, an perhaps back load a file to then be able to save the file from within the editor, and hence to become parser frame.

Representation

An excerpt from the 68k2 document as it’s progressing. An idea on UTF8 easy indexing and expansion.

“Reducing the size of this indexing array can recursively use the same technique, as long as movement between length encodings is not traversed for long sequences. This would require adding in a 2 length (11 bit form) and a 3 length (16 bit form) of common punctuation and spacing. Surrogate pair just postpones the issue and moves cache occupation to 25%, and not quite that for speed efficiency. This is why the simplified Chinese is common circa 2017, and surrogate processing has been abandoned in the Unicode specification, and replaced by characters in the surrogate representation space. Hand drawing the surrogates was likely the issue, and character parts (as individual parts) with double strike was considered a better rendering option.

UTF8 therefore has a possible 17 bit rendering for due to the extra bit freed by not needing a UTF32 representation. Should this be glyph space, or skip code index space, or a mix? 16 bit purity says skip code space. With common length (2 bit) and count (14 bit), allowing skips of between 16 kB and 48 kB through a document. The 4^th combination of length? Perhaps the representation of the common punctuation without character length alterations. For 512 specials in the 2 length form and 65536 specials in the 3 length forms. In UTF16 there would be issues of decode, and uniqueness. This perhaps is best tackled by some render form meta characters in the original Unicode space. There is no way around it, and with skips maybe UTF8 would be faster.”


// tool.js 1.1.1
// https://kring.co.uk
// (c) 2016-2017 Simon Jackson, K Ring Technologies Ltd
// MIT, like as he said. And underscored :D

import * as _ from 'underscore';

//==============================================================================
// LZW-compress a string
//==============================================================================
// The bounce parameter if true adds extra entries for faster dictionary growth.
// Usually LZW dictionary grows sub linear on input chars, and it is of note
// that after a BWT, the phrase contains a good MTF estimate and so maybe fine
// to append each of its chars to many dictionary entries. In this way the
// growth of entries becomes "almost" linear. The dictionary memory foot print
// becomes quadratic. Short to medium inputs become even smaller. Long input
// lengths may become slightly larger on not using dictionary entries integrated
// over input length, but will most likely be slightly smaller.

// DO NOT USE bounce (=false) IF NO BWT BEFORE.
// Under these conditions many unused dictionary entries will be wasted on long
// highly redundant inputs. It is a feature for pre BWT packed PONs.
//===============================================================================
function encodeLZW(data: string, bounce: boolean): string {
var dict = {};
data = encodeSUTF(data);
var out = [];
var currChar;
var phrase = data[0];
var codeL = 0;
var code = 256;
for (var i=1; i<data.length; i++) {
currChar=data[i];
if (dict['_' + phrase + currChar] != null) {
phrase += currChar;
}
else {
out.push(codeL = phrase.length > 1 ? dict['_'+phrase] : phrase.charCodeAt(0));
if(code < 65536) {//limit
dict['_' + phrase + currChar] = code;
code++;
if(bounce && codeL != code - 2) {//code -- and one before would be last symbol out
_.each(phrase.split(''), function (chr) {
if(code < 65536) {
while(dict['_' + phrase + chr]) phrase += chr;
dict['_' + phrase + chr] = code;
code++;
}
});
}
}
phrase=currChar;
}
}
out.push(phrase.length > 1 ? dict['_'+phrase] : phrase.charCodeAt(0));
for (var i=0; i<out.length; i++) {
out[i] = String.fromCharCode(out[i]);
}
return out.join();
}

function encodeSUTF(s: string): string {
s = encodeUTF(s);
var out = [];
var msb: number = 0;
var two: boolean = false;
var first: boolean = true;
_.each(s, function(val) {
var k = val.charCodeAt(0);
if(k > 127) {
if (first == true) {
first = false;
two = (k & 32) == 0;
if (k == msb) return;
msb = k;
} else {
if (two == true) two = false;
else first = true;
}
}
out.push(String.fromCharCode(k));
});
return out.join();
}

function encodeBounce(s: string): string {
return encodeLZW(s, true);
}

//=================================================
// Decompress an LZW-encoded string
//=================================================
function decodeLZW(s: string, bounce: boolean): string {
var dict = {};
var dictI = {};
var data = (s + '').split('');
var currChar = data[0];
var oldPhrase = currChar;
var out = [currChar];
var code = 256;
var phrase;
for (var i=1; i<data.length; i++) {
var currCode = data[i].charCodeAt(0);
if (currCode < 256) {
phrase = data[i];
}
else {
phrase = dict['_'+currCode] ? dict['_'+currCode] : (oldPhrase + currChar);
}
out.push(phrase);
currChar = phrase.charAt(0);
if(code < 65536) {
dict['_'+code] = oldPhrase + currChar;
dictI['_' + oldPhrase + currChar] = code;
code++;
if(bounce && !dict['_'+currCode]) {//the special lag
_.each(oldPhrase.split(''), function (chr) {
if(code < 65536) {
while(dictI['_' + oldPhrase + chr]) oldPhrase += chr;
dict['_' + code] = oldPhrase + chr;
dictI['_' + oldPhrase + chr] = code;
code++;
}
});
}
}
oldPhrase = phrase;
}
return decodeSUTF(out.join(''));
}

function decodeSUTF(s: string): string {
var out = [];
var msb: number = 0;
var make: number = 0;
var from: number = 0;
_.each(s, function(val, idx) {
var k = val.charCodeAt(0);
if (k > 127) {
if (idx < from + make) return;
if ((k & 128) != 0) {
msb = k;
make = (k & 64) == 0 ? 2 : 3;
from = idx + 1;
} else {
from = idx;
}
out.push(String.fromCharCode(msb));
for (var i = from; i < from + make; i++) {
out.push(s[i]);
}
return;
} else {
out.push(String.fromCharCode(k));
}
});
return decodeUTF(out.join());
}

function decodeBounce(s: string): string {
return decodeLZW(s, true);
}

//=================================================
// UTF mangling with ArrayBuffer mappings
//=================================================
declare function escape(s: string): string;
declare function unescape(s: string): string;

function encodeUTF(s: string): string {
return unescape(encodeURIComponent(s));
}

function decodeUTF(s: string): string {
return decodeURIComponent(escape(s));
}

function toBuffer(str: string): ArrayBuffer {
var arr = encodeSUTF(str);
var buf = new ArrayBuffer(arr.length);
var bufView = new Uint8Array(buf);
for (var i = 0, arrLen = arr.length; i < arrLen; i++) {
bufView[i] = arr[i].charCodeAt(0);
}
return buf;
}

function fromBuffer(buf: ArrayBuffer): string {
var out: string = '';
var bufView = new Uint8Array(buf);
for (var i = 0, arrLen = bufView.length; i < arrLen; i++) {
out += String.fromCharCode(bufView[i]);
}
return decodeSUTF(out);
}

//===============================================
//A Burrows Wheeler Transform of strings
//===============================================
function encodeBWT(data: string): any {
var size = data.length;
var buff = data + data;
var idx = _.range(size).sort(function(x, y){
for (var i = 0; i < size; i++) {
var r = buff[x + i].charCodeAt(0) - buff[y + i].charCodeAt(0);
if (r !== 0) return r;
}
return 0;
});

var top: number;
var work = _.reduce(_.range(size), function(memo, k: number) {
var p = idx[k];
if (p === 0) top = k;
memo.push(buff[p + size - 1]);
return memo;
}, []).join('');

return { top: top, data: work };
}

function decodeBWT(top: number, data: string): string { //JSON

var size = data.length;
var idx = _.range(size).sort(function(x, y){
var c = data[x].charCodeAt(0) - data[y].charCodeAt(0);
if (c === 0) return x - y;
return c;
});

var p = idx[top];
return _.reduce(_.range(size), function(memo){
memo.push(data[p]);
p = idx[p];
return memo;
}, []).join('');
}

//==================================================
// Two functions to do a dictionary effectiveness
// split of what to compress. This has the effect
// of applying an effective dictionary size bigger
// than would otherwise be.
//==================================================
function tally(data: string): number[] {
return _.reduce(data.split(''), function (memo: number[], charAt: string): number[] {
memo[charAt.charCodeAt(0)]++;//increase
return memo;
}, []);
}

function splice(data: string): string[] {
var acc = 0;
var counts = tally(data);
return _.reduce(counts, function(memo, count: number, key) {
memo.push(key + data.substring(acc, count + acc));
/* adds a seek char:
This assists in DB seek performance as it's the ordering char for the lzw block */
acc += count;
}, []);
}

//=====================================================
// A packer and unpacker with good efficiency
//=====================================================
// These are the ones to call, and the rest sre maybe
// useful, but can be considered as foundations for
// these functions. some block length management is
// built in.
function pack(data: any): any {
//limits
var str = JSON.stringify(data);
var chain = {};
if(str.length > 524288) {
chain = pack(str.substring(524288));
str = str.substring(0, 524288);
}
var bwt = encodeBWT(str);
var mix = splice(bwt.data);

mix = _.map(mix, encodeBounce);
return {
top: bwt.top,
/* tally: encode_tally(tally), */
mix: mix,
chn: chain
};
}

function unpack(got: any): any {
var top: number = got.top || 0;
/* var tally = got.tally; */
var mix: string[] = got.mix || [];

mix = _.map(mix, decodeBounce);
var mixr: string = _.reduce(mix, function(memo: string, lzw: string): string {
/* var key = lzw.charAt(0);//get seek char */
memo += lzw.substring(1, lzw.length);//concat
return memo;
}, '');
var chain = got.chn;
var res = decodeBWT(top, mixr);
if(_.has(chain, 'chn')) {
res += unpack(chain.chn);
}
return JSON.parse(res);
}

68k Continued …

A Continuation as it was Getting Long

The main thing in any 64 bit system is multi-processing. Multi-threading has already been covered. The CAS instruction is gone, and cache coherence is a big thing. So a supervisor level mutex? This is an obvious need. The extra long condition code register? How about a set of bits to set, and a stall if not zero? The bits could count down to zero over a number of cycles, leaving an opportunity to spin lock any memory location. Putting it in the status or condition code registers avoids the chip level cache shuffle. A non supervisor version would help user tasks. This avoids the need for atomic operations to a large extent, enough to not need them.

The fact a cache can reset pre-filled with “high memory” garbage, and not need empty bits, saves a little, but does need a little care on the compliance of boot sequence. A write back to the cache causes a cross core invalidate in most cache designs, There is an argument to set some status bit for ease of implementation. Resetting just the cache line would work, but remove a small section of memory from the 64 bit address space. A data invalidation queue would be useful to assist in the latency of reset to some synchronous opportunity, the countdown stall assisting in queue size management. As simultaneous write is a race condition, and a fail, by simultaneous deletion, the chip level mutexes must be used correctly. For the case where a cache load has to be performed, a double mutex count lock might have to be done. This infers that keeping the CPU ID somewhere to speed the second mutex lock might be beneficial.

Check cached, maybe repeat, set global, check cached, check global, maybe repeat, set cached, do is OK. A competing lock would fail maybe on the set cached if a time slice occurred just before it. An interrupt delay circuit would be needed for a number of instructions when the global is checked. The common access to the value either sets a stall timer or an interrupt stall timer, or a common timer register, with both behaviours. A synchronization window. Of course a badly written code piece could just set the cached, and ruin everything. But write range bounding would prevent this.

The next issue would be to sort out duplicating a read copy of a cache line into a local cache. This is so likely to be shared memory with the way software should be working. No process shares a cache line otherwise by sensible design of software. A read should get a clone from memory (to not clog a cache transfer bus if the other cache has not written). A cache should check for another cache written dirty, and send a read copy. A write should cause a delete invalidation on the other caches. If locks are correctly written this will preserve all writes. The cache bus only then has to send dirty copies, and in validations. Packet formats are then just an address, an RW bit, and the data width of a cache line (the last part just on the return bus), and the RW bit on the return bus is not used.

What happens when a second write happens, and a read only copy is in transit? It is invalid on arrival, but not responsible for any write back. This is L2 cache here. The L1 cache can also be data invalidated, but can stand the read delay. Given the write invalidate strategy, the packet in transit can be turned into an invalidate packet. The minor point is the synchronous assignment to the cache of the read copy at the same bus cycle edge as the write. This just needs a little logic to prevent this by “special address” forwarding. A sort of cancel on execute as it were.

It could be argued that sending over a read only copy is a bad idea and wastes by over connecting the caches. But to not send it would result in a L3 fetch of something not yet written to L3 yet, or the other option would be to stall based on address until it exits the cache on the other processor from under use of the associative address. That could take a very long time. The final issue is closing the mutex. The procedure is the same as opening, but using a different value to set cached. The mutex needs to be flushed? Nope, as the check cached will send a read only copy, and the set cached will invalidate the other dirty.

I think that makes for a minimal logic L2 cache. The L3 cache can be shared, and the T and S caches do not need coherence. Any sensible code would not need this. The D cache need invalidation only. The I cache should not need anything. When data is written to memory for later use as instructions, there is perhaps an issue, also with self modifying code, which frankly should be ignored as an issue. The L2 cache should get written with code, and a fetch should get a transferred read only copy. There would be no expectation of another write to the same memory location after scheduling execution.

There should be some cache coherence for DMA. There should be no expectation of write to a DMA block before the DMA output transfer is complete. The DMA therefore needs its own L2 cache “simulation” to receive read only updates, and to invalidate when DMA does an input source read. It is only slow off chip IO which necessitates a flush to L3 and main memory. Such things if handled well can allow the write back queue to only have elements entered onto it when hitting the L2 eviction cache. Considering that there is a block of memory which signals cache empty, it makes sense to just pass this write directly out, and latch for immediate continuation of execution, and stall only if the external bus cycle is not complete on a second write to those addresses. The input read on those addresses have to stall by default if a simplistic ideology is taken.

A more complex method is to indicate a pre-fetch. In a similar way to the 1 item buffer. I hope your IO does not read trigger events (unlikely, but write triggering is not unheard of). A delay 1 item buffer does help with a bit of for knowledge, and the end of bus cycle latching into this delay slot can be used to continue processing and routine setup. Address latches internally help with the clock domain crossing. The only disadvantage to this is that the processor decides the memory mapped device layout. It would be of benefit to shuffle this slow bus over a serial protocol. This makes an external PLA, micro-controller or FPGA suitable for running the slow bus, and keeps PIN count on the main CPU lower, allowing main memory to be placed and routed closer to the core CPU “silicon chip” die.

L4 Cache

The concept of cloud as galactic cache is perhaps a thing that some are new to. There is the sequential stride static column idea, which is good for some processing and in effect gives the I cache the highest performance, and shows the sequential stride the best. For tasks that flood the D cache, which is most when heavy optimization is used, The question becomes “is there a need for associative L4 cache off chip?” for an effective use of some MB of static single cycle RAM? With a bus size of 32 bits data, and a 64 bit addressing system, the tag would exceed the data in consuming the SRAM. If the memory bus is just 32 bit addressing, with some DMA SD card trickery for the high word, this still makes the tag large, but less than 50% of the SRAM usage. Burst mode in this sense is auto increment on the low addresses within the SRAM chip to stop copper trace charge power wastage, and a tag check wait state and DRAM access generator. The fact that DRAM is accessed in blocks makes the tag shrink further.

Yes it’s true, DRAM should have “some” associative SRAM as well as static column banks. But the net effect on performance is minimal. It’s more of a L3 eviction cache extender. Things down to the RAM disk has “read only” or even “no action” system files on it for all the memory, makes file buffers actually be files. Most people would not appreciate this level of detail, but it does however allow for easy contraction and expansion of the statically sized RAM disk. D cache thrashing is the problem to solve. Make it bigger. The SDRAM issue can be solved by putting a fair amount of it there, and placing a cloud in higher (or different address space) memory. The SD card interface is then the location of the network interface. Each part of memory is then divided into 3 parts at this level. A direct part, and associative part assisting another cache level and a tag part. the tag part and the associative assist for an interleave partition. Browser caches are a system level feature, not an issue for application developers. The flush cache is “new disk, new net” and beyond.

How to present a file browser picture of the web? FTP sort of did it for files. And a bookmark and search view seem like a good way of starting out. There maybe should be an unknown folder with some entropicly selected default folders based on wordage, and the web becomes seen in scope. At the level of a site index, the “tool type” should render a page view of the “folder”. The source view may also be relevant for some. People have seen this before, or close to it, and made some interesting research tools. I suppose it does not add up to profit per say, but has much more context in robot living allowance world.

This is the end of part II, and maybe more …

Addressing (An, Dn<<{size–16*SHS}.{DS<<size}, d8/24) so that the 2 bit DS field indexes one of 4 .W or larger.fields, truncated to .B, .W, .L or .Q with apparently even more options to spare. If SHS == 3 for example the DS > 1 have no extra effect. I’ll think on this (18th Feb 2017).
If SHS == 2 then DS == 3 has no effect. If SHS == 1 then DS == 3 does have an effect.
This can provide for 3 extra addressing modes not yet developed.

68k2-PC#d12 It’s got better! A fab addressing mode. More then 12 bits of embed-able opcode space remains in a 32 bit wide opcode extension, and almost all the 16 bit opcodes are used (all the co-processor F line slots). With not 3, but 10 extra addressing modes.