CPU – KRT

General Update

An update on the current progress of projects and general things here at KRT. I’ve set about checking out TypeScript for use in projects. It has some hidden pitfalls, finding the .d.ts declaration files for underscore for example, but in general it looks good. I’m running it over some JS to get more of a feel. The audio VST project is moving slowly, at oscillators at the moment, with filters being done. I am also looking into cache coherence algorithms and strategies to ease hardware design at the moment. The 68k2 document mentioned in the previous post is expanding with some of these ideas, having a “stall on value match” register, with a “touch since changed” bit in each cache line.

All good.

The Processor Design Document in Progress

TypeScript

Well, I eventually managed to get a file using _.reduce() to compile without errors. I’ll test it as soon as I’ve adapted in QUnit 2.0.1, so I can write my tests to the build as a pop up window, and perhaps back load a file to then be able to save the file from within the editor, and hence become a parser frame.

Representation

An excerpt from the 68k2 document as it’s progressing. An idea on UTF8 easy indexing and expansion.

“Reducing the size of this indexing array can recursively use the same technique, as long as movement between length encodings is not traversed for long sequences. This would require adding in a 2 length (11 bit form) and a 3 length (16 bit form) of common punctuation and spacing. Surrogate pair just postpones the issue and moves cache occupation to 25%, and not quite that for speed efficiency. This is why the simplified Chinese is common circa 2017, and surrogate processing has been abandoned in the Unicode specification, and replaced by characters in the surrogate representation space. Hand drawing the surrogates was likely the issue, and character parts (as individual parts) with double strike was considered a better rendering option.

UTF8 therefore has a possible 17 bit rendering due to the extra bit freed by not needing a UTF32 representation. Should this be glyph space, or skip code index space, or a mix? 16 bit purity says skip code space. With common length (2 bit) and count (14 bit), allowing skips of between 16 kB and 48 kB through a document. The 4th combination of length? Perhaps the representation of the common punctuation without character length alterations. For 512 specials in the 2 length form and 65536 specials in the 3 length forms. In UTF16 there would be issues of decode, and uniqueness. This perhaps is best tackled by some render form meta characters in the original Unicode space. There is no way around it, and with skips maybe UTF8 would be faster.”
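The 16 bit skip code described in the excerpt can be sketched as below. This is only an illustration under assumptions: the field order (length in the top 2 bits, count in the low 14), the mapping of length codes 0 to 2 onto 1 to 3 byte characters, and the names `packSkip` and `unpackSkip` are hypothetical and not taken from the 68k2 document. The fourth length code is left reserved, since the excerpt leaves its meaning open.

```typescript
// Hypothetical layout: bits 15..14 = length code, bits 13..0 = character count.
const COUNT_BITS = 14;
const COUNT_MASK = (1 << COUNT_BITS) - 1; // 0x3FFF, the largest count

// Pack a run of `count` characters, each `byteLen` bytes long in UTF8.
function packSkip(byteLen: 1 | 2 | 3, count: number): number {
  if (Math.floor(count) !== count || count < 0 || count > COUNT_MASK) {
    throw new RangeError('count must be an integer in 0..16383');
  }
  return ((byteLen - 1) << COUNT_BITS) | count;
}

// Unpack a skip code and work out how many bytes it jumps over.
function unpackSkip(code: number): { byteLen: number; count: number; bytes: number } {
  const lenCode = (code >>> COUNT_BITS) & 3;
  if (lenCode === 3) {
    // Assumption: the 4th combination is reserved (perhaps for common punctuation).
    throw new Error('reserved length code');
  }
  const byteLen = lenCode + 1;
  const count = code & COUNT_MASK;
  return { byteLen: byteLen, count: count, bytes: byteLen * count };
}
```

With the largest count of 16383, a run of 1 byte characters skips about 16 kB and a run of 3 byte characters about 48 kB, which lines up with the range quoted in the excerpt.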


// tool.js 1.1.1
// https://kring.co.uk
// (c) 2016-2017 Simon Jackson, K Ring Technologies Ltd
// MIT, like as he said. And underscored :D

import * as _ from 'underscore';

//==============================================================================
// LZW-compress a string
//==============================================================================
// The bounce parameter if true adds extra entries for faster dictionary growth.
// Usually LZW dictionary grows sub linear on input chars, and it is of note
// that after a BWT, the phrase contains a good MTF estimate and so maybe fine
// to append each of its chars to many dictionary entries. In this way the
// growth of entries becomes "almost" linear. The dictionary memory foot print
// becomes quadratic. Short to medium inputs become even smaller. Long input
// lengths may become slightly larger on not using dictionary entries integrated
// over input length, but will most likely be slightly smaller.

// DO NOT USE bounce (=false) IF NO BWT BEFORE.
// Under these conditions many unused dictionary entries will be wasted on long
// highly redundant inputs. It is a feature for pre BWT packed PONs.
//===============================================================================
function encodeLZW(data: string, bounce: boolean): string {
var dict = {};
data = encodeSUTF(data);
var out = [];
var currChar;
var phrase = data[0];
var codeL = 0;
var code = 256;
for (var i=1; i<data.length; i++) {
currChar=data[i];
if (dict['_' + phrase + currChar] != null) {
phrase += currChar;
}
else {
out.push(codeL = phrase.length > 1 ? dict['_'+phrase] : phrase.charCodeAt(0));
if(code < 65536) {//limit
dict['_' + phrase + currChar] = code;
code++;
if(bounce && codeL != code - 2) {//code -- and one before would be last symbol out
_.each(phrase.split(''), function (chr) {
if(code < 65536) {
while(dict['_' + phrase + chr]) phrase += chr;
dict['_' + phrase + chr] = code;
code++;
}
});
}
}
phrase=currChar;
}
}
out.push(phrase.length > 1 ? dict['_'+phrase] : phrase.charCodeAt(0));
for (var i=0; i<out.length; i++) {
out[i] = String.fromCharCode(out[i]);
}
return out.join('');//no separator, as the default ',' would corrupt the code stream
}

function encodeSUTF(s: string): string {
s = encodeUTF(s);
var out = [];
var msb: number = 0;
var two: boolean = false;
var first: boolean = true;
_.each(s, function(val) {
var k = val.charCodeAt(0);
if(k > 127) {
if (first == true) {
first = false;
two = (k & 32) == 0;
if (k == msb) return;
msb = k;
} else {
if (two == true) two = false;
else first = true;
}
}
out.push(String.fromCharCode(k));
});
return out.join('');//no separator, as the default ',' would insert commas
}

function encodeBounce(s: string): string {
return encodeLZW(s, true);
}

//=================================================
// Decompress an LZW-encoded string
//=================================================
function decodeLZW(s: string, bounce: boolean): string {
var dict = {};
var dictI = {};
var data = (s + '').split('');
var currChar = data[0];
var oldPhrase = currChar;
var out = [currChar];
var code = 256;
var phrase;
for (var i=1; i<data.length; i++) {
var currCode = data[i].charCodeAt(0);
if (currCode < 256) {
phrase = data[i];
}
else {
phrase = dict['_'+currCode] ? dict['_'+currCode] : (oldPhrase + currChar);
}
out.push(phrase);
currChar = phrase.charAt(0);
if(code < 65536) {
dict['_'+code] = oldPhrase + currChar;
dictI['_' + oldPhrase + currChar] = code;
code++;
if(bounce && !dict['_'+currCode]) {//the special lag
_.each(oldPhrase.split(''), function (chr) {
if(code < 65536) {
while(dictI['_' + oldPhrase + chr]) oldPhrase += chr;
dict['_' + code] = oldPhrase + chr;
dictI['_' + oldPhrase + chr] = code;
code++;
}
});
}
}
oldPhrase = phrase;
}
return decodeSUTF(out.join(''));
}

function decodeSUTF(s: string): string {
var out = [];
var msb: number = 0;
var make: number = 0;
var from: number = 0;
_.each(s, function(val, idx) {
var k = val.charCodeAt(0);
if (k > 127) {
if (idx < from + make) return;
if ((k & 128) != 0) {
msb = k;
make = (k & 64) == 0 ? 2 : 3;
from = idx + 1;
} else {
from = idx;
}
out.push(String.fromCharCode(msb));
for (var i = from; i < from + make; i++) {
out.push(s[i]);
}
return;
} else {
out.push(String.fromCharCode(k));
}
});
return decodeUTF(out.join(''));//no separator, as the default ',' would insert commas
}

function decodeBounce(s: string): string {
return decodeLZW(s, true);
}

//=================================================
// UTF mangling with ArrayBuffer mappings
//=================================================
declare function escape(s: string): string;
declare function unescape(s: string): string;

function encodeUTF(s: string): string {
return unescape(encodeURIComponent(s));
}

function decodeUTF(s: string): string {
return decodeURIComponent(escape(s));
}

function toBuffer(str: string): ArrayBuffer {
var arr = encodeSUTF(str);
var buf = new ArrayBuffer(arr.length);
var bufView = new Uint8Array(buf);
for (var i = 0, arrLen = arr.length; i < arrLen; i++) {
bufView[i] = arr[i].charCodeAt(0);
}
return buf;
}

function fromBuffer(buf: ArrayBuffer): string {
var out: string = '';
var bufView = new Uint8Array(buf);
for (var i = 0, arrLen = bufView.length; i < arrLen; i++) {
out += String.fromCharCode(bufView[i]);
}
return decodeSUTF(out);
}

//===============================================
//A Burrows Wheeler Transform of strings
//===============================================
function encodeBWT(data: string): any {
var size = data.length;
var buff = data + data;
var idx = _.range(size).sort(function(x, y){
for (var i = 0; i < size; i++) {
var r = buff[x + i].charCodeAt(0) - buff[y + i].charCodeAt(0);
if (r !== 0) return r;
}
return 0;
});

var top: number;
var work = _.reduce(_.range(size), function(memo, k: number) {
var p = idx[k];
if (p === 0) top = k;
memo.push(buff[p + size - 1]);
return memo;
}, []).join('');

return { top: top, data: work };
}

function decodeBWT(top: number, data: string): string { //JSON

var size = data.length;
var idx = _.range(size).sort(function(x, y){
var c = data[x].charCodeAt(0) - data[y].charCodeAt(0);
if (c === 0) return x - y;
return c;
});

var p = idx[top];
return _.reduce(_.range(size), function(memo){
memo.push(data[p]);
p = idx[p];
return memo;
}, []).join('');
}

//==================================================
// Two functions to do a dictionary effectiveness
// split of what to compress. This has the effect
// of applying an effective dictionary size bigger
// than would otherwise be.
//==================================================
function tally(data: string): number[] {
return _.reduce(data.split(''), function (memo: number[], charAt: string): number[] {
memo[charAt.charCodeAt(0)] = (memo[charAt.charCodeAt(0)] || 0) + 1;//increase, an empty slot counts as zero
return memo;
}, []);
}

function splice(data: string): string[] {
var acc = 0;
var counts = tally(data);
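return _.reduce(counts, function(memo: string[], count: number, key: number) {
if (!count) return memo;//skip character codes which never occur
memo.push(String.fromCharCode(key) + data.substring(acc, count + acc));
/* adds a seek char:
This assists in DB seek performance as it's the ordering char for the lzw block */
acc += count;
return memo;//the accumulator must be handed back on every step
}, []);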
}

//=====================================================
// A packer and unpacker with good efficiency
//=====================================================
// These are the ones to call, and the rest are maybe
// useful, but can be considered as foundations for
// these functions. some block length management is
// built in.
function pack(data: any): any {
//limits
var str = JSON.stringify(data);
var chain = {};
if(str.length > 524288) {
chain = pack(str.substring(524288));
str = str.substring(0, 524288);
}
var bwt = encodeBWT(str);
var mix = splice(bwt.data);

mix = _.map(mix, encodeBounce);
return {
top: bwt.top,
/* tally: encode_tally(tally), */
mix: mix,
chn: chain
};
}

function unpack(got: any): any {
var top: number = got.top || 0;
/* var tally = got.tally; */
var mix: string[] = got.mix || [];

mix = _.map(mix, decodeBounce);
var mixr: string = _.reduce(mix, function(memo: string, lzw: string): string {
/* var key = lzw.charAt(0);//get seek char */
memo += lzw.substring(1, lzw.length);//concat
return memo;
}, '');
var chain = got.chn;
var res = decodeBWT(top, mixr);
if(_.has(chain, 'chn')) {
res += unpack(chain);//the chained block unpacks back to the remaining string
}
return JSON.parse(res);
}
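As a quick way to check the Burrows Wheeler pair in isolation, here is a self-contained sketch of the same rotation sort and inverse walk without underscore. The names `bwtEncode` and `bwtDecode` are hypothetical stand-ins for `encodeBWT` and `decodeBWT` above, and the naive rotation sort is only suitable for short test strings.

```typescript
// Forward transform: sort all rotations, keep the last column and the
// row index ("top") at which the original string ends up.
function bwtEncode(data: string): { top: number; data: string } {
  const size = data.length;
  const buff = data + data; // doubling makes rotation x start at buff[x]
  const idx: number[] = [];
  for (let i = 0; i < size; i++) idx.push(i);
  idx.sort(function (x, y) {
    for (let i = 0; i < size; i++) {
      const r = buff.charCodeAt(x + i) - buff.charCodeAt(y + i);
      if (r !== 0) return r;
    }
    return 0;
  });
  let top = 0;
  let out = '';
  idx.forEach(function (p, k) {
    if (p === 0) top = k;              // row holding the unrotated input
    out += buff.charAt(p + size - 1);  // last character of rotation p
  });
  return { top: top, data: out };
}

// Inverse transform: a stable sort of the last column gives, for each
// first column position, where that character sits in the last column.
function bwtDecode(top: number, data: string): string {
  const size = data.length;
  const idx: number[] = [];
  for (let i = 0; i < size; i++) idx.push(i);
  idx.sort(function (x, y) {
    const c = data.charCodeAt(x) - data.charCodeAt(y);
    return c !== 0 ? c : x - y;        // ties keep their original order
  });
  let p = idx[top];
  let out = '';
  for (let i = 0; i < size; i++) {
    out += data.charAt(p);
    p = idx[p];
  }
  return out;
}
```

For example, `bwtEncode('banana')` gives `{ top: 3, data: 'nnbaaa' }`, and `bwtDecode(3, 'nnbaaa')` returns `'banana'`.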

Where the 68k ISA went Wrong

An AMIGA valentine’s special.

With much hindsight it is possible to analyse where the micro computer scene and the processor market overlapped to allow Intel to overtake the market share. The selection of the EC range in the Amiga was put forward as cheaper, which although true, was indicative of the lack of micro computer use of floating point at that moment in time. The RISC ARM had some fast integer multiply and bit shuffle performance at about the same era, which convinced the use of some of the $Fxxx for all kinds of MMU and FPU mish-mash, all not for micro users. Here are some classics of the day.

No one really needs floating point – The kids didn’t, it worked in software, and GPU cards do that kind of thing today if you really NEED it.
Memory management? Who needs it? – The kids didn’t. Why would the OS not sandbox? OK, I get it, Microsoft wrote Windows and needed the general protection fault. But was not that Intel? Surely it would be better to just have read and write protection trapping on certain addressing mode sequences only. Perhaps a bit array of jump entry points, and a forced check of address modes up to the “possible invalidation point”, with restrictions on “doing things” being a sandboxed software process. TLB based virtual memory is powerful, but “clouding up” does show that library calls come in on some level, and overlays just needed easier configuration management, and perhaps a pre-fetch gain.
D cache and I cache … lovely
Why do people use switch statements on highly index-able emulation code? – Why? If going the JIT way, then a flat JIT memory recompile, and an index of jump address mappings (differing encode lengths) would look good. But dynamic jumps would need a jump index table anyhow in any rational coding strategy. The best place to put a user mode trap? Any fixed jump can be resolved, and any computed jump can be sparse treed and block loaded on demand.
Given the hindsight, a more modern fast math multi wide vector unit would be a better addition than an FPU, with a fast trap jump flag in both the user and supervisor state, to quickly run some vector emulation code. Yes, a vector enable flag in user, and a legacy flag in supervisor mode. I don’t like the 3 bit co-processor address stuff either. Those bits would be better used for other things. There’s a whole $Axxx block to mess with for MMU style things.

For example, if it was float register destination, then the standard instruction format would work, and 8 operation codes would be possible, with minimal impact on effective address decoding. The fact that saving floating point would require the reversal of the understanding of the “destination” and “source” would be minor, and make a spare addressing mode using the “direct D register implied as destination” encoding for float registers inside the coding. Given that the loading of FPU registers would also use the direct data register mode, much better would be 8 single operand functions directly operating on float registers with an immediate direct target. Int to float would be via memory, and software.

So a simple 8 dyad, 8 monad FPU would seem possible, with all literal addressing modes fetching float literals. This would have been an effective use of the $Fxxx instruction space. Very easy to learn and apply. Better assembler support, and hence incremental use. Ideal for an FPGA solution, for ease of implementation given the “back catalog” of software not present on the majority of systems such as Amiga.

FADD, FSUB, FMUL, FDIV, FLOAD, FSAVE, FMAQ (a literal [post instruction field] multiply of the EA double and add to destination), FPLY (a literal [post instruction field] added and then multiply the EA double to destination).
FINRT (inverse root), FNEG, FLNP1, FEXM1, FATN, FSIN, FBRABS (branch on negative or NaN and make absolute, offset post instruction). These would be on the immediate mode.
The immediate mode could be to recall up to 64 constants on the An mode. There is no size field. Or have the 8 float registers on the An mode be floats, and have int to float and float to int on the Dn selections.

More Ideas for Optimization

Also the BCD instructions should be removed. I think there are more logical first/better choice instructions to replace them given all the RISC knowledge, and the fact that they were avoided quite often. Useful in COBOL, but what would it offer in the line of an actual 64 bit multiply or such? There seem to be a lot of squashed in opcodes which clog the orthogonal nature. In fact losing EXG as well could be useful in extending arithmetic operations in general on the multiplication side. It also has the advantage of removing a read-modify-write as an instruction primary. Removing BCD opcodes also reduces the fan in of the ALU.

What also surprises me is the fact that 68k manufacturers never used the $Axxx opcode space. Reserved for who? Given the OS incompatibility problem needing a recompile from source on a different system, obviously the system level is the place this space is reserved for. $Axxx speculative use is for another day. Maybe DCTs?

As to the nature of the general core, it would maybe be efficient to have multiple dispatch, but a smaller core would be faster. A simple hybrid would be pipelined dispatch with stage skipping. A bigger single cycle cache structure would be more silicon efficient. To make optimal use of the data cache is more about compilation data layout. The instruction cache can be optimized by heavy code auto factorization via BWT pattern matching of a compiled binary. This does however throw some stress on the data cache as threading subroutine addresses get stacked.

The important point is though that there is no alteration of these addresses, and as such they would be better placed in a threading cache. They should be transient, and so have better write back only on spill characteristics. If the D-cache does not “dirty” on a subroutine return address stack, and all “pops” to the PC come from the T or threading cache, there is more efficiency beyond Harvard architecture. The question of where to put other stacked data items is maybe relevant. And killing the modification of a stacked return address is perhaps a good idea in general code. Adding a structured POP return is not needed. The T cache can have a higher density with just one spill address needing adding to system state.

This although adding something, does leave potentially a lot of unused “holes” in the D cache. If the pushes and pulls using SP are further placed in an S cache, a similar spill address, and code would still work without the holes, and dirty bits as long as functions did not stack relative outside the current function’s stack frame. In fact if stack frame indexing is used, the stack relative will always work correctly as a chained nest of indexing. This could be optimized by also using the S cache for link register relative indexing too. Some compatibility issues may arise with ill designed code. Perhaps use of LNK and PEA and some other modals should use an S cache counter. Or maybe this would break too many “bad old code” things, such that the supervisor “legacy” bit (or user “vector” bit) should use the D cache on the link register indirect modes when set, and the S cache switched off.

The next thing to add in the IDTS cache architecture is some mechanism for indirect threading to remove all the JSR instruction opcodes from the subroutine threaded list. Basically applying 1980’s memory techniques to the small cache sizes for fast cache speed. Entering threaded mode is technically the easy challenge, and exiting it at the point of the leaf subroutine code and re-entry on RTS is the most complex. Assuming NOP is one of the most useless instructions in a system is a bad idea, as it synchronizes bus transactions.

But as it does “nothing”, it is possibly an interesting case. As addresses are what is located in a threading list, a special address (or special addresses like the trap vector list) can be assumed to not form part of the threading, and so allow exit into regular instruction mode, by increasing the “thread counter” from zero, with each JSR/RTN pair inc/dec above zero. When the RTN takes the counter to zero, threading is re-entered. There should be an exception vector for counter overflow, and setting it to zero enters threading mode. Reset should set it to one. It would have to be set via program to zero via a supervisor trap.
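The thread counter rules above can be modelled as a tiny state machine. This is a sketch under assumptions only: the class name `ThreadCounter`, its method names, the 16 bit default limit, and the choice to treat JSR/RTN in threaded mode as a modelling error are all hypothetical, and the overflow "exception vector" is stood in for by a thrown error.

```typescript
// Hypothetical model: zero means threaded mode, above zero means
// regular instruction mode with that many nested JSR levels.
class ThreadCounter {
  private count = 1;            // reset sets the counter to one
  private readonly max: number; // assumed limit before the overflow exception

  constructor(max: number = 0xffff) {
    this.max = max;
  }

  isThreaded(): boolean {
    return this.count === 0;
  }

  value(): number {
    return this.count;
  }

  // Supervisor trap: setting the counter to zero enters threaded mode.
  enterThreading(): void {
    this.count = 0;
  }

  // A special address met in the threading list exits to regular mode.
  specialAddress(): void {
    if (this.count !== 0) throw new Error('only meaningful in threaded mode');
    this.count = 1;
  }

  // JSR in regular mode nests one level deeper.
  jsr(): void {
    if (this.count === 0) throw new Error('no JSR opcodes in threaded mode');
    if (this.count === this.max) throw new Error('thread counter overflow exception');
    this.count++;
  }

  // RTN unwinds one level; reaching zero re-enters threaded mode.
  rtn(): void {
    if (this.count === 0) throw new Error('no RTN opcodes in threaded mode');
    this.count--;
  }
}
```

A leaf routine entered from a threading list would then go `specialAddress()`, any number of balanced `jsr()`/`rtn()` pairs, and a final `rtn()` that drops the counter back to zero.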

Cache Multi-threading

Further optimization of the D cache for other structures which can be paralleled is maybe worth a little more time. A reasonable eviction buffer will fix some issues. The microcode can be split into parts for each pipeline stage, to help reduce its total size. Some pipeline sections might not even need a dual level microcode, or those that do can be 2 pipe staged. Some emphasis on branch prediction has been very effective in modern processor design, with the simplistic backwards taken a staple for many years. This is getting into speculative execution territory, but at this level can be considered speculative decode, up until the first register assignment, with some simplistic stall and flush on misprediction.

To keep the interrupt latency low is another consideration. This is greatly simplified in a stall flush design, as instruction restart is assured if register commit has no speculation into register renaming. This keeps the processor complexity down, and silicon area down. Having a second “hyper thread” is the most logical way to expand on the core. This does increase area, but shares a lot of logic when cache misses happen. It does however place its own pressure on the caches, and effectively reduces their size per thread. So turning cache area into register decode area, and consequential access delays setting the core size before going multi-core.

An interesting possibility is flipping the hyper thread registers all at once, and having no extra decode. Then splitting the cache into 2, such that the performance of each half is via less decreasing returns, slightly more effective than keeping them unified and being pressured by each other. This requires a 1 cycle switch wait, but also toggles between waits on both when both cache halves are stalled. There is slightly more complexity in write back, but this is not much of a relative problem. Multi-threading the OS queue is maybe also complex, or perhaps just as simple as interleaving the timer interrupt to be serviced by either core alternate style. Mutex locking the message passing is the only place this need be done, as the process model of sequential message queues sorts out most contention issues in an already multi-tasking system which has device and file locking.

Things that Might go at $Axxx

So far memory management, and things like DCT codec assistance. The 68k addressing modes support some very complex modes, and even then there is an extra bit set 0 (bit 3) in the full extension word format, which with a bit 8 set to 1, implies even more complex or different modes can be made. Along with the 5 I/IS reserved indications in a regular format full extension. This along with a mode field of 111 and register 101 to 111, gives plenty of addressing mode expansion possibilities. The bit field instructions are kind of unnecessary on a large memory system, as it’s better to split bit fields, or use longer instruction sequences. Maybe some are useful, but with bit plane offsets, and modulo in video architecture, their reason for being is largely removed. It’s massive data structures packed then which might need them, for smaller than byte fields. There is also the size operands in some instructions which only have 3 possibilities in 2 bits. Is the extra bit state used? Some instructions already use the fact that they can’t target certain modes for extra functionality, such as ORI, suggesting a size long 10, for some extra state information, and a generic 11 size action, mixed up with some strange “opmode and can’t be immediate” of the general OR.

I’d suggest some opcodes are hidden, such as being able to use the other non targetable modes, in “read” for some operand to target an implicit register. It looks so, and would the ruling out of these be consuming unnecessary silicon area to trigger an illegal instruction trap? The PC relative indirect modes for example, although that would just waste literal displacement immediate to point to another literal, and cycle time too. So some more implicit registers could be handled. So I bet the four unused register values in an addressing mode 111 situation would be best as just 4 extra 32 bit registers. Not usable from everywhere, but generally quicker than memory. Not general purpose, and so some compilers would find them difficult to use, but useful in hand optimized inner loops, or for often used task global variables. There is some ideal to go back to plain 68k, with the few minor fixes (if I remember there was some problem with finding status, and supervisor mode that was fixed in later models), and rationalise the add-ons.

So memory needs read protect, write protect, and non execute protect as all the calculated jumps can be indirected and range checked via a library which is accepted. Write protection is perhaps the hardest, as read protection can be done on a per task basis protected secure segment accessed via a supervisor trap and an exception on reading below a memory bound, and using the task ID in a structure allocated below this bound. Generally write protection is not just own task write protection, but global. It also can be loaded up after the addressed data has been loaded into the cache and perhaps overwritten in the cache. The write back is the resolving time of this issue for speed. A simple “bitmapped pixel per page (or even cache line)” strategy would be fast. This uses the minimum of memory. The lack of TLB delays, and having to deal with page faults would make this good. The number of bits per page can be increased to provide protection rings. Or active write zones.
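The “bitmapped pixel per page” idea can be sketched as one bit per page in a flat byte array, checked at write back time. This is only an illustration: the class name `WriteProtectMap`, the 4 kB default page size, and the convention that a set bit means "write protected" are assumptions, not part of the design above.

```typescript
// Hypothetical one bit per page write protection map.
class WriteProtectMap {
  private readonly bits: Uint8Array;
  private readonly pageShift: number;

  // memoryBytes: size of the protected address range.
  // pageShift: log2 of the page size (12 gives 4 kB pages, an assumption).
  constructor(memoryBytes: number, pageShift: number = 12) {
    this.pageShift = pageShift;
    const pages = Math.ceil(memoryBytes / Math.pow(2, pageShift));
    this.bits = new Uint8Array(Math.ceil(pages / 8));
  }

  // Size of the bitmap itself, in bytes.
  sizeBytes(): number {
    return this.bits.length;
  }

  // Mark the page containing addr as write protected.
  protect(addr: number): void {
    const page = addr >>> this.pageShift;
    this.bits[page >>> 3] |= 1 << (page & 7);
  }

  // Clear the protection bit of the page containing addr.
  unprotect(addr: number): void {
    const page = addr >>> this.pageShift;
    this.bits[page >>> 3] &= ~(1 << (page & 7));
  }

  // The check a cache line write back would make: no TLB walk, one bit test.
  isProtected(addr: number): boolean {
    const page = addr >>> this.pageShift;
    return (this.bits[page >>> 3] & (1 << (page & 7))) !== 0;
  }
}
```

With these assumed sizes a 4 GB space with 4 kB pages needs a 128 kB bitmap. Widening each entry from 1 bit to several would give the protection rings or active write zones mentioned above.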

Effective Data

The idea is that, to extend the addressing scheme, the extra I/IS field values in the regular format full extension maybe could do transforms on the data from or to the effective address. This throws an extra ALU into the bus unit. It could be useful, but like all the complex addressing modes it might not get used as much as the designer would hope. PEA and LEA would not use it. Maybe 4 of the codes use the four extra 32 bit registers that can be made? But this would not meld right. They are used in effect to switch on and off sets of things in the addressing mode. 101 to 111 should do a double memory indirect action with null, word and long outer displacements ([[bd,An]],Xn.SIZE*SCALE,od). The two remaining I/IS codes 100 and 100 with differing IS bits should really be part of a group of 3 reassigning 1:000 to ([[bd,An,Xn.SIZE*SCALE]],od) for some flexibility. A BD size field of 00 should also have some effect on the modes, by writing back into An the value An+bd, where the displacement is a word. This write back is even done when the base register is suppressed.

If full extension word format bit 3 is a logic 1, then bits 2 to 0 are an outer displacement register, and bit 6 selects if it is an address or data register. That finalizes the complexity of the addressing modes, apart from maybe utilizing base register suppress for non PC values, for all you crazy people out there. More likely a strange way of getting 8 extra addressing modes. Maybe it would be better to just redesign the extension word format. Such as making bits 7 to 0 have new meaning in a better format full extension. Making bits 5 to 0 be a nested register extension using the result of this effective address indirected as an extra base displacement, and bits 7 and 6 selecting a BD SIZE field, with 00 indicating another following brief extension word, 01 a word, and 10 a long (and maybe 64 bit at 11). The base displacement can be built up by nested brief extension too. Simple. There’s a bit to select a modification of the lower byte.

Just an order for unwinding the brief extension follows counter after processing the full extension register specs. As this ordering does not require a backward search. So a full address mode nesting (indirect displacements), with some simple immediate displacements. The T cache can be used for stacking the nesting, as no JSR will be done in an operand decode.

But no, I won’t be building a new instruction table yet. There are other things to code. This is turning into a fill the whole instruction space challenge. For all numerically increasing opcodes up to and including MOVE, I’d have to remove MOVEP, as slow IO can be easily replaced by rarely used longer instruction sequences. I’d also rationalize all the SIZE field 11 CAS group instructions, and maybe move them. MOVES is also suitable for normalization, and having 3 system registers, one to address, one read/write data, and one read status gives the 4 possible bit codes for SIZE in a better MOVES. RTM and CALLM are fine. CMP2/CHK2 are another SIZE field along with CAS/CAS2. In total there are non target modes (6) times SIZE (4) times 2 instructions for prefixes in this range (so 48). If there are 3 auxiliary registers (111 101 to 111 111, as I miscounted 4 earlier), then 4*2 = 8, so 8 prefix words for immediate targets with SIZE.

The 3 auxiliary registers could be something else. Some kind of indirect auto active addressing mode? The 8 prefixes could be assigned to add 8 full width instruction extensions. Packing can be achieved by having them as sticky prefixes via a system status register, but that would need a system trap to exit, or an exit opcode within the added extension which would be preferred. For large data structures 111 101 addressing mode should do a d32 displacement, but using the PC is the only possibility so not that useful. Allowing PC relative writes, as the default. The CALLM should support a d16 for the possible parameter size. Allow BSET etc. on address registers instead of MOVEP. Also 2*32 bit control registers can be targeted similar to CCR.

There would be some rationale to define a SIZE 11 meaning, and just eliminate everything which overrode it. OK, so it’s 64 bit. No CMP2, CHK2, MOVEP, CAS, CAS2, and MOVES is modified, d8 becomes up to d16, and all immediate modes do SR, CCR and a 32 bit and a 64 bit register. This stops the crazy overloading of bit fields. Excellent. This does affect things like MOVE from CCR in other opcodes, and some fiddling about would have to be done, as the immediate target would easily do the MOVE to.

Diagnosis

The SIZE field setting of 11 was not kept free for 64 bit use, resulting in termination of the ISA expandability. As for addressing modes, the terminal issue was not flexibility; the problem was not reassigning the XXX.W mode (difficult to use in most code on a multitasking system) as (d32, Reg). This makes 111 101 be (d32, PC). MOVEQ with bit 8 set becomes a d24 loader. There are quite a few more. There is a nice solution to the bidirectional CCR and SR issue by using 111 110 and 111 111. This makes for various immediate mode prefixes, for the targeting instructions. In fact the general illegal trap should handle the un-handled ones. Or an emulation 64 bit overlap non supervisor exception would be triggered if the supervisor legacy bit was set. Quite a lot of instructions would have to change but for the A register restrictions on some, and immediate mode target.

That horrid idea to have AND and OR have a memory destination mode and the EOR fart-o-matic restrictions. “Well I couldn’t do long division due to having an OR condition I needed to sort out in the D cache.” Much better to retask the direction bit for DIV at all sizes, and MUL too, yes and all of the unsigned kind, or make the byte and word sizes, do long and quad signed, as this would be more useful. The remainder is not available on the quad size, nor the multiply high word, would make a good compromise.

68k2 is a spreadsheet I put together to analyse the ISA. Putting the old 68k instruction set in a vectored non supervisor soft interrupt clearing the compatibility flag, based on a vector with the following function added (b15.b14.b13.b12.b8.b7.b6.b5.b4.b3)<<2 and an extra indirect jump taken. It makes a 4096 byte table, and perhaps some more soft decode, and a little set of subroutines ended RTS (or a modified enter compatibility version). This does kind of set the coding of RTS in the mess, but this can be vectored. Doing so makes a bootstrap potentially easier.
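The vector function given above, (b15.b14.b13.b12.b8.b7.b6.b5.b4.b3)<<2, can be written out as a small helper. The function name `compatVectorOffset` is hypothetical, and reading the dotted list as a bit concatenation with b15 as the most significant bit is an assumption. Ten selected bits give 1024 entries of 4 bytes each, which agrees with the 4096 byte table.

```typescript
// Byte offset into the assumed 4096 byte compatibility vector table
// for a 16 bit 68k opcode word.
function compatVectorOffset(opcode: number): number {
  const high = (opcode >>> 12) & 0xf;  // b15..b12, the opcode line
  const low = (opcode >>> 3) & 0x3f;   // b8..b3
  const index = (high << 6) | low;     // 10 bit table index, 0..1023
  return index << 2;                   // 4 bytes per vector entry
}
```

For example RTS ($4E75) has line 0100 and bits 8 to 3 of 001110, so it lands at byte offset 1080, while $FFFF lands on the last entry at offset 4092.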

Making the 64-bit idea apply to all registers, and have a massive GB SD automatic interface memory mapped so the low end is a disk hibernation space of the fast RAM, and memory limit interrupt then a 4GB system becomes natural. Even loading the DMA registers becomes automatable, with the direction of transfer too, even though the transfer size may cause confusion in code. It does mean a RAM disk driver would work, with a slight modification of the sector rw routines, and the initial format process changed some.

Part II

JDeveloper and Intel Python

The JDeveloper environment looks good. Nice work Oracle, and some of the Borland classic JBuilder. This tool looks more like how I’d use an IDE. I’ve been looking at other technologies for computer development, and a recent Intel offering (for personal use free) is the MKL backed Intel Python. It needs at least an SSE4.2 supporting chip, but does have all that is needed to run the development on Xeon Phi Knights Landing. 72 cores and 144 vector processing AVX-512 engines. Multi Tflops stuff. For the developer this is perhaps the easiest way to start HPC, as through Cython and eventually C, the best performance can be had. Maybe FPGAs will help, and tools are available for that too. I’ve seen some good demonstrations, and maybe some clients with complex or hard problems would need this.

All this parallel stuff got me thinking of Kahan sums, and simulation of incompressibles by having a high speed of sound in a compressible, and then doing a compulsory diffusion to damp oscillation, and a pressure impulse (Pa s) handling of inertial failure of containers. It might reduce the non-locality of certain simulations, and actually act to simulate pressure hammer effects.
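For reference, Kahan (compensated) summation carries a running correction term so that small addends are not lost against a large running total. A minimal sketch follows; the function name `kahanSum` is just an illustrative choice.

```typescript
// Kahan compensated summation of an array of doubles.
function kahanSum(values: number[]): number {
  let sum = 0; // running total
  let c = 0;   // running compensation for lost low order bits
  for (let i = 0; i < values.length; i++) {
    const y = values[i] - c; // apply the correction from the last step
    const t = sum + y;       // low order bits of y may be lost here
    c = (t - sum) - y;       // recover what was lost (negated)
    sum = t;
  }
  return sum;
}
```

As an example of the effect, adding ten values of 1e-16 to 1 one at a time with a plain loop leaves the total at exactly 1, since each addend is below half a unit in the last place, whereas the compensated version accumulates them.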

I’ve also recently got back into the idea of using Free Pascal for some of my projects maybe. There is now good JNI support, and even JVM targeting. I mean it’s very possible to use C for this kind of thing, but the FPC IDE and Lazarus are quick to build, with incremental unit compilation and many other features which make it good competition for general coding. Some would think it old hat, but the ease of use is excellent with much type checking, and no insistence on everything being a class. Units are very modular that way. The support for quite a few Pascal flavours is also good.