Sideloaded Kindle Fire (Pt II)

It’s been a few days, and the best benefit as yet has been the Libby app. This gets your library card hooked up to the database of books and audio books to lend. There is quite a lot of “feature fight” between the Amazon and Google. The latest being what happens when there is an update of permissions to an app. It seems that although it does suspend an Amazon overwrite, Amazon will not stop bugging you about some updates which are available (but I have yet to analyze exactly how much this consumes in bandwidth, as the firmware update seemed to consume loads of data).

There are some really nice apps which blossom on the 7″ screen, and were just too tiny on a phone. It is good to not be limited to such a small screen now. A list of apps which are almost essentials will follow, as some of the “features” such as adding files (.mp3 for example) to a folder on the Kindle SD, will just not show up. This is likely of the form of marketing from the South Park cable guy school of what no services? Buy here.

So after getting Play Store up and running, what to install?

  1. Chrome – for all your browsing needs.
  2. Outlook – I actually like this from Microsoft, and it does pick up gmail after Chrome is installed. (Not before).
  3. Google Docs and Sheets – these are quite good with Word and Excel files, but do need settings altering for saving in those formats. (Naughty Google).
  4. Facebook, Twitter – although Twitter does need to employ someone with experience of multi-notifications. Maybe it’s a birds everywhere logo-ego design.
  5. Skype – actually not that bad.
  6. USP Spectrum Emulator – don’t tell everyone. It’s excellent if you’re into your retro.
  7. Libby – an excellent public library resource.
  8. Free42 – some consider this to be the pinnacle of calculators before needing to crack open a Mathematica workbook. (An excellent open source reworking not using any HP ROMs). The simple facts that it has such a wide range of open source utilities already written for the backward compatible HP-41 range, and has over 1MB available memory reported, makes it worth getting a Kindle just for this.
  9. VLC – this is quite a nice player of audio and video, and does work with the screen off (with audio). It also reads those hidden by “the cable guy” directories.

If you purchased it using a free Amazon gift voucher, I agree with your choice. Only time will tell the battery service life and the resultant reliance on sticky gum as an assembly procedure for confounding future recycling farce-sillities.

Amiga Forever

Quite nice front end to WinUAE, with all the necessary ROMs and .adf files. There is then the process of finding some older freeware CD images, and you have a great retro Amiga system. Ummm, joystick. Hopefully there will be a cross compiler still working for the 3.1 OS, to provide back conversions of any software I may make.

There is then the process of setting up the 3.X image to my liking, and quite a lot is already done. There is also some games and other software included. This is good. The only strange thing was the 1GB hard file install of a vanilla OS, which did not seem to be used so I deleted it. … Seems it was for some lower version WB, and I’m not really using that.

After configuring the (included with some versions) Amiga Explorer it became easy to see the machine. This is not strictly necessary, but might be useful in future for remote access or some other test setup. I’ve got a Free Pascal for the 68k-amigaos, so I’ll be looking to do some back porting of anything I develop.

Interesting AROS PC edition refuses to boot today, a bad .vdi file error as seen by the guest OS. This may have been related to adding an extra network interface, but that should not do that, so I have to assume there is still some unwanted stability issues. AROS is not really necessary with a JIT compiler in WinUAE. I’m quite happy with the RTG 3.X Workbench as a general tool for doing things outside Windows and Linux. I’ll get on to installing some cover disk stuff from some “ancient” ISO files over Christmas. 600MB is a lot in Amiga world.

It does seem that OS 3.X needs an update to move the ENV-Archive off of the system boot disk, so as to fix the security to read-only. I can drop any “update” from the Windows 10 side. A good compromise is using Windows to control the access rights. This way I can be the administrator, and other users can have read-only permission on all but the Shared volume. This hopefully keeps the binaries intact for general guest users. It also provides a good excuse to fix up the Windows 10 security on the public user directory.

Arduino Fiddling

So I’ve decided on some Arduino soldering over the next few days. Fitting a fiddly SOIC F-RAM and some other components to a prototype shield. It should be fun. Then I have to fit some more code into the 32KB, and then work out a test bed and a box. I’m sure I saw a MIDI device code for the ATMega which does the USB handling. This could be useful later, but I think it does destroy the ability to connect direct with a terminal console.

The F-RAM adds 64KB of slower but fast enough memory to the Arduino, so that more complex projects can be assembled on this prototype. A display is also planned as is some analog pots and a button. Should be fun.

General Update

An update on the current progress of projects and general things here at KRT. I’ve set about checking out TypeScript for using in projects. It looks good, has some hidden pitfalls on finding .m.ts files for underscore for example, but in general looks good. I’m running it over some JS to get more of a feel. The audio VST project is moving slowly, at oscillators at the moment, with filters being done. I am looking into cache coherence algorithms and strategies to ease hardware design at the moment too. The 68k2 document mentioned in previous post is expanding with some of these ideas in having a “stall on value match” register, with a “touch since changed” bit in each cache line.

All good.

The Processor Design Document in Progress

TypeScript

Well I eventually managed to get a file using _.reduce() to compile without errors now. I’ll test it as soon as I’ve adapted in QUnit 2.0.1 so I can write my tests to the build as a pop up window, an perhaps back load a file to then be able to save the file from within the editor, and hence to become parser frame.

Representation

An excerpt from the 68k2 document as it’s progressing. An idea on UTF8 easy indexing and expansion.

“Reducing the size of this indexing array can recursively use the same technique, as long as movement between length encodings is not traversed for long sequences. This would require adding in a 2 length (11 bit form) and a 3 length (16 bit form) of common punctuation and spacing. Surrogate pair just postpones the issue and moves cache occupation to 25%, and not quite that for speed efficiency. This is why the simplified Chinese is common circa 2017, and surrogate processing has been abandoned in the Unicode specification, and replaced by characters in the surrogate representation space. Hand drawing the surrogates was likely the issue, and character parts (as individual parts) with double strike was considered a better rendering option.

UTF8 therefore has a possible 17 bit rendering for due to the extra bit freed by not needing a UTF32 representation. Should this be glyph space, or skip code index space, or a mix? 16 bit purity says skip code space. With common length (2 bit) and count (14 bit), allowing skips of between 16 kB and 48 kB through a document. The 4th combination of length? Perhaps the representation of the common punctuation without character length alterations. For 512 specials in the 2 length form and 65536 specials in the 3 length forms. In UTF16 there would be issues of decode, and uniqueness. This perhaps is best tackled by some render form meta characters in the original Unicode space. There is no way around it, and with skips maybe UTF8 would be faster.”


// tool.js 1.1.1
// https://kring.co.uk
// (c) 2016-2017 Simon Jackson, K Ring Technologies Ltd
// MIT, like as he said. And underscored 😀

import * as _ from 'underscore';

//==============================================================================
// LZW-compress a string
//==============================================================================
// The bounce parameter if true adds extra entries for faster dictionary growth.
// Usually LZW dictionary grows sub linear on input chars, and it is of note
// that after a BWT, the phrase contains a good MTF estimate and so maybe fine
// to append each of its chars to many dictionary entries. In this way the
// growth of entries becomes "almost" linear. The dictionary memory foot print
// becomes quadratic. Short to medium inputs become even smaller. Long input
// lengths may become slightly larger on not using dictionary entries integrated
// over input length, but will most likely be slightly smaller.

// DO NOT USE bounce (=false) IF NO BWT BEFORE.
// Under these conditions many unused dictionary entries will be wasted on long
// highly redundant inputs. It is a feature for pre BWT packed PONs.
//===============================================================================
function encodeLZW(data: string, bounce: boolean): string {
var dict = {};
data = encodeSUTF(data);
var out = [];
var currChar;
var phrase = data[0];
var codeL = 0;
var code = 256;
for (var i=1; i<data.length; i++) {
currChar=data[i];
if (dict['_' + phrase + currChar] != null) {
phrase += currChar;
}
else {
out.push(codeL = phrase.length > 1 ? dict['_'+phrase] : phrase.charCodeAt(0));
if(code < 65536) {//limit
dict['_' + phrase + currChar] = code;
code++;
if(bounce && codeL != code - 2) {//code -- and one before would be last symbol out
_.each(phrase.split(''), function (chr) {
if(code < 65536) {
while(dict['_' + phrase + chr]) phrase += chr;
dict['_' + phrase + chr] = code;
code++;
}
});
}
}
phrase=currChar;
}
}
out.push(phrase.length > 1 ? dict['_'+phrase] : phrase.charCodeAt(0));
for (var i=0; i<out.length; i++) {
out[i] = String.fromCharCode(out[i]);
}
return out.join();
}

function encodeSUTF(s: string): string {
s = encodeUTF(s);
var out = [];
var msb: number = 0;
var two: boolean = false;
var first: boolean = true;
_.each(s, function(val) {
var k = val.charCodeAt(0);
if(k > 127) {
if (first == true) {
first = false;
two = (k & 32) == 0;
if (k == msb) return;
msb = k;
} else {
if (two == true) two = false;
else first = true;
}
}
out.push(String.fromCharCode(k));
});
return out.join();
}

function encodeBounce(s: string): string {
return encodeLZW(s, true);
}

//=================================================
// Decompress an LZW-encoded string
//=================================================
function decodeLZW(s: string, bounce: boolean): string {
var dict = {};
var dictI = {};
var data = (s + '').split('');
var currChar = data[0];
var oldPhrase = currChar;
var out = [currChar];
var code = 256;
var phrase;
for (var i=1; i<data.length; i++) {
var currCode = data[i].charCodeAt(0);
if (currCode < 256) {
phrase = data[i];
}
else {
phrase = dict['_'+currCode] ? dict['_'+currCode] : (oldPhrase + currChar);
}
out.push(phrase);
currChar = phrase.charAt(0);
if(code < 65536) {
dict['_'+code] = oldPhrase + currChar;
dictI['_' + oldPhrase + currChar] = code;
code++;
if(bounce && !dict['_'+currCode]) {//the special lag
_.each(oldPhrase.split(''), function (chr) {
if(code < 65536) {
while(dictI['_' + oldPhrase + chr]) oldPhrase += chr;
dict['_' + code] = oldPhrase + chr;
dictI['_' + oldPhrase + chr] = code;
code++;
}
});
}
}
oldPhrase = phrase;
}
return decodeSUTF(out.join(''));
}

function decodeSUTF(s: string): string {
var out = [];
var msb: number = 0;
var make: number = 0;
var from: number = 0;
_.each(s, function(val, idx) {
var k = val.charCodeAt(0);
if (k > 127) {
if (idx < from + make) return;
if ((k & 128) != 0) {
msb = k;
make = (k & 64) == 0 ? 2 : 3;
from = idx + 1;
} else {
from = idx;
}
out.push(String.fromCharCode(msb));
for (var i = from; i < from + make; i++) {
out.push(s[i]);
}
return;
} else {
out.push(String.fromCharCode(k));
}
});
return decodeUTF(out.join());
}

function decodeBounce(s: string): string {
return decodeLZW(s, true);
}

//=================================================
// UTF mangling with ArrayBuffer mappings
//=================================================
declare function escape(s: string): string;
declare function unescape(s: string): string;

function encodeUTF(s: string): string {
return unescape(encodeURIComponent(s));
}

function decodeUTF(s: string): string {
return decodeURIComponent(escape(s));
}

function toBuffer(str: string): ArrayBuffer {
var arr = encodeSUTF(str);
var buf = new ArrayBuffer(arr.length);
var bufView = new Uint8Array(buf);
for (var i = 0, arrLen = arr.length; i < arrLen; i++) {
bufView[i] = arr[i].charCodeAt(0);
}
return buf;
}

function fromBuffer(buf: ArrayBuffer): string {
var out: string = '';
var bufView = new Uint8Array(buf);
for (var i = 0, arrLen = bufView.length; i < arrLen; i++) {
out += String.fromCharCode(bufView[i]);
}
return decodeSUTF(out);
}

//===============================================
//A Burrows Wheeler Transform of strings
//===============================================
function encodeBWT(data: string): any {
var size = data.length;
var buff = data + data;
var idx = _.range(size).sort(function(x, y){
for (var i = 0; i < size; i++) {
var r = buff[x + i].charCodeAt(0) - buff[y + i].charCodeAt(0);
if (r !== 0) return r;
}
return 0;
});

var top: number;
var work = _.reduce(_.range(size), function(memo, k: number) {
var p = idx[k];
if (p === 0) top = k;
memo.push(buff[p + size - 1]);
return memo;
}, []).join('');

return { top: top, data: work };
}

function decodeBWT(top: number, data: string): string { //JSON

var size = data.length;
var idx = _.range(size).sort(function(x, y){
var c = data[x].charCodeAt(0) - data[y].charCodeAt(0);
if (c === 0) return x - y;
return c;
});

var p = idx[top];
return _.reduce(_.range(size), function(memo){
memo.push(data[p]);
p = idx[p];
return memo;
}, []).join('');
}

//==================================================
// Two functions to do a dictionary effectiveness
// split of what to compress. This has the effect
// of applying an effective dictionary size bigger
// than would otherwise be.
//==================================================
function tally(data: string): number[] {
return _.reduce(data.split(''), function (memo: number[], charAt: string): number[] {
memo[charAt.charCodeAt(0)]++;//increase
return memo;
}, []);
}

function splice(data: string): string[] {
var acc = 0;
var counts = tally(data);
return _.reduce(counts, function(memo, count: number, key) {
memo.push(key + data.substring(acc, count + acc));
/* adds a seek char:
This assists in DB seek performance as it's the ordering char for the lzw block */
acc += count;
}, []);
}

//=====================================================
// A packer and unpacker with good efficiency
//=====================================================
// These are the ones to call, and the rest sre maybe
// useful, but can be considered as foundations for
// these functions. some block length management is
// built in.
function pack(data: any): any {
//limits
var str = JSON.stringify(data);
var chain = {};
if(str.length > 524288) {
chain = pack(str.substring(524288));
str = str.substring(0, 524288);
}
var bwt = encodeBWT(str);
var mix = splice(bwt.data);

mix = _.map(mix, encodeBounce);
return {
top: bwt.top,
/* tally: encode_tally(tally), */
mix: mix,
chn: chain
};
}

function unpack(got: any): any {
var top: number = got.top || 0;
/* var tally = got.tally; */
var mix: string[] = got.mix || [];

mix = _.map(mix, decodeBounce);
var mixr: string = _.reduce(mix, function(memo: string, lzw: string): string {
/* var key = lzw.charAt(0);//get seek char */
memo += lzw.substring(1, lzw.length);//concat
return memo;
}, '');
var chain = got.chn;
var res = decodeBWT(top, mixr);
if(_.has(chain, 'chn')) {
res += unpack(chain.chn);
}
return JSON.parse(res);
}

 

68k Continued …

A Continuation as it was Getting Long

The main thing in any 64 bit system is multi-processing. Multi-threading has already been covered. The CAS instruction is gone, and cache coherence is a big thing. So a supervisor level mutex? This is an obvious need. The extra long condition code register? How about a set of bits to set, and a stall if not zero? The bits could count down to zero over a number of cycles, leaving an opportunity to spin lock any memory location. Putting it in the status or condition code registers avoids the chip level cache shuffle. A non supervisor version would help user tasks. This avoids the need for atomic operations to a large extent, enough to not need them.

The fact a cache can reset pre-filled with “high memory” garbage, and not need empty bits, saves a little, but does need a little care on the compliance of boot sequence. A write back to the cache causes a cross core invalidate in most cache designs, There is an argument to set some status bit for ease of implementation. Resetting just the cache line would work, but remove a small section of memory from the 64 bit address space. A data invalidation queue would be useful to assist in the latency of reset to some synchronous opportunity, the countdown stall assisting in queue size management. As simultaneous write is a race condition, and a fail, by simultaneous deletion, the chip level mutexes must be used correctly. For the case where a cache load has to be performed, a double mutex count lock might have to be done. This infers that keeping the CPU ID somewhere to speed the second mutex lock might be beneficial.

Check cached, maybe repeat, set global, check cached, check global, maybe repeat, set cached, do is OK. A competing lock would fail maybe on the set cached if a time slice occurred just before it. An interrupt delay circuit would be needed for a number of instructions when the global is checked. The common access to the value either sets a stall timer or an interrupt stall timer, or a common timer register, with both behaviours. A synchronization window. Of course a badly written code piece could just set the cached, and ruin everything. But write range bounding would prevent this.

The next issue would be to sort out duplicating a read copy of a cache line into a local cache. This is so likely to be shared memory with the way software should be working. No process shares a cache line otherwise by sensible design of software. A read should get a clone from memory (to not clog a cache transfer bus if the other cache has not written). A cache should check for another cache written dirty, and send a read copy. A write should cause a delete invalidation on the other caches. If locks are correctly written this will preserve all writes. The cache bus only then has to send dirty copies, and in validations. Packet formats are then just an address, an RW bit, and the data width of a cache line (the last part just on the return bus), and the RW bit on the return bus is not used.

What happens when a second write happens, and a read only copy is in transit? It is invalid on arrival, but not responsible for any write back. This is L2 cache here. The L1 cache can also be data invalidated, but can stand the read delay. Given the write invalidate strategy, the packet in transit can be turned into an invalidate packet. The minor point is the synchronous assignment to the cache of the read copy at the same bus cycle edge as the write. This just needs a little logic to prevent this by “special address” forwarding. A sort of cancel on execute as it were.

It could be argued that sending over a read only copy is a bad idea and wastes by over connecting the caches. But to not send it would result in a L3 fetch of something not yet written to L3 yet, or the other option would be to stall based on address until it exits the cache on the other processor from under use of the associative address. That could take a very long time. The final issue is closing the mutex. The procedure is the same as opening, but using a different value to set cached. The mutex needs to be flushed? Nope, as the check cached will send a read only copy, and the set cached will invalidate the other dirty.

I think that makes for a minimal logic L2 cache. The L3 cache can be shared, and the T and S caches do not need coherence. Any sensible code would not need this. The D cache need invalidation only. The I cache should not need anything. When data is written to memory for later use as instructions, there is perhaps an issue, also with self modifying code, which frankly should be ignored as an issue. The L2 cache should get written with code, and a fetch should get a transferred read only copy. There would be no expectation of another write to the same memory location after scheduling execution.

There should be some cache coherence for DMA. There should be no expectation of write to a DMA block before the DMA output transfer is complete. The DMA therefore needs its own L2 cache “simulation” to receive read only updates, and to invalidate when DMA does an input source read. It is only slow off chip IO which necessitates a flush to L3 and main memory. Such things if handled well can allow the write back queue to only have elements entered onto it when hitting the L2 eviction cache. Considering that there is a block of memory which signals cache empty, it makes sense to just pass this write directly out, and latch for immediate continuation of execution, and stall only if the external bus cycle is not complete on a second write to those addresses. The input read on those addresses have to stall by default if a simplistic ideology is taken.

A more complex method is to indicate a pre-fetch. In a similar way to the 1 item buffer. I hope your IO does not read trigger events (unlikely, but write triggering is not unheard of). A delay 1 item buffer does help with a bit of for knowledge, and the end of bus cycle latching into this delay slot can be used to continue processing and routine setup. Address latches internally help with the clock domain crossing. The only disadvantage to this is that the processor decides the memory mapped device layout. It would be of benefit to shuffle this slow bus over a serial protocol. This makes an external PLA, micro-controller or FPGA suitable for running the slow bus, and keeps PIN count on the main CPU lower, allowing main memory to be placed and routed closer to the core CPU “silicon chip” die.

L4 Cache

The concept of cloud as galactic cache is perhaps a thing that some are new to. There is the sequential stride static column idea, which is good for some processing and in effect gives the I cache the highest performance, and shows the sequential stride the best. For tasks that flood the D cache, which is most when heavy optimization is used, The question becomes “is there a need for associative L4 cache off chip?” for an effective use of some MB of static single cycle RAM? With a bus size of 32 bits data, and a 64 bit addressing system, the tag would exceed the data in consuming the SRAM. If the memory bus is just 32 bit addressing, with some DMA SD card trickery for the high word, this still makes the tag large, but less than 50% of the SRAM usage. Burst mode in this sense is auto increment on the low addresses within the SRAM chip to stop copper trace charge power wastage, and a tag check wait state and DRAM access generator. The fact that DRAM is accessed in blocks makes the tag shrink further.

Yes it’s true, DRAM should have “some” associative SRAM as well as static column banks. But the net effect on performance is minimal. It’s more of a L3 eviction cache extender. Things down to the RAM disk has “read only” or even “no action” system files on it for all the memory, makes file buffers actually be files. Most people would not appreciate this level of detail, but it does however allow for easy contraction and expansion of the statically sized RAM disk. D cache thrashing is the problem to solve. Make it bigger. The SDRAM issue can be solved by putting a fair amount of it there, and placing a cloud in higher (or different address space) memory. The SD card interface is then the location of the network interface. Each part of memory is then divided into 3 parts at this level. A direct part, and associative part assisting another cache level and a tag part. the tag part and the associative assist for an interleave partition. Browser caches are a system level feature, not an issue for application developers. The flush cache is “new disk, new net” and beyond.

How to present a file browser picture of the web? FTP sort of did it for files. And a bookmark and search view seem like a good way of starting out. There maybe should be an unknown folder with some entropicly selected default folders based on wordage, and the web becomes seen in scope. At the level of a site index, the “tool type” should render a page view of the “folder”. The source view may also be relevant for some. People have seen this before, or close to it, and made some interesting research tools. I suppose it does not add up to profit per say, but has much more context in robot living allowance world.

This is the end of part II, and maybe more …

 

  • Addressing (An, Dn<<{size16*SHS}.{DS<<size}, d8/24) so that the 2 bit DS field indexes one of 4 .W or larger.fields, truncated to .B, .W, .L or .Q with apparently even more options to spare. If SHS == 3 for example the DS > 1 have no extra effect. I’ll think on this (18th Feb 2017).
  • If SHS == 2 then DS == 3 has no effect. If SHS == 1 then DS == 3 does have an effect.
  • This can provide for 3 extra addressing modes not yet developed.

68k2-PC#d12 It’s got better! A fab addressing mode. More then 12 bits of embed-able opcode space remains in a 32 bit wide opcode extension, and almost all the 16 bit opcodes are used (all the co-processor F line slots). With not 3, but 10 extra addressing modes.

Where the 68k ISA went Wrong

An AMIGA valentine’s special.

With much hindsight it is possible to analyse where the micro computer scene and the processor market overlapped to allow Intel to overtake the market share. The selection of the EC range in the Amiga was put forward as cheaper, which although true, was indicative of the lack of micro computer use of floating point at that moment in time. The RISC ARM had some fast integer multiply and bit shuffle performance about the same era, which convince the use of some of the $Fxxx for all kind of MMU and FPU mish-mash, all not for micro users. Here are some classics of the day.

No one really needs floating point – The kids didn’t, it worked in software, and GPU cards do that kind of thing today if you really NEED it.
Memory management? Who needs it? – The kids didn’t. Why would the OS not sandbox? OK, I get it, Microsoft wrote Windows and needed the general protection fault. But was not that Intel? Surely it would be better to just have read and write protection trapping on certain addressing mode sequences only. Perhaps a bit array of jump entry points, and a forced check of address modes upto the “possible invalidation point”, with restrictions on “doing things” being a sand boxed software process. TLB based virtual memory is powerful, but “clouding up” does show that library calls come in on some level, and overlays just needed easier configuration management, and perhaps a pre-fetch gain.
D cache and I cache … lovely
Why do people use switch statements on highly index-able emulation code? – Why? If going the JIT way, then a flat JIT memory recompile, and a index of jump address mappings (differing encode lengths) would look good. But dynamic jumps would need a jump index table anyhow in any rational coding strategy. The best place to put a user mode trap? Any fixed jump can be resolved, and any computed jump can be sparse treed and block loaded on demand.
Given the hindsight, a more modern fast math multi wide vector unit would be a better addition than an FPU, with a fast trap jump flag in both the user and supervisor state, to quickly run some vector emulation code. Yes, a vector enable flag in user, and a legacy flag in supervisor mode. I don’t like the 3 bit co-processor address stuff either. Those bits would be better used for other things. There’s a whole $Axxx block to mess with for MMU style things.

For example if it was float register destination, then the standard instruction format would work, and 8 operation codes would be possible, with minimal impact on effective address decoding. The fact that saving floating point would require the reversal of the understanding of the “destination” and “source” would be minor, and make a spare addressing mode using the “direct D register implied as destination” encoding for float registers inside the coding. Given that the loading of FPU registers would also use the direct data register mode, Much better would be 8 single operand functions directly operating on float registers with an immediate direct target. Int to float would be via memory, and software.

So a simple 8 dyad, 8 monad FPU would seem possible, with all literal addressing modes fetching float literals. This would have been an effective use of the $Fxxx instruction space. Very easy to learn and apply. Better assembler support, and hence use incremental. Ideal for an FPGA solution, for ease of implementation given the “back catalog” of software not present on the majority of systems such as Amiga.

  • FADD, FSUB, FMUL, FDIV, FLOAD, FSAVE, FMAQ (a literal [post instruction field] multiply of the EA double and add to destination), FPLY (a literal [post instruction field] added and then multiply the EA double to destination).
  • FINRT (inverse root), FNEG, FLNP1, FEXM1, FATN, FSIN, FBRABS (branch on negative or NaN and make absolute, offset post instruction). These would be on the immediate mode.
  • The immediate mode could be to recall upto 64 constants on the An mode. There is no size field. Or have the 8 float registers on the An mode be floats, and have int to float and float to int on the Dn selections.

More Ideas for Optimization

Also the BCD instructions should be removed. I think there are more logical first/better choice instructions to replace them given all the RISC knowledge, and the fact that they were avoided quite often. Useful in COBOL, but what would it offer in the line of an actual 64 bit multiply or such? There seems a lot of squashed in opcodes which seem to clog the orthogonal nature. In fact losing EXG as well, could be useful in extending arithmetic operations in general on the multiplication side. It also has the advantage of removing a read-modify-write as an instruction primary. Removing BCD opcodes also reduces the fan in of the ALU.

What also surprises me is the fact that 68k manufacturers never use the $Axxx opcode space. Reserved for who? Given the OS incompatibility problem needing a recompile from source on a different system, obviously the system level is the place this space is reserved for. $Axxx speculative use is for another day. Maybe DCTs?

As to the nature of the general core, it would be efficient maybe have multiple dispatch, but a smaller core would be faster. A simple hybrid would be pipelined dispatch with stage skipping, A bigger single cycle cache structure would be more silicon efficient. To make optimal use of the data cache is more about compilation data layout. The instruction cache can be optimized by heavy code auto factorization via BWT pattern matching of a compiled binary. This does however throw some stress on the data cache as threading subroutine addresses get stacked.

The important point is though that there is no alteration of these addresses, and as such they would be better placed in a threading cache. They should be transient, and so have better write back only on spill characteristics. If the D-cache does not “dirty” on a subroutine return address stack, and all “pops” to the PC come from the T or threading cache, there is more efficiency beyond Harvard architecture. The question of where to put other stacked data items is maybe relevant. And killing the modification of a stacked return address is perhaps a good idea in general code. Adding a structured POP return is not needed. The T cache can have a higher density with just one spill address needing adding to system state.

This although adding something, does leave potentially a lot of unused “holes” in the D cache. If the pushes and pulls using SP are further placed in an S cache, a similar spill address, and code would still work without the holes, and dirty bits as long as functions did not stack relative outside the current function’s stack frame. In fact if stack frame indexing is used, the stack relative will always work correctly as a chained nest of indexing. This could be optimized by also using the S cache for link register relative indexing too. Some compatibility issues may arise with ill designed code. Perhaps use of LNK and PEA and some other modals should use an S cache counter. Or maybe this would break too many “bad old code” things, such that the supervisor “legacy” bit (or user “vector” bit) should use the D cache on the link register indirect modes when set, and the S cache switched off.

The next thing to add in the IDTS cache architecture is some mechanism for indirect threading to remove all the JSR instruction opcodes from the subroutine threaded list. Basically applying 1980’s memory techniques to the small cache sizes for fast cache speed. Entering threaded mode is technically the easy challenge, and exiting it at the point of the leaf subroutine code and re-entry on RTS is the most complex. Assuming NOP is one of the most useless instructions in a system is a bad idea, as it synchronizes bus transactions.

But as it does “nothing”, it is possibly an interesting case. The fact that addresses will be what is located in a threading list, a special address (or special addresses like the trap vector list), can be assumed to not form part of the threading, and so allow exit into regular instruction mode, by increasing the “thread counter” from zero, with each JSR/RTN pair inc/dec above zero. When the RTN takes the counter to zero, threading is re-entered. There should be an exception vector for counter overflow, and setting it to zero enters threading mode. Reset should set it to one. It would have to be set via program to zero via a supervisor trap.

Cache Multi-threading

Further optimization of the D cache for other structures which can be paralleled is maybe worth a little more time. A reasonable eviction buffer will fix some issues. The microcode can be split into parts for each pipeline stage, to help reduce its total size. Some pipeline sections might not even need a dual level microcode, or those that do can be 2 pipe staged. Some emphasis on branch prediction has been very effective in modern processor design, with the simplistic backwards taken a staple for many years. This is getting into speculative execution territory, but at this level can be considered speculative decode, up until the first register assignment, with some simplistic stall and flush on miss prediction.

To keep the interrupt latency low is another consideration. This is greatly simplified in a stall flush design, as instruction restart is assured if register commit has no speculation into register renaming. This keeps the processor complexity down, and silicon area down. Having a second “hyper thread” is the most logical way to expand on the core. This does increase area, but shares a lot of logic when cache misses happen. It does however place its own pressure on the caches, and effectively reduces their size per thread. So turning cache area into register decode area, and consequential access delays setting the core size before going multi-core.

An interesting possibility is flipping the hyper thread registers all at once, and having no extra decode. Then splitting the cache into 2, such that the performance of each half is via less decreasing returns, slightly more effective than keeping them unified and being pressured by each other. This requires a 1 cycle switch wait, but also toggles between waits on both when both cache halves are stalled. There is slightly more complexity in write back, but this not much of a relative problem. Multi-threading the OS queue is maybe also complex. or perhaps just as simple as interleaving the timer interrupt to be serviced by either core alternate style. Mutex locking the message passing is the only place this need be done, as the process model of sequential message queues sorts out most contention issues in an already multi-tasking system which has device and file locking.

Things that Might go at $Axxx

So far memory management, and things like DCT codec assistance. The 68k addressing modes support some very complex modes, and even then there is an extra bit set 0 (bit 3) in the full extension word format, which with a bit 8 set to 1, imply even more complex or different modes can be made. Along with the 5 I/IS reserved indications in a regular format full extension. This along with a mode field of 111 and register 101 to 111, give plenty of addressing mode expansion possibilities. The bit field instructions are kind of unnecessary on a large memory system, as it’s better to split bit fields, or use longer instruction sequences. Maybe some are useful, but with bit plane offsets, and modulo in video architecture, there reason for being is largely removed. It’s massive data structures packed then which might need them, for smaller than byte fields. There is also the size operands in some instructions which only have 3 possibilities in 2 bits. Is the extra bit state used? Some instructions already use the can’t target certain modes for extra functionality, such as ORI, suggesting a size long 10, for some extra state information, and a generic 11 size action, mixed up with some strange “opmode and can’t be immediate” of the general OR.

I’d suggest some opcodes are hidden, such as being able to use the other non targetable modes, in “read” for some operand to target an implicit register. It looks so, and would the ruling out of these be consuming unnecessary silicon area to trigger an illegal instruction trap? The PC relative indirect modes for example, although that would just waste literal displacement immediate to point to another literal, and cycle time too. So some more implicit registers could be handled. So I bet the four unused register values in an addressing mode 111 situation would be best as just 4 extra 32 bit registers. Not usable from everywhere, but generally quicker than memory. Not general purpose, and so some compilers would find them difficult to use, but useful in hand optimized inner loops, or for often used task global variables. There is some ideal to go back to plain 68k, with the few minor fixes (if I remember there was some problem with finding status, and supervisor mode that was fixed in later models), and rationalise the add-ons.

So memory needs read protect, write protect, and non execute protect as all the calculated jumps can be indirected and range checked via a library which is accepted. Write protection is perhaps the hardest, as read protection can be done on a per task basis protected secure segment accessed via a supervisor trap and an exception on reading below a memory bound, and using the task ID in a structure allocated below this bound. Generally write protection is not just own task write protection, but global. It also can be loaded up after the addressed data has been loaded into the cache and perhaps overwritten in the cache. The write back is the resolving time of this issue for speed. A simple “bitmapped pixel per page (or even cache line)” strategy would be fast. This uses the minimum of memory. The lack of TLB delays, and having to deal with page faults would make this good. The number of bits per page can be increased to provide protection rings. Or active write zones.

Effective Data

The idea that to extend the addressing scheme the extra I/IS field values in the regular format full extension maybe could do transforms on the data from or to the effective address. This throws an extra ALU into the bus unit. It could be useful, but like all the complex addressing modes it might not get used as much as the designer would hope. PEA and LEA would not use it, Maybe 4 of the codes use the four extra 32 bit registers that can be made? But this would not meld right. They are used in effect to switch on and off sets of things in the addressing mode. 101 to 111 should do a double memory indirect action with null, word and long outer displacements ([[bd,An]],Xn.SIZE*SCALE,od). The two remaining I/IS codes 100 and 100 with differing IS bits should really be part of a group of 3 reassigning 1:000 to ([[bd,An,Xn.SIZE*SCALE]],od) for some flexibility. A BD size field of 00 should also have some effect on the modes, by writing back into An the value An+bd, where the displacement is a word. This write back is even done when the base register is suppressed.

If full extension word format bit 3 is a logic 1, then bits 2 to 0 are an outer displacement register, and bit 6 selects if it is an address or data register. That finalizes the complexity of the addressing modes, apart from maybe utilizing base register suppress for non PC values, for all you crazy people out there. More likely a strange way of getting 8 extra addressing modes. Maybe it would be better to just redesign the extension word format. Such as making bits 7 to 0 have new meaning in an better format full extension. Making bits 5 to 0 be a nested register extension using the result of this effective address indirected as an extra base displacement, and bits 7 and 6 selecting a BD SIZE field, with 00 indicating another following brief extension word, 01 a word, and 10 a long (and maybe 64 bit at 11). The base displacement can be built up by nested brief extension too. Simple. There’s a bit to select a modification of the lower byte.

Just an order for unwinding the brief extension follows counter after processing the full extension register specs. As this ordering does not require a backward search. So a full address mode nesting (indirect displacements), with some simple immediate displacements. The T cache can be used for stacking the nesting, as no JSR will be done in an operand decode.

But no I won’t be building a new instruction table yet. There are other things to code. This is turning into a fill the whole instruction space challenge. For all numerically increasing opcodes upto and including MOVE, I’d have to remove MOVEP, as slow IO can be easily replaced by rarely used longer instruction sequences. I’d also rationalize all the SIZE field 11 CAS group instructions, and maybe move them. MOVES is also suitable for normalization, and having 3 system registers, one to address, one read/write data, and one read status gives the 4 possible bit codes for SIZE in a better MOVES. RTM and CALLM are fine. CMP2/CHK2 are another SIZE field along with CAS/CAS2. In total there are non target modes (6) times SIZE (4) times 2 instructions for prefixes in this range (so 48). If there are 3 auxiliary registers (111 101 to 111 111, as I miscounted 4 earlier), then 4*2 = 8, so 8 prefix words for immediate targets with SIZE.

The 3 auxiliary registers could be something else. Some kind of indirect auto active addressing mode? The 8 prefixes could be assigned to add 8 full width instruction extensions. Packing can be achieved by having them as sticky prefixes via a system status register, but that would need a system trap to exit, or an exit opcode within the added extension which would be preferred. For large data structures 111 101 addressing mode should do a d32 displacement, but using the PC is the only possibility so not that useful. Allowing PC relative writes, as the default. The CALLM should support a d16 for the possible parameter size. Allow BSET etc. on address registers instead of MOVEP, Also 2*32 bit control registers can be targeted similar to CCR.

There would be some rational to define a SIZE 11 meaning, and just eliminate everything which overrode it. OK, so it’s 64 bit. No CMP2, CHK2, MOVEP, CAS, CAS2, and MOVES is modified, d8 become upto d16, and all immediate mode do SR, CCR and a 32 bit and a 64 bit register. This stops the crazy overloading of bit fields. Excellent. This does affect things like MOVE from CCR in other opcodes, and some fiddling about would have to be done, as the immediate target would easily do the MOVE to.

Diagnosis

The SIZE field setting of 11 was not kept free for 64 bit use. Resulting in termination of the ISA expandability. As for addressing modes the terminal issue was not flexibility, the problem was not reassigning the XXX.W mode (difficult to use in most code on a multitasking system), and reassigning it as (d32, Reg). This makes 111 101 be (d32, PC). MOVEQ with bit 8 set, becomes a d24 loader. There are quite a few more. There is a nice solution to the bidirectional CCR and SR issue by using 111 110 and 111 111. This makes for various immediate mode prefixes, for the targeting instructions. In fact the general illegal trap should handle the un-handled ones. Or an emulation 64 bit overlap non supervisor exception would be triggered if the supervisor legacy bit was set. Quite a lot of instructions would have to change but for the A register restrictions on some, and immediate mode target.

That horrid idea to have AND and OR have a memory destination mode and the EOR fart-o-matic restrictions. “Well I couldn’t do long division due to having an OR condition I needed to sort out in the D cache.” Much better to retask the direction bit for DIV at all sizes, and MUL too, yes and all of the unsigned kind, or make the byte and word sizes, do long and quad signed, as this would be more useful. The remainder is not available on the quad size, nor the multiply high word, would make a good compromise.

68k2 is a spreadsheet I put together to analyse the ISA. Putting the old 68k instruction set in a vectored non supervisor soft interrupt clearing the compatibility flag, based on a vector with the following function added (b15.b14.b13.b12.b8.b7.b6.b5.b4.b3)<<2 and an extra indirect jump taken. It makes a 4096 byte table, and perhaps some more soft decode, and a little set of subroutines ended RTS (or a modified enter compatibility version). This does kind of set the coding of RTS in the mess, but this can be vectored. Doing so makes a bootstrap potentially easier.

Making the 64-bit idea apply to all registers, and have a massive GB SD automatic interface memory mapped so the low end is a disk hibernation space of the fast RAM, and memory limit interrupt then a 4GB system becomes natural. Even loading the DMA registers becomes automatable, with the direction of transfer too, even though the transfer size may cause confusion in code. It does mean a RAM disk driver would work, with a slight modification of the sector rw routines, and the initial format process changed some.

Part II

PocketCHIP

The CHIP is and excellent little computer. The production is having difficulties at present, but hopefully this will be sorted out soon. You can even get a keyboard and screen case thing, for mobile use. In short it’s a competitor to the Raspberry Pi, and has a number of benefits of not needing a WiFi dingle, and has SD disk built in, so not needing a card. What hasn’t it got by default? HDMI, many USB ports, free Mathematics, easy SD card swapping, and only has one OS choice. What extra does it have? Supports composite video, an easy portable case is available, has a built in battery charge circuit, costs less, has Bluetooth and WiFi builtin, and if bought with the pocket case with screen, it has some free game software thrown in. A more in depth out of the box review to follow.

Why my interest? Thin client possibilities of the future and present. So far so good. Quite intuitive to use, and only 17% (after an apt upgrade) of the disk used with the default install. The keyboard is a little fiddly but everything is there. There should be some good options for tools building on this. WiFi connects smoothly, no problems. The home key and ALT+TAB are effective for window management, and the included home screen defaults, are for good fun. Notable exceptions to the install are no immediate browsing, or JS/HTML client. No RTF or anything more than a simple text editor. But considering the pocketCHIP form factor, this would be an ideal typing on the go platform, where whipping out a laptop would be excessive, and a tablet might do, but a Windows Tablet will consume more power, and an Android Tablet would not quite be as customizable. Not that this all can’t be sorted given enough skill with Linux, and given the device is designed for such hackery. it’s likely a plus.

For cloud deploy, the missing features are some kind of file sync. something that would make corporate grade application prototyping a breeze. I think config files have to become config directories with date ordering priority for applied last relevance. It could make some kind of key code download architecture for situational setup. Of course this would have to be done at user level privileges. Maybe a git branch tag or ID. So I think the first install is git, and then someway to monkey patch the menu system. Looks like gksu or -A on sudo is going to have to be used to add a script, and then delete the script installer, so as to prevent the hijack of the script at a later date. This would make a wget over https pipe into bash as a one line boot strap.

Browsing the Web

I was fiddling about with all those usual suspects, but the team got it right surf is the browser to got for. Just press help, and then CTRL + G for to enter the matrix. The simplistic excellence. It’s even running a reasonable JavaScript. Sorted. Rendering is excellent with good HTML, and complete crap with absolute sized elements. Lucky a little surf config will reflow some bad sites.