Thursday, September 3, 2009

a new data model

In an attempt to make NewScript accessible to more people, I've been playing around with making the language more compatible with things that people find familiar. To that end, I'm marrying a few ideas from Javascript with the NewScript object model that was implemented. When combined with an ad hoc programming style, the combination is really quite interesting from an "other people might use this" point of view.

Where is NewScript headed


One of the things I've been looking at is all the code I've written over the past year and thinking about what is waste, and what is productive. Here's a sample of some code from a project that I'm working on right now:

function display_account(a) {
    $('main').clear();
    $('main').add(h3().text(a.company)).div().inset().table(
        ).row(node('Contact:'),node(a.contact)
        ).row(node('Address:'),node(a.address)
        ).row(node('Phone:'),node(a.phone.phone())
        ).row(node('Cell:'),node(a.cell.phone()));
    $('main').button('Add to Call List',function() { mark(this.data) }, a.id);
}

This is the sort of bog-standard HTML generation stuff that can run either server side or client side if you have the right functions declared. As you can see, it consists of 3 statements and a lot of function calls. If we look at this code from the viewpoint of trying to make it read nicer, I might want to write something like:

display_account : (a) {
Main clear;
h3 text (a company);
div inset table
row ( node ('Contact' ), node ( a contact ) )
row ( node ('Address'), node ( a address) )
row ( node ('Phone'), node (a phone number) )
row ( node ('Cell'), node (a cell number ) );
button ( 'Add to Call List', { mark ( this data ) }, (a id) )
}

The objective here is to set the object context once, make sure all of the messages get passed to the current context, either at the top level, Main, or to the value on the stack. Parentheses are used only to designate application of arguments as a matter of order of operation. Declarations of all types are simply ditched, and the binding of a key to a value in the top level namespace is handled as a matter of using a key : value operator. Lambda functions are declared with {} brackets, with an optional () prefix. Application of a lambda could be expressed as (){}() which looks a little perverse but that's an easy pattern to spot in code, and will also be exceedingly rare. Additionally ; functions as an operator which effectively discards the values on the top of the stack. The first ; discards the h3 element, the second ; discards the table and div, and the value returned at the end is the value of button which is the object Main itself! In this sense, each piece of the puzzle serves an actual purpose and produces rather clean code. It may look a bit like Smalltalk, with the removal of so much punctuation, but the more familiar order of operation of the Algol style languages should make it more accessible to a wider array of programmers.

Tuesday, July 14, 2009

64 bit NewScript

So this morning, I began on another aspect of the rewrite of the native NewScript implementation for x86. And today's revelation was that I could, with little effort, do both a NewScript32 and a NewScript64 native implementation. The primary differences between the two boil down to the Core word macros and the layout of some elements in memory. While the 32 bit version is very compact, in both object code produced and description, the 64 bit version gets to store most of the core system variables in registers, thanks to the 8 additional general purpose registers.

JIT Thoughts



So I've been playing around with implementing a version of the NewScript VM in Smalltalk, in addition to the C implementation, largely just as an excuse to put a NewScript parser/compiler combo in a Squeak image. It would give me a nice proof of concept of some of the extensions I'm making to my own Smalltalk variant, and at the same time test out some of the compiler techniques I'm using in NewScript. What I find most amusing is that I could very well implement the Smalltalk VM on top of the NewScript VM, and just JIT Squeak that way.

One of the issues I've had with the current approach to JIT engines in more mainstream applications is that they're all still plagued by a religious notion of code and data. When you really think about JIT, the split between code and data is a false dichotomy. Code is data and data is code; all that distinguishes the two is how you view them. If the CPU is executing something, it is code, even if you intended it to be data. Likewise, if you're generating machine code, or byte code, or objects, or spreadsheets, they're all just data that will be interpreted by some form of computer, be it machine, virtual machine, or state machine. After all, your hardware doesn't ascribe meaning to any of those bits, it just pushes electrons around.

So my intention for "JIT" for NewScript is just to compile code as necessary and throw it away when it's done. This is no different from copying a file from disk into memory, linking it against some memory resident libraries, and then executing it. That's all the NewScript compiler does, except you replace files from disk with memory, and libraries with more memory. I also want to avoid the Source > AST > IR > Byte/Machine Code translations common in other environments. Source > BNF > AST is just asinine. You're using a computer that parses your source to color highlight it, so WTF are you doing storing it as byte strings? And why are you generating a parser for a syntax notation with yet another syntax notation? Chickens lay eggs after all. Then why do you need another intermediate form, when an AST is an intermediate form? To help perform automatic optimizations? That could be done by manipulating the AST. But manipulating the AST is just reorganizing your source code. So why not just optimize the source?

Better Tools



So what I want for NewScript is a better tool set. The short list is:


  • An optimizing editor, that peep-hole optimizes your source code. For example, type "dup drop" and both words go poof!

  • An editor that represents source as your AST, or is it an editor that edits your AST as source?

  • Disposable binaries with ad hoc compilation, wherein code is compiled and executed on the fly for specific data values. Entire programs are just tossed out upon completion

  • An editor that allows you to layout memory as you'd like to, graphically, and interactively
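
The "dup drop" example in the first item of the list can be sketched in a few lines. This is only an illustration in Python, not the NewScript editor's code: the function name peephole and the representation of source as a list of words are assumptions made for the example.

```python
def peephole(words):
    """Remove adjacent 'dup drop' pairs from a list of source words.

    Hypothetical helper: the real editor would work on its own source
    representation, not on a Python list of strings.
    """
    out = []
    for word in words:
        if word == "drop" and out and out[-1] == "dup":
            out.pop()  # the pair cancels out, so neither word survives
        else:
            out.append(word)
    return out


source = "1 dup drop 2 +".split()
optimized = peephole(source)
```

Because the output list doubles as a stack, nested pairs such as "dup dup drop drop" also collapse in a single pass.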



What this means is since the editor is representing and manipulating the source as if it were the AST, and the source and compilation facilities are available at all times, we don't bother treating binaries as sacrosanct. Rather, a binary is just some code we need the CPU to run to manipulate some data. This allows us to compile only that code we need to perform the immediate processing requirements, and change the code whenever the environment changes. Since the editor has already preoptimized both the source and the representation, compilation effectively is just mapping source through a function that rolls out machine instructions, pre-canned and ready to run.
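
As a rough sketch of "mapping source through a function that rolls out machine instructions", compilation can be pictured as a table lookup per word. Everything specific here is an assumption for illustration: the table name CANNED, the function compile_words, and above all the byte values, which are placeholders and not real NewScript or x86 encodings.

```python
# Hypothetical table of pre-canned instruction bytes, one entry per word.
# The byte values are placeholders chosen for the example only.
CANNED = {
    "dup": b"\x01",
    "drop": b"\x02",
    "+": b"\x03",
}


def compile_words(words):
    """Concatenate the pre-canned bytes for each already-optimized word."""
    return b"".join(CANNED[word] for word in words)


binary = compile_words(["dup", "+"])
```

The resulting bytes are disposable: when the data or environment changes, the words are simply mapped again.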

Friday, July 3, 2009

NewScript in C Revisited

So it has been a while since I posted any of the updated progress on the NewScript in C front. And I'm pleased to announce that there is yet another version of the NewScript programming environment in C! What has stayed the same is the instruction encoding, literal encoding, and the main execution loop. What is different is the instruction set, naming conventions, code size, and formatting. As I've been doing more and more programming with the NewScript compilers, I've been revising the instruction set to better match the types of code that I've been writing. Some of the issues are cosmetic, but most of the changes have to do with reducing the semantic gap between what you can say and what you'd want to say. In this spirit, I've espoused the following design principles:

NewScript - Design Principles



  • Personal Mastery - anyone should be able to learn the system in its entirety

  • Direct Manipulation - code is data, data is directly manipulable by the programmer.

  • Contextual Semantics - meanings are contextual, as in all human languages

  • Expressive and Concise - both code and documentation must convey meaning for humans

  • Comprehensible by Design - the software is simple enough to be fully understood

  • Sustainable Software - the software must be clean, efficient, and maintainable



NewScript - The Instruction Set


The new instruction set has 32 methods, broken down into 5 basic categories:

  1. Flow
     . ! ; ? ( ) 

  2. Stack
     _ <- -> ^ : 

  3. Register
     % #% , # @ #@ $ #$ 

  4. Math
     - + * /  << >> 

  5. Logic
     ~ & | \ = < > 



This instruction set is the list of methods for the Core object, which defines the base context for all code. If you read the documentation on the old NewScript instruction set, or used the current web app, you'll notice that quite a few of the terms have changed. There are also 10 fewer instructions than in the current webapp version. Of particular note is how flow control has changed. The words in the flow control method list consist largely of typical punctuation marks. This is intentional, so as to bridge the gap between English and your programs.

Flow



. return

The period represents return, and returns to the address on the top of the return stack

! call

The exclamation point represents a function call to the value on the top of the data stack. This is useful for vectoring.

; continue

The semicolon produces a coroutine call, by calling the value on the top of the return stack. This allows you to write simple green threaded code.

? conditional branch

The question mark will branch to the address on the top of the stack if the next value on the stack is non-zero

( for

The left parenthesis mark indicates the start of a counted loop, and pops the loop count off of the stack

) next

The right parenthesis mark decrements the loop count and tests to see if it is zero, if not it jumps to the start of the loop whose address is on the return stack



Stack



_ drop

The underscore drops the value on the top of the stack. Since the stack is 8 cells and circular, 8 drops in a row amount to a nop

<- push

The left arrow pushes the top of the data stack onto the return stack

-> pop

The right arrow pops the top of the return stack onto the data stack

^ over

The caret copies the next value on the data stack above the top of the stack, hence ab ^ aba

: duplicate

The colon duplicates the top of the data stack
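
To make the circular behaviour concrete, here is a small Python model of an 8 cell data stack with drop, over, and duplicate. It is a sketch under assumptions, not the VM's code: the class name, the method names, and the choice to track a single top index are all invented for the example.

```python
class CircularStack:
    """Toy model of an 8 cell circular data stack (names are hypothetical)."""

    SIZE = 8

    def __init__(self):
        self.cells = [0] * self.SIZE
        self.top = 0  # index of the cell currently on top

    def push(self, value):
        self.top = (self.top + 1) % self.SIZE
        self.cells[self.top] = value

    def drop(self):  # _
        self.top = (self.top - 1) % self.SIZE

    def over(self):  # ^
        self.push(self.cells[(self.top - 1) % self.SIZE])

    def dup(self):  # :
        self.push(self.cells[self.top])

    def peek(self, depth=0):
        """Read the cell 'depth' places below the top without moving it."""
        return self.cells[(self.top - depth) % self.SIZE]


stack = CircularStack()
stack.push("a")
stack.push("b")
stack.over()  # ab ^ aba
```

Since drop only moves the index around the ring, calling it 8 times brings the top back to where it started, which is why 8 drops in a row behave like a nop.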



Register



% object

The percentage mark sets the obj register to the top of the stack

#% get object

The combination hash percentage gets the value in the obj register

, set

The comma stores the value on the top of the stack to the address stored in the destination register, $, and increments the destination register

# get

The hash mark fetches the value stored at the address contained in the source register, @, and increments the source register

@ source

The at sign sets the value of the source register to the top of the stack

#@ get source

The hash at sign combo fetches the value of the source register

$ destination

The dollar sign sets the value of the destination register to the value on top of the stack

#$ get destination

The hash dollar sign combo fetches the value of the destination register
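
The auto-incrementing source and destination registers make sequential copies very short. The Python model below is only a sketch: the class and method names are hypothetical, memory is modelled as a list of cells, and the increment is assumed to be one cell per access, which the description above does not spell out.

```python
class Registers:
    """Toy model of the source (@) and destination ($) registers."""

    def __init__(self, memory):
        self.memory = memory  # list of cells, assumed cell-addressed
        self.src = 0          # the @ register
        self.dst = 0          # the $ register

    def set_source(self, address):  # @
        self.src = address

    def set_destination(self, address):  # $
        self.dst = address

    def get(self):  # '#': fetch through the source register, then increment
        value = self.memory[self.src]
        self.src += 1
        return value

    def set(self, value):  # ',': store through the destination register, then increment
        self.memory[self.dst] = value
        self.dst += 1


memory = [10, 20, 30, 0, 0, 0]
regs = Registers(memory)
regs.set_source(0)
regs.set_destination(3)
for _ in range(3):
    regs.set(regs.get())  # copy one cell; both registers advance
```

After the loop the last three cells mirror the first three, and neither address had to be recomputed by hand.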



Math and Logic



The math and logic methods are rather easy to understand. Unlike the funnily named ones above, these are pretty straightforward:








Math   Operation          Logic   Operation
-      negate             ~       complement
+      add                &       and
*      multiply           |       or
/      divide & modulus   \       xor
<<     shift left         =       equality
>>     shift right        < >     less than, greater than


The only tricky thing here is that there is no subtraction, and the valid range of literal values runs from 0 to 07fffffff in hexadecimal notation. So if you want to perform subtraction, you typically add a negative. Hence you'll see code such as:

subtract - + .
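
The same negate-then-add trick, written out in Python as a sketch. It assumes 32 bit two's complement cells, which matches -1 being 0ffffffff elsewhere in these posts; the function names are made up for the example.

```python
MASK = 0xFFFFFFFF  # assumed 32 bit cell width


def negate(x):
    """Two's complement negation within a 32 bit cell (the - word)."""
    return (-x) & MASK


def add(a, b):
    """Addition that wraps around at 32 bits (the + word)."""
    return (a + b) & MASK


def subtract(a, b):
    """a - b expressed as a + (-b), mirroring 'subtract - + .'"""
    return add(a, negate(b))


difference = subtract(10, 3)
```

This also explains how negative constants are produced despite the literal range: 1 - yields 0ffffffff, as the Term object in the sample program below does.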


A Simple Program



Now that you understand the Core object methods, I can demonstrate a little program that I wrote to test the keyboard handling and printing to an ncurses interface. Currently if you read/write to an address of -1 you'll interface with the keyboard and terminal window. This is currently the only device interface on the new C VM, and I'll bring it in line with the OpenGL + Stereo sound of the old one soon. But for experimenting with new ideas, the current code base is sufficient. Without further ado, a simple program:

Copyright 2009 David J. Goehrig

The Term object provides basic interaction with the terminal window by driving character values out the port
located at 0ffffffff. Currently that address is the read/write port for all text io.

	Term
		key 1 - @ # .
		emit 1 - $ , .

The Character object tests to see if the tos has the given key value

	Character
		space 32 .
		tab 9 .
		enter 10 .
		whitespace? : space = ^ tab = | ^ enter = | .

The KeyTester object tests to see if various characters entered at the keyboard are white space

	KeyTester
		run Term key : emit Character whitespace? 66 + Term emit _ 32 emit .

Finally the App start method is the thing that kicks this whole thing off.

	App
		start 1 - ( KeyTester run ) .

The End

Comments, Legible Code, and Formatting


In this example, you can see how comments (lines which start with no tabs) are intertwined with code (lines which start with 1 or more tabs). This design allows for documentation to be a first class citizen. The excuse programmers give for not documenting their code is that as the code changes, the documentation skews. In most programming languages, comments require special formatting, extra delimiting characters which set them off from code. In NewScript, it is code that we distinguish by indenting it. This makes it easy to maintain your inline documentation, as you're not fighting your code formatting. Unlike other languages, NewScript has no end-of-line comments.

Objects are declared by writing a word on a line with a single tab before it. By convention the object's name should start with a capital letter. This makes it easy to distinguish from ordinary method calls. When an object's name appears in code, it changes the context, so that all method invocations are sent to it, rather than whatever the current context being defined is. This allows you to change context, as necessary, within a definition.

Methods of objects are declared by placing two tabs on the line before the name of the method for the current object. If you changed context in the last definition, the context will be reset to the proper object. Methods can be called recursively, and require no special care, beyond the correct context being set. A method of the current object is invoked simply by writing the method name in a definition. A method which appears before a period or question mark will be optimized as a tail call. As a rule, loops containing method calls should not contain nested function calls more than 6 deep.

NewScript requires highly factored code. The flow control structures are designed to promote defining single lines of code. The use of parentheses to indicate a looping construct intentionally matches some data-flow analysis techniques, which use similar notation to indicate repetitions. This sample application uses a loop with a loop index of -1, which is the largest loop count you can have. To perform an infinite loop, one would use a recursive function like this, rather than a stack-expensive loop construct:

	App
		start KeyTester run App start .

NB: We had to declare the recursive application of App start in full, as we had changed the context to KeyTester! This is similar to how KeyTester run switches between Term and Character and back to Term to handle the io and value tests. All in all, this syntax pattern closely matches the subject-verb agreement that most English speakers will find refreshingly familiar.

Sunday, June 7, 2009

Javascript Changes - Phos Library

Well I've begun switching back from SVG to Canvas. After playing around with the internals of the new Firefox beta, I've grown rather weary of trying to get anything to run well on that incredibly poor code base. As a result, I back-ported the SVG code to Canvas, and bundled all of the support code into a library called phos.js. It is released under GPL3, and can be used in any number of web-apps. So enjoy. If you want to track development, the latest version is available through git:
git clone git://github.com/cthulhuology/Phos.git
This should also mean that any changes to Phos (pun intended) to support other browsers should make its way back into the NewScript/JS environment. And since the NewScript/JS environment will eventually run the same virtual device driver stack as the NewScript/C emulator, this will be largely ignorable by the typical programmer.

NewScript/C Status Update

I am currently about 1/3rd of the way through writing NewScript in NewScript using the NewScript/C bootstrap compiler. I'm pretty much done with the memory layout, device driver support, and a simplified compiler. The emulator seems to be working pretty well, with no major bugs yet encountered. Speed is a little on the slow side, but most of the profiling work demonstrates there's not a lot that can be done to improve it.




Wednesday, June 3, 2009

NewScript in NewScript

So for the past couple of nights, I've been spending a little bit of time on getting the NewScript environment implemented in NewScript itself. As I'm only using the bootstrap compiler, there are a number of things that will change in the final implementation of NewScript in NewScript. First off, the bootstrap compiler doesn't support the Macro object, so there aren't any useful compile time macros. It also doesn't support Editor extensions (which will handle edit time macros), and as a result requires a lot of tedious constants declarations.

Memory Management is for Suckers

One of the simple things I've done in the design is I've dedicated 4k of the image to System variables. Each of these has a name in the System object, and can be accessed using fetch and store operators. The variables have also been laid out in a linear fashion so that they can be reset using very simple code. For example, all of the word parsing related buffers can be cleared with a statement like:
reset                
 index !% 0 % % % % , -1 % % % % ,                        
  reset inputs and buffers falls through to next word
This simply fetches the address of the index variable, 0x1007, and then proceeds to set it and the following 3 variables to 0. Then the contents of the input buffer, (4 cells or 16 characters), are all set to -1, aka 0xffffffff. Reset then falls through into a routine which attempts to grab characters until either 16 characters are encountered, or a space is found. All of the addresses are known at edit time by design. There's no guesswork. The addresses have names to make the intent more clear, not to hide information from the programmer.
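
For readers who want to see the effect of reset without reading the stack code, here is a Python sketch of the same memory update. It rests on assumptions: memory is modelled as a list of 32 bit cells addressed by cell, and the 4 cell input buffer is taken to sit directly after the 4 variables that start at index (0x1007), which is how the description reads but is not stated outright. The names INDEX and reset_buffers are made up.

```python
INDEX = 0x1007  # address of the index variable, per the text above


def reset_buffers(memory, base=INDEX):
    """Zero four variables, then fill the next four cells with -1."""
    for offset in range(4):
        memory[base + offset] = 0            # index and the following 3 variables
    for offset in range(4, 8):
        memory[base + offset] = 0xFFFFFFFF   # input buffer: 4 cells, 16 characters


memory = [0x12345678] * 0x2000  # arbitrary filler so the change is visible
reset_buffers(memory)
```

Because the variables are laid out linearly, the whole reset is two straight runs of stores with no address arithmetic beyond an increment.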

NewScript Way

By eschewing unnecessary complexity, we end up with code that is correct by design. Testing is less a process of figuring out whether the code works, and more one of verifying there aren't any typos. By thinking about the layout of data in memory, we can optimize the organization of memory so that we can access it sequentially, ensuring optimal cache characteristics. Moreover, by designing our code in a similar fashion, we can avoid unnecessary branching, loops, and stack manipulation. The NewScript way is the easy way.

With the syntax highlighting editor, it is difficult to make a mistake. In fact, the same code could implement a feature where it would be impossible to type a word that isn't defined already. You could still type the wrong word, but at least you'd be guaranteed the word you did type at the very least exists. I am planning on doing similar things with memory, adding visualization techniques and tools to help plan out memory access at edit time. Rather than building an elaborate garbage collection apparatus, I would rather design tools that make laying out memory by hand easy. That's after all the NewScript way.

I've also been giving some thought to how objects will be represented in the memory image, and about how best to share code with other people. I'm leaning towards the idea of shared borders. When you're working on a project in a group, each team member will be represented by a region of the screen's border. To share code you simply hand it to them, and it pops up on their machine. When the other user is offline, you can simply drag the code onto their icon in a "Friends and Groups" region of the workspace, to send it to them asynchronously. Direct manipulation is the easiest interface to master, and that corresponds with the NewScript way.

NewScript Applications

I have also broken out the VM (all 422 statements of it) as a "library" that can be embedded into other applications. This change largely consisted of adding a header file, and tweaking the API of the boot() function to accept a filename. I'm probably not going to add hooks to make it easy to interface with the VM internals, as there's so little to interface with.

The reason I broke it out was to support creating a few stand alone NewScript based applications that use the VM to run embedded image files. The idea is simply to take a NewScript system image, place it in one of the executable's data segments, and load it into "rom". This way, you can write very simple, self contained NewScript applications that are easy to distribute.

In order to prove out this concept, I'm writing a little font editing application. I've been looking at available font rendering code, and don't care for most of the fonts I've seen. While I could use the GLUT stroke fonts, they seem like a lot of bother for very little. So I'll do my own version of them, that will be renderable using the NewScript VGDD engine. When the FPGA version is ready, I'll have fonts that it will be able to display. So the extra effort won't be wasted in theory.


Monday, June 1, 2009

C Compiler Changes

Earlier today, I did some minor cleanup on the NewScript WebApp and checked the code into a new git repository. I've removed the remainder of the old website, and redirected www.newscript.org to the WebApp itself. I also fixed up the Javascript implementation of the VM enough that it now sort of works. There's a few minor glitches with the runtime loop, but I didn't have enough time available to suss them out.

I have also checked in some of the final changes to the bootstrap compiler that is implemented in C. The code is fairly lean, and is in need of only a few minor tweaks before it will be ready for compiling the first system images. To give you an idea of how complex this code actually is, here are the quick and dirty metrics one of my tools pumps out:
Globals: 20
Functions: 25
Statements: 165 (6.6)
Lines: 293
So as you can see, the code really isn't that large. The statements/function ratio of 6.6 is a bit high for my tastes, but that will go down a bit when I remove the debugging information that plagues the current code base.

Future Development

Now that the bootstrap compiler is up and running, with only a few minor issues left to iron out, I can start writing a version of the NewScript Editor in NewScript to run inside the emulator. With the Javascript version now capable of running the same instruction set as the emulator, it should be possible to load the same system images into the WebApp as well. However, until I get around to implementing the emulated video display device, PCM audio output, keyboard, and mouse devices in Javascript, it will still be impossible to make useful "WebApps".

NewScript in NewScript will solve a few of the issues still left unresolved with the Editor's design. For example, Core words are compile time macros, but Macros are supposed to be edit time macros. Getting these elements working correctly will be much easier when the environment is written in NewScript entirely. As a result, this will mean that the Javascript app will morph into a version more like the C emulator, and effectively use the web browser as a very slow emulator.

As there is no proper NewScript Editor written for the bootstrap compiler, the new compiler parses ASCII text files using a very simple format:

Object
	verb
		code
			comment
In this format it is the number of tabs before the first word on a line that determines the meaning of the word. 0 = Object, 1 = verb, 2 = code, 3 = comment. This makes typing up code in a plain text file a manageable affair. It also mimics the layout of the object slots in the NewScript WebApp.
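
A reader for this format needs little more than a tab counter. The Python sketch below follows the 0 = Object, 1 = verb, 2 = code, 3 = comment rule stated above; the function name parse, the decision to skip blank lines, and the error raised for more than 3 tabs are assumptions, since the post does not say how the bootstrap compiler treats those cases.

```python
KINDS = ("object", "verb", "code", "comment")  # indexed by leading tab count


def parse(text):
    """Classify each non-blank line by the number of tabs before its first word."""
    items = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no meaning
        body = line.lstrip("\t")
        tabs = len(line) - len(body)
        if tabs >= len(KINDS):
            raise ValueError("more than 3 leading tabs: %r" % line)
        items.append((KINDS[tabs], body.strip()))
    return items


sample = "Object\n\tverb\n\t\tcode\n\t\t\tcomment\n"
parsed = parse(sample)
```

Each tuple pairs the kind with the line's text, which is enough to drive a simple compiler loop that switches behaviour on the kind.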

New NewScript Blog

I've started a new developer blog, mostly so that I can clean up the webapp code and leave only the minimal amount of code necessary to push it into a git repository. Currently the entire www.newscript.org site is being served as part of the webapp, and that is making it difficult to roll out changes to each independently. And since there is now a mailing list, a wiki, a git repository, and a pair of blogs, it just makes sense to delegate the responsibility a little more.

The Road Thus Far

So far, I've done a few public presentations on NewScript, the architecture, the design, and the direction. Mostly the reception has been positive mixed with stunned disbelief. Overall I'm rather pleased with the reception. The VM has undergone several major revisions since I started releasing versions of the VM, but that's been a bit of a moving target for years now.

On the editor front, I've done two major revisions, the HTML Canvas and the SVG implementations, and I'm finishing up the third, an OpenGL implementation. After the OpenGL version is done, I'm probably going to revise the webapp version to incorporate an HTML Canvas implementation of the same interface. While the SVG version looks beautiful, there are too many changes necessary to make it perform well on most platforms. While the SVG version displays on the iPhone, it is just too slow to use there. Part of this is that the SVG implementation was a direct port of the canvas one, and that resulted in a suboptimal usage of the SVG tree.

The native compiler has had a few major revisions as well. The original Intel compiler is now basically unusable, as the opcodes' behavior has changed substantially. The native opcode set has been implemented in Javascript and C, but only the C version has a working interpreter. There is also a primitive AVR32 compiler fleshed out but not tested.

Probably the most exciting thing to have been completed is a well-documented hardware architecture for the soon-to-be System on a Chip (SoC) that will take NewScript code performance to the practical-for-everyday-coding level. With the addition of the virtual devices to the C system emulator, there is now a cross platform API to which NewScript programs can be developed and launched.

The Road Ahead

With so much flux, the roadmap has become a little murky. But this is my overview of what is coming down the pipe for the remainder of 2009:
  • C implementation of VM finished
  • C implementation of bootstrap compiler finished
  • NewScript in NewScript implemented: editor, compiler, system library
  • Javascript port of the VM revised to match C implementation
  • Milestone 1: Portable image based development environment (Summer)
  • Cross compiler for Intel
  • Cross compiler for AVR32
  • Cross compiler for ARM
  • Milestone 2: Portable native development environment (Fall)
  • Verilog description of SoC
  • VHDL description of SoC
  • Synthesis on a high end FPGA
  • Synthesis on a low cost FPGA
  • Milestone 3: Hardware development platform (Winter)
This basic roadmap would be outrageously optimistic if it weren't for the fact that in the past 3 months, I've implemented the VM 3 separate times on 3 different platforms. Also since the SoC has been part of the overall design for the past two months, many of the basic design decisions have been tested repeatedly already. By the time that I'm ready to start finalizing the hardware design, I will have implemented the VM at least 4 more times, for 4 different platforms. Since the C and Javascript versions of the emulator are producing images that will run natively on the NewScript hardware, without any additional translation, the full software stack will be ready for the hardware.

So enough talk... back to work.