Cmm programme details

1 Overview

This file contains some information about the design and implementation of Cmm. The directory containing this page (./) contains the Cmm source code.

See ../users.html for a description of the Cmm C++ language extension and Cmm usage information.

2 Brief description of the Cmm source code

Cmm is written in (hopefully portable) C++. It was designed in C++ rather than UML or similar (which I'm not going to apologise for BTW) and is very much work in progress.

The main component is a simplified C++ parser. This parser doesn't do a complete parse - for example by default it ignores the {...} in all function and class definitions (the -detailedparse switch can be used to change this). All types and functions defined by the parser are in the namespace parser.

The other main components are support for cmm -c, cmm -l and cmm -s. These are contained in the structs SecondPassInfo and ThirdPassInfo respectively, both in the namespace Cmm - see cmm.h.

The generated parse tree consists of statically-typed pointers to various structs that represent different C++ constructs. These structs all inherit from the base type parser::node_t, and the inheritance tree for the different node types reflects the C++ grammar. See parser.h for details of the parse tree representation.

The source code that is parsed is kept in memory as a zero-terminated C-style string, and each node that makes up the parse tree contains pointers into this string (not a copy of parts of this string).

Lexical elements like identifiers, keywords and operators are represented as individual node types. For example, the C++ keyword int is represented as a Keyword_int, and the C++ operator -= is represented as a keyword_MINUSEQ; both of these derive from the struct keyword which in turn derives from node_t. C++ RTTI is used to determine the type of a node in the parsing code. For example, the lexer's lexer::GetNext() method returns a node_t*, and the parser then uses C++ RTTI to determine the actual type of the node. This approach is also used in higher-level parsing - for example after getting a top-level node, the parser use RTTI to determine whether the node represents a function. See lexer.h for information on the lexer interface, and parser.cpp for the parsing code.

The parser is hand-written C++. I have used lex and yacc in the past, but have tired of their indescriminant use of macros and global variables. Also, there isn't a readily available yacc grammar for C++. There are some interesting comments about parsing C++ in section 3.3.2 (page 68) of Stroustrup's The Design and Evolution of C++.

2.1 Limitations

The parser doesn't do any name lookup. Unfortunately C and C++ are impossible to parse correctly without name lookup (e.g. A * B;is a declaration if A is a type, otherwise it is an expression), so the parser doesn't always get things right. In particular, it looks for a declaration before looking for an expression. So foo( x); will always be treated as a declaration even inside function bodies, where it is more likely to be a function call.

The parser leaks memory - the parse tree nodes contain plain pointers to subtrees and have no destructors. This simplifies the code - there are no reference counted pointers etc - and allows the parse tree to be manipulated easily. For example, when generating a new function with the same parameter types as an existing function, one can simply make the new function point to the existing function's parameter node.

If leaking memory had to be avoided, it would be possible to keep a list of all allocated nodes by making the base node's constructor parser::node_t::node_t() update a static list of allocated nodes. An alternative would be to use referenced-counted pointers everywhere. Cmm is an experimental command line programme that doesn't run for very long so, for the moment at least, the memory leaks don't really matter.

The generation of functions also leaks memory - heap-allocated std::stringstream's are used to contain pieces of source code that make up the functions. I've had a few problems with VC++ and STLport's std::stringstream class which has complicated this slightly.

The parser is rather slow, mainly because of the lexer - it does a simple comparison with each lexical element using the function StartsWith() in parser.cpp, until it finds a match. Of course there are much better ways of extracting lexical elements, but I haven't had time to improve this aspect of the programme yet.

3 Operation for cmm -c

The input Cmm file is parsed, and a Cmm::SecondPassInfo is constructed from the parse tree (see cmm.h). The constructor for this class walks through the parse tree looking for all virtual functions, and also for all functions that could be implementations of known virtual functions. This information is used to generate the Cmmi file and output C++ source making various changes to convert from Cmm to legal C++.

4 Operation for cmm -l

A Cmm::ThirdPassInfo is constructed (see cmm.h) and then passed each of the input Cmmi files in turn. The virtual and implementation function information is extracted and organised so that the implementations for each virtual function are known (independant of which Cmm file they originated in). This allows dispatch functions to be generated and appended to one of the implementation files.

5 Format of Cmmi files

Cmmi files are designed to be written and read by the same Cmm programme during a build so the format doesn't need to be fixed forever. The current format is:

Cmmi:        virtualfn-spec*
virtualfn-spec:        virtualfunction "<c++filename>" <virtual fn prototype> <implementation>*
implementation:        implementation "<c++filename>" <numparams> paraminfo ... <matchfn> <callfn>
paraminfo:                <derivationdepth> <classname> <classname> <classname> ...

It's probably much easier to look at an example to understand the format - see ../examples/overlap/main.cmmi. The code that writes/reads Cmmi files is in cmm.cpp.

6 Dispatch functions

Example dispatch code can be seen in ../examples/overlap/main.cppextra. It might be easier to read this rather than try to understand the textual explanation below.

The dispatch function has to decide which implementation function to call, given the particular set of dynamic types of it's parameters. Finding this is a slow process, so results are cached in a static std::map that maps from a std::vector of std::type_info*'s to a function pointer.

The std::map and std::vector container types can be changed with cmm's -map and -array parameters. This is useful because the dispatch speed depends primarily on these containers.

There are two support functions that cmm -c generates for each implementation it finds. cmm -l also declares these functions just before the dispatch function, so that the dispatch function can call them.

The first support function takes just the virtual parameters, and does a dynamic_cast<> on each one to test whether it is compatible. This function returns true if the implementation matches or false if it doesn't.

The second support function is the implementation that the dispatch function calls. It has exactly the same signature as the virtual function, i.e. it takes references to base types, rather than the implementation's derived types. This allows the dispatch function to use it without knowing anything about the derived types - they could be undefined in the dispatch function's compilation unit. This function uses static_cast to get the derived types - it assumes that the dispatch function is correct and is actually giving it the appropriate derived dynamic types - and then calls the real implementation.

7 Algorithm for finding the best implementation

Code generated by cmm -c uses a faulty algorithm. Cmm -s uses code in dispatch.cpp, which is correct (though slower).

8 Cmm -s

This mode works by using a library dispatch.cpp that accepts registration of multimethod implementations at runtime, and also provides a generic dispatch function that works for all multimethods. For each implementation, Cmm adds a global variable whose initialisation causes the implementation to be registered with the library.

Because multimethod dispatch is done using the library, the actual multimethod-specific code is very simple, and so each compilation unit that contains a multimethod implementation will also get an implementation of the dispatch function. This function is defined as inline to allow it to occur in multiple compilation units. Whether the final executable ends up with multiple copies of this function depends on how good the linker is at removing duplicate code.

The generic dispatch works by dealing with arrays of void* which point to the virtual parameters. The implementation-specific match functions first static_cast these to the appropriate base class before using dynamic_cast to decide whether they have the correct dynamic type. The generic dispatch function is also given an array of std::type_info* which are used to to address a cache, in the same way as with cmm -c/-l.

This all means that there is more overhead involved in a dynamic dispatch. Instead of Cmm generating a complete dispatch function for each multimethod, it generates a functions which puts the addresses and typeids of the virtual parameters in two arrays, and calls the generic dispatch function, which returns a pointer to the appropriate implementation. The generic dispatch function does much the same things as cmm -l's dispatch function, so the overhead is the creation of the arrays and the extra function call.

Please note that the code in cmm.cpp is pretty horrendous, and needs a severe cleanup/rewrite.

9 Future

As well as fixing the limitations mentined in ../users.html, the following are things that I'd like to try:

9.1 Dynamic loading

Cmm -s allows runtime modification of the available implementations, and things work automatically when DLLs are loaded/unloaded..

9.2 Unrestricted dynamic dispatch

The idea is to have the caller decide that a function call should be resolved dynamically, with no restriction on having base classes for each parameter. This would be done when the call to a function prefixes one or more arguments with virtual. This makes the dynamic dispatch case similar to conventional compile-time overloading, where no base types are required when lookup is performed.

As of cmm-0.27, the experimental -caller-dispatch flag (see test107 and examples/caller-dispatch.cmm) supports this usage.

9.3 Optimising of single virtual parameter case

A virtual function with a single virtual parameters is equivalent to a conventional C++ virtual method. Cmm has access to all source code when a project is built, so it could add virtual member functions to classes and make dispatch functions call these methods directly. For example a Cmm file:

struct Base
{   ...
};
void Foo( virtual Base& x, double y);

struct Derived : Base
{        ...
};
void Foo( Derived& x, double y)
{   // implementation for class Derived.
}

- would be translated into:

struct Base
{   ...
    virtual void cmm_Foo( double y) = 0;
};
inline void Foo( Base& x, double y)
{
    return x.cmm_Foo( y);
}

struct Derived1 : Base
{   ...
    virtual void cmm_Foo( double y)
    {
        Base& x = *this;
        // implementation for class Derived
    }
};

And no separate dispatch function would be required.

There is one difference here, in that a compile time error will be given if the original Cmm source doesn't contain an implementation of Foo( Derived&, double) for some class Derived. Cmm as it stands would give only a runtime error. However, one could make Cmm output a Base::cmm_Foo( double) that throws an exception.