The newly released ds library (version 1.3) now includes an efficient deque implementation.
What is a deque?
A deque is a double-ended queue and pronounced “deck”. In contrast to a queue, which allows insertions at one end and removals at the opposite end, a deque allows insertions and removals at both ends. The new Deque interface defines four methods to modify the Deque along with two methods to access the element at the front (head of the list) and back (tail of the list):
interface Deque<T>
{
function front():T;
function back():T;
function pushFront(x:T):Void;
function popFront():T;
function pushBack(x:T):Void;
function popBack():T;
}
There are two common ways to implement a deque – by using a doubly-linked list or by using arrays. While the first option is very simple and straightforward, the latter one is more of a challenge because there are many approaches which all have their pros and cons.
This is basically just a stripped down, lightweight version of the doubly linked list class DLL. A doubly-linked list is capable of doing insertions & removals in constant time, but the operation itself is rather slow, since we need to instantiate a node object that stores the element and manage pointers to prevent the list from falling apart.
Another important thing to watch out is that the LinkedDeque class requires considerable more memory – for each and every element we need a container object for storing the “cargo” (+16 bytes), a reference to the previous and next node (+4 bytes each) and finally a reference that points to the element (+4 bytes). So in Flash we need a total of 28 extra bytes per element. This means that 86% of the space is wasted on nodes and pointers!
Despite those shortcomings it’s still useful for small to average sizes; and to minimize costly node object instantiation the LinkedDeque class (and all linked structures in ds) comes with built-in node pooling – all you have to do is to create a LinkedDeque object with a reserved size greater than zero:
var deque = new LinkedDeque(100); //reuses up to 100 nodes
This is very useful if the maximum size is known in advance, as performance nearly doubles. The benchmarks below were all done with “node-caching” turned on.
While most sites explain how a doubly-linked list relates to a Deque, there is not much information available on how to efficiently implement a deque on top of an array. My first attempt used a circular array similar to the ArrayedQueue class. It seemed to be a good solution until I realized that resizing the deque would be slow and difficult to implement so I discarded this idea and started from scratch. This time I decided to use an array of arrays (turned out the C++ STL deque uses this approach) and I was very happy with the result.
Let me briefly explain how it works. Data is organized in chunks or blocks of memory. A block is simply a fixed-sized array and all blocks are stored in a separate, dynamic array. Upon initialization the deque only contains a single block but once it fills up an additional block is allocated: adding elements to the back calls something like “blockListArray.push(newBlock)” and adding elements to the front calls “blockListArray.unshift(newBlock)”.
Inserting elements at the beginning of an array is slow but in this case it doesn’t matter because the list of blocks is usually very small and it doesn’t happen very often. Therefore we can say that the ArrayedDeque performs modifications at both ends in amortized constant time.
Performance
So how do both classes perform? Showing raw numbers would be pointless without being able to compare them to some reference values. For this reason I’ve created a minimal deque implementation called “NaiveDeque”:
class NaiveDeque<T> implements Deque<T>
{
var _data:Array
public function new() {
_data = new Array();
}
public function front():T {
return _data[0];
}
public function pushFront(x:T):Void {
_data.unshift(x);
}
public function popFront():T {
return _data.shift();
}
public function back():T
{
return _data[_data.length - 1];
}
public function pushBack(x:T):Void {
_data.push(x);
}
public function popBack():T {
return _data.pop();
}
}
As you see it just uses what the Flash language has to offers. Clearly, the problem which the code above is that modifications at the beginning of the array are slow because a lot of memory is shifted around, either to fill a gap or to make room for a new element. So we kinda have the worst-case at the front and the best-case at the back. It’s not very fair to use this as a reference but on the other hand it shows how things can be drastically improved by using some elbow grease to write clever data structures. Here are the results:
size
1000
2000
3000
4000
5000
NaiveDeque
1.0
1.0
1.0
1.0
1.0
LinkedDeque
1.65
4.4
6.6
7.2
7.7
ArrayedDeque
5.0
9.3
11.7
14.8
19.3
Speed relative to “naive deque”, higher is better
The benchmark results were taken by equally distributing n elements at both ends, like this:
//insert...
for (i in 0...n)
{
if ((i & 1) == 0)
deque.pushBack(x);
else
deque.pushFront(x);
}
//remove...
for (i in 0...n)
{
if ((i & 1) == 0)
deque.popBack();
else
deque.popFront();
}
Usage
One real-world application that comes to my mind is a software application’s list of undo operations; but as a deque is a very flexible abstract data structure it can be used in countless algorithms. The deque classes are all documented so I hope it’s clear how to use them. If you have problems feel free to contact me.
Here is a minimalistic FlashDevelop AS3 project that demonstrates how to use the Molehill API to draw a single 2D sprite on the screen. You can get the project here: fdproject-mole2d.zip.
To compile the example you need the Incubator version of the Flash Player, a new FlexSDK (“Hero”) and an updated playerglobal.swc which includes the Molehill API.
All files can be downloaded from Adobe Labs.
After downloading and unzipping the SDK to the folder {flexSDK} create a directory named “10.1″ in {flexSDK}\frameworks\libs\player\ and copy the new playerglobal.swc into it. Next fire up FlashDevelop and in Settings->AS3Context update the Flex SDK Location so it points to the new SDK. Finally just unzip & extract the FD project and hit F5 to compile.
Here are the basic steps to draw a textured quad:
Create a vertex buffer with 4 points to define a quad.
Create an index buffer that keys into this vertex buffer to define two triangles.
Create and upload a texture.
Compile the vertex and fragment programs (I’m using the AGAL mini-compiler from Adobe).
Then on every frame:
Assign the vertex and fragment program.
Transform the sprite and create a model-view-projection matrix for the vertex program. Since this is a 2D example we have to set up an orthographic projection.
Draw the triangles.
The mvp transformation could be done by the vertex shader but to keep it stupid simple I’m creating the matrix in AS3. Also note that we need to flip the y-axis and shift the origin to the upper-left corner.
Assuming you have the incubator Flash player installed the result looks like this:
The Dictionary class – a dead end for other targets
Recently I started to adjust my code with the C++ and JavaScript haXe target in mind. I have to say it’s really amazing to see your code running on different platforms. This offers an incredible amount of flexibility! While almost everything worked out of the box, the biggest hurdle was to find a replacement for the flash.utils.Dictionary class, which I was using extensively in my library.
Although the haXe API already offers a Hash and IntHash class (both using the Dictionary for the flash target), I’ve decided to implement my own hash tables so they blend in nicely into my ds library.
What are hash tables?
A hash table is a very important and fundamental data structure in computer science. It offers rapid insertion, access and deletion of sparse data. This is achieved by distributing the data amongst a fixed number of slots using a hash function.
There are many ways of implementing hash tables, but from my experience (and from reading dozens of papers) the easiest solution is something called a ‘standard-chain hash table’, which simply uses linked lists as a collision resolution scheme. When linked lists are replaced with dynamic arrays, the result is called an ‘array hash table’, which offers a smaller memory footprint because it eliminates nodes and pointers all together and is more cache friendly.
Other implementations include ‘bucketized cuckoo hash tables’ or ‘open-address hash tables’ using techniques like linear probing or double hashing for collision resolution. Those are interesting if you are into computer science, but from the point of view of a game programmer, they don’t offer any advantages, but have drawbacks of some kind (problems with high load factor, difficult to implement, complex hashing and so on).
The hash table implementation which I’ve included in the ds library is somewhere in between a standard-chain hash table and an array hash table – the data is stored in an array while still offering basic advantages of a linked list, namely removal in constant time and move-front-on-access by pointer manipulation.
So the procedure for searching a key in my code is:
Hash the key to acquire a slot.
Check if the slot is empty. If it’s empty, the key does not exist and the search fails.
Otherwise scan the array a key at a time until a match is found or until the array is exhausted (search fails).
If the key was found, move it to the front of the list by pointer manipulation. This way further queries to the same key take less time (optional).
Implementation overview
The latest version of my ds library now contains a bunch of additional interfaces and classes:
Classes implementing de.polygonal.ds.Map:
IntIntHashTable – A hash table using 32-bit integers for both keys and values.
IntHashTable<T> – A generic hash table using 32-bit integer keys.
HashTable<K, T> – A generic hash table. Keys have to implement Hashable.
HashMap – The existing wrapper for the Dictionary class, which was updated so it now implements Map instead of Collection. Flash only!
Classes implementing de.polygonal.ds.Set:
IntHashSet – Similar to IntIntHashTable, this is a set for integer values.
HashSet<T> – A generic set. Values have to implement Hashable.
ListSet<T> – A simple set backed up by a list; this is a replacement for the former Set class that now defines an interface.
IntIntHashTable and IntHashSet are the most important ones since they provide the core functionality used in other classes. IntHashTable, HashTable and HashSet extend their behavior via composition.
The core classes use ‘alchemy’ memory and lots of inlining to achieve very good performance. I’ve documented all classes, and I hope it’s not that difficult to understand. Some notes:
Although hash tables implement a Map interface, they allow multiple entries of the same key following the FIFO rule. If you want to make sure that the hash table contains unique keys, use the setIfAbsent(key, val) method or manually check for existence using hasKey(key) before calling set(key, val). Example:
//duplicate key
var hash = new IntIntHashTable(1024);
hash.set(1, 2);
hash.set(1, 3);
trace(hash.hasKey(1)); //true
trace(hash.get(1)); //2
hash.clr(1);
trace(hash.hasKey(1)); //true
trace(hash.get(1)); //3
//ensure unique keys
var success = myHashTable.setIfAbsent(1, "value1"); //success == true
var success = myHashTable.setIfAbsent(1, "value2"); //success == false
//or
if (!myHashTable.hasKey(1)) myHashTable.set(1, "value");
As you know, the Dictionary lets you use the object ‘identity’ as a key, and not the value returned from calling toString() on it. This behavior is not available in JavaScript and other targets. So objects used as keys in a HashTable or as values in a HashSet have to implement the Hashable interface or extend from the abstract class HashableItem. Hashable defines a single method getKey() (or just a field key if using haXe) returning an unique integer value to identify the object. This is a little awkward, but I currently don’t see any other (simple) solution. Example:
class Foo implements Hashable {
public var key(default, null):Int;
public function new() { key = HashKey.next(); }
}
//or
class Foo extends HashableItem {
public function new() { super(); }
}
...
var item = new Foo();
myHashTable.set(foo, "myValue");
For the AS3 users out there, I’m providing the following swc files (xxx stands for debug or release):
ds_xxx_alchemy.swc: ds version with alchemy support for FP10.
ds_xxx_.swc: ds version without alchemy support but using Vectors for FP10.
ds_xxx_fp9.swc: ds version without alchemy support but using Arrays for FP9.
When using the alchemy version, it’s important to call free() to release all used memory once the object is no longer needed.
The Dictionary class
If you want to take a look at the hash table implementation used in the Dictionary, you can do so by downloading the Tamarin source code and open the file core/MultinameHashtable.cpp or core/avmplusHashtable.cpp. From my understanding (please correct me if I’m wrong!) flash uses an open-addressed hash table with quadratic probing for resolving collisions. This means that keys are directly stored within hash slots, while my hash table is open: in the case of a hash collision, a single slot stores multiple entries, which must be searched sequentially. Also the Tamarin implementation automatically rehashes the hash table when it’s full (load factor > 0.75), while I’m only allocating more capacity and leave rehashing to the user, because it’s an expensive operation and performance only slightly degrades when the load factor goes beyond 1.
Performance
So here are the numbers for the IntIntHashTable. All results were done with the latest Flash Player for Windows. You can compile & run the benchmark on your own, but I guess you have to adjust the number of iterations or you get a script timeout.
The chart below shows how the HashTable behaves when a slot contains more than one key. A load factor of 1 means that each slot contains only one key/value pair. A load factor of 2 means each slot has two pairs that have to be searched sequentially and so on.
So it’s very predictable. A load factor of 10 makes the hash table 10x slower.
The chart and table below compares read/write/access operations between a IntIntHashTable and the Dictionary. Both structures are preallocated to minimize the time it takes to find a key.
size
1024
2048
4096
8192
16384
32768
65536
131072
IntIntHashTable
0,08
0,14
0,29
0,57
1,15
2,29
4,75
10,03
Dictionary
0,84
1,54
3,3
6,7
16,5
44,76
97,72
218,32
x times faster
10x
11x
11x
11x
14x
19x
20x
21x
Results are in milliseconds
The last chart compares the generic HashTable implementation against the Dictionary class. The difference is not too big, still the HashTable is about 20-30% faster.
size
1024
2048
4096
8192
16384
32768
65536
131072
HashTable
0,28
0,57
1,15
2,3
4,7
9,28
20,55
39,45
Dictionary
0,37
0,72
1,35
2,73
5,6
13,48
27,45
52,48
x times faster
1,32x
1,26x
1,17x
1,19x
1,19x
1,45x
1,34x
1,33x
Results are in milliseconds
Conclusion
While in most cases you won’t notice a big difference I have some applications in mind like hierarchical spatial hash tables for collision pruning where the extra speed should be very noticeable. So what’s next for ds? Probably a deque structure and lots of bug fixing of course ;)
Starting with DS version 1.1, there is now a new MemoryManager available, which was almost written from scratch. My past experience showed that the former implementation was somewhat limited, as you were forced to preallocate a fixed amount of memory prior to your application’s entry point. The new version now supports dynamic memory allocation, so memory consumption grows with application demand (you can still define an upper limit to be on the safe side). It’s now also possible to resize memory space after it has been allocated. I’ve also simplified the API and optimized the code.
Using alchemy memory in HaXe is now as simple as writing:
var integerArray = new IntMemory(x);
integerArray.free();
The first line allocates space for storing x 32-bit integers, and the second line will free up used memory once it’s no longer needed. The same also applies to BitMemory (a bit vector), ShortMemory (16-bit integers), FloatMemory (32-bit IEEE 754 single-precision floats) and DoubleMemory (64-bit IEEE 754 double-precision floats). Each implementation provides a get/set method to read/write the value from/to index i (I hope HaXe gets support for overriding [] in the future so we can omit the get()/set() methods); an offset property that stores the memory offset in bytes and a bytes property indicating the number of allocated bytes. Custom classes can be realized by inheriting from the abstract MemoryAccess class (yes, HaXe has private constructors ;-))
A visual representation of the MemoryManager (roll over). Randomly allocates and deallocates memory while calling pack() at regular intervals. Colored blocks: allocated space; Black: empty space; White lines: block size.
Frequent allocation/deallocation leads to fragmentation so the user can request defragmentation by calling MemoryManager.defrag(). In addition to the defrag() method, MemoryManager.pack() frees up unused space so it can be garbage collected. To be exact: All existing bytes are copied from the current ByteArray accessed by the alchemy memory opcodes to a smaller ByteArray, and once the reference to the former ByteArray is GC’ed the former content is deallocated (setting the .length property of the ByteArray seems to be a bit flaky and sometimes results in a runtime exception).
A quick note about MemoryManager.BLOCK_SIZE_BYTES: This value defines the growth-rate of the memory. The default is 1024 bytes (must be a power of 2). Every time the MemoryManager runs out of memory, a multiple of this block size is added. The block size affects performance and storage efficiency. As a rule of thumb the block size should match the average size of all ‘memory arrays’ used in the application. Using a tiny block size when storing 2k images is obviously a bad idea.
A major advantage of using alchemy memory is that you can use the appropriate data size that fits your needs which brings down memory usage. For example if you know in advance that your numbers are in the range 0…100 you can use bytes, or if you need floats without full 64-bit precision, you can fall back to 32-bit floats. In AS3, you are constrained to 32-bit integers or 64-bit double precision floats. Using the ByteArray is not an option because it’s too slow.
Performance
Although the preview version of FP 10.1 slowed down memory access a bit (it seems that the latest release fixes most performance issues) using those special alchemy memory opcodes is by far the best way to work with numbers. Time for some real world examples!
Example 1: JPEG encoding
After porting the JPEG encoder from the Flex framework (mx.graphics.codec) to HaXe I’ve encoded an empty 1024×786 bitmap and compared the numbers:
Speed relative to the AS3 version (higher is better)
The HaXe version encodes the image in about 100 milliseconds, the AS3 version takes almost a second.
Example 2: Bit vectors
My next test compares an AS3 bit vector implementation from generalrelativity.org with the BitMemory from the ds package:
Speed relative to the AS3 version (higher is better)
Combining alchemy memory, optimized byte code and inlining gives some excellent results :)
Example 3: FP10 drawing API
The last test draws triangles with the FP10 drawing API and measures the time needed to build a GraphicsPath object from a command and a data vector (without calling graphics.drawGraphicsData()). The HaXe version uses my VectorRenderer class which utilizes alchemy memory as a temporary buffer.
Speed relative to the AS3 version (higher is better)
All results are based on the windows release player 10,0,42,34 and use the MemoryManager class.
A great HaXe feature is that you can define your own Iterator and execute it with the for-syntax.
It can be used in many different ways and drastically improves readability of your code. AS3 developers often need to look at the display list, so I wrote a basic DisplayListIterator to handle this task. Here is an example:
using de.polygonal.gl.DisplayListIterator;
...
for (i in Lib.current.stage) trace(i);
This will print out all display objects in the display list.
If you are not familiar with HaXe, Lib.current.stage points to the MovieClip of the Document class, and the using statement automatically creates a DisplayListIterator whenever it is called on a DisplayObjectContainer. So the statement ‘Class.method(arg)’ is transformed to ‘arg.method()’. Without ‘using’ I would have to write:
import de.polygonal.gl.DisplayListIterator;
...
for (i in new DisplayListIterator(Lib.current.stage)) trace(i);
Let’s finish with a simple example that changes the text color of all text fields to red inside a DisplayObjectContainer.
var container:Sprite = myTextFieldContainer;
for (i in container)
{
if (Std.is(i, TextField))
{
cast(i, TextField).textColor = 0xff0000;
}
}