One of the recent topics I dove into for my project Todool was Spell Checking.

Obviously spell checking itself tends to be very complex due to each language having their own complex rules, but you can accomplish basic English spell checking as described here.

In short here's how it works:

  1. load English ebook (ASCII)
  2. read through all words (lowercase all characters)
  3. add each word to a hashmap

Spell Checking is finding if a key: string exists in the map. You can also apply modification as described in the blog to check for offspring of the key you're searching for.

The ebook I'm using has ~30_000 unique words that will be added to the map - this can quickly use up a lot of memory in your program. Which is what we will dive into now

String Storage

We want to store all these unique words in a data structure, use as little memory as possible and also allow searching the data structure quickly.

Wikipedia has a good section on data structures that exist for strings.

The Trie stands out to me as the simplest way to describe a ASCII word tree.

Screenshot from 2022-10-28 13-27-10.png

The other data structures tend to quickly get complicated to implement - but offer other benefits like supporting UTF8 strings or offering less memory usage.

Trie Implementation (in odin)

Tries are very basic to implement and test out.

// [a-z] only
ALPHABET_SIZE :: 26

// treats nil pointer as non existant node
// array['a'] = nil -> 'a' doesn't exist
// array['b] = {} -> 'b' exists
Trie :: struct {
	array: [ALPHABET_SIZE]^Trie,
}

I'll leave implementing the insert & search operations for the trie up to you as an exercise.

You can see the data structure is really simple but it uses up 26 * size_of(^Trie) == 208B per Trie created.

More compact?

Here is another approach to reduce this by half, storing only 104B per Trie Node as indexes use 4B less.

// treats 0 as nil
//
// u32 is an index into some array where all the node data is stored
// instead of allocating nodes through: new(Trie)
CTrie :: struct {
	array: [ALPHABET_SIZE]u32,
}

// the root of the trie can be the first element of your data
// root := ctrie_data[0]
ctrie_data: [dynamic]CTrie

Anything better?

You can use a Linked List instead of an array, but this reduces the speed of the searching operation -> since now you can't directly access the searched characters anymore.

In the Spell Checker we want the searching operation to be the fast though - the insertion can be done at iniationlization once.

Is there a way to reduce this further? Give it a try yourself.

Setting Yourself Restrictions

Due to the nature of Trees it's hard to reduce memory usage as you have to support inserting elements at any node.

One way of achieving less memory usage is by setting yourself restrictions or finding out how you will use the trie in the end.

In my case I would like to

  1. preprocess the Trie Tree of my Spell Checker into a binary blob that I can bake into my application
  2. not ship the ebook itself
  3. never insert into the Trie Tree after it was initialized
  4. no post processing after I load the binary blob

Alphabet Bitfield

One way to replace the [ALPHABET_SIZE] array is by using a bitfield. We have 26 characters from [a-z] that could all exist - meaning 26 bits will be necessary.

We can fill such an alphabet bitfield in a u32 - here is a quick showcase of this.

main :: proc() {
	ALPHABET_SIZE :: 26
	alphabet_bits := u32(0b1101)

	for i in 0..<u32(ALPHABET_SIZE) {
		mask := u32(1 << i)
		
		if alphabet_bits & mask == mask {
			fmt.eprintf("found bit %v = %v\n", i, rune(i + 'a'))
		}
	}
}

results in

found bit 0 = a
found bit 2 = c
found bit 3 = d

Using the bitfield uses less memory and is still powerful enough.

We can count all the 1 in the bits to determine how many characters are supported: intrinsic.

We can check if the bits == 0 to skip processing early as the bitfield is empty and doesnt contain any valid bits.

Bitfield Trie "Compression"

The bitfield helps not using the array as we go from atleast 26B -> 4Bbut now we have the issue of not knowing which bit belongs to which Pointer or Index Handle.

As we want to reduce memory usage we will have to go the "relative" path, in which we know that after a bitfield -> there will be N trie nodes coming up.

I will present a way to go from a Trie or CTrie structure I've shown before to this "compressed" form of the trie that uses alphabet bitfields.

// globals

// contains bitfield u32 and [N]u32 indices into the comp array
// [u32][2x u32 u32] -> [u32][3x u32 u32 u32] -> [u32][0x] -> [u32][1x u32]

// index 0 is the root bitfield
comp: []byte
// index we advanced the compressed data
comp_index: int

We'll need one procedure to add a bitfield in the byte array and return the pointer so that we can modify it

// push u32 alphabet bits data
comp_push_bits :: proc() -> (res: ^u32) {
	old := comp_index
	res = cast(^u32) &comp[old]
	comp_index += size_of(u32)
	return
}

Next we need a way to create the necessary indice array based on the ones we got from counting.

// push trie children nodes as indexes
comp_push_data :: proc(count: int) -> (res: []u32) {
	old := comp_index
	res = mem.slice_ptr(cast(^u32) &comp[old], count)
	comp_index += size_of(u32) * count
	return
}

Combined together we can now recursively step through the CTrie and push the correct data to the byte array

// push a ctrie bitfield and its dynamic data
comp_push_ctrie :: proc(t: ^CTrie, previous_data: ^u32) {
	if previous_data != nil {
		previous_data^ = u32(comp_index)
	}

	alphabet_bits := comp_push_bits()

	for i in 0..<ALPHABET_SIZE {
		if t.array[i] != 0 {
			alphabet_bits^ = bits.bitfield_insert_u32(alphabet_bits^, 1, uint(i), 1)
		}
	}

	if alphabet_bits^ != 0 {
		ones := bits.count_ones(alphabet_bits^)
		data := comp_push_data(int(ones))
		index: int

		for i in 0..<ALPHABET_SIZE {
			if t.array[i] != 0 {
				comp_push_ctrie(ctrie_get(t.array[i]), &data[index])
				index += 1
			}
		}
	}
}

That's it! Essentially we're removing the dynamic nature of the tree and keeping memory usage at a minimum.

Search Operation

The search operation can now use this "compressed" form of the trie directly by using using bitfields and stepping through the memory correctly

Here is a simple example for my Spell Checker

comp_search :: proc(key: string) -> bool {
	alphabet := mem.slice_data_cast([]u32, comp)
	next := u32(0)

	for i in 0..<len(key) {
		b := ascii_check_lower(key[i])

		if ascii_is_letter(b) {
			letter := b - 'a'

			if res, ok := comp_bits_index_to_counted_one(alphabet[next], u32(letter)); ok {
				next = alphabet[next + res + 1] / size_of(u32)
			} else {
				return false
			}
		} else {
			return false
		}
	}
	
	return true
}

Source Code

You can find my implementation of the CTrie and the Bitfield Trie here

Results

The ~30_000 unique words take up:

SIZE in 8183968B 7992KB 7MB for CTrie -> default Trie would be 2x this
SIZE in 629532B 614KB 0MB for compressed

The compressed byte array can be exported to a file, loaded back on startup or baked into the application and be immediatly usable :)

End

Didn't think this blog would be this long but hopefully this demonstrates that even a basic data structure can be used for a basic Spell Checker.

Any time I searched for more memory efficient Trie's the answers would be: Use x because it's better or Use x library without going indepth.

Update 1

Another way to reduce the "compressed" result is to cut off tries with only 1 valid bit in long chains.

Here is an example where you can see the alphabet bit count.

2bit -> 2b -> 1b -> 2b -> 1b -|||> 1b -> 1b ->1b where ||| would be the region we shortcut and instead store the bytes linearly.

Ctrie Data

A minor thing I added for pre generating the CTrie was including a count so you don't manually have to check for the filled out array indices.

CTrie :: struct #packed {
	array: [ALPHABET_SIZE]u32,
	count: u8, // count of used array fields
}

Ctrie Push

Down below you can see checking for single branches only is a step we do before proceeding with the default recursive nature of traversing the Ctrie tree.

We check wether the curren trie has 1 alphabet bit only and check for any chains of this happening till the end of its trie tree.

When they are all 1 or 0 length only we can concatenate the characters together and insert that string into the byte array.

The string length is then stored in the first [0-26] bits and the rest [26-32] bits indicates the bitfield is a shortcut, which we can check for early.

// push a ctrie bitfield and its dynamic data
comp_push_ctrie :: proc(t: ^CTrie, previous_data: ^u32) {
	if previous_data != nil {
		previous_data^ = u32(comp_index)
	}

	alphabet_bits := comp_push_bits()

	// on valid sub nodes -> insert data
	if t.count != 0 {
		field: u32

		// check for single branches only
		if t.count == 1 {
			single_index = 0
			single_only := ctrie_check_single_only(t)
			// fmt.eprintln("single_only?", single_only)

			if single_only {
				// insert shortcut signal
				field = SHORTCUT_VALUE
				// insert string length
				field = bits.bitfield_insert_u32(field, u32(single_index), 0, 26)
				// insert characters
				comp_push_shortcut(t)
			}
		} 

		// if nothing was set
		if field == 0 {
			data := comp_push_data(int(t.count))
			index: int

			for i in 0..<uint(ALPHABET_SIZE) {
				if t.array[i] != 0 {
					field = bits.bitfield_insert_u32(field, 1, i, 1)
					comp_push_ctrie(ctrie_get(t.array[i]), &data[index])
					index += 1
				}
			}
		}

		// only set once
		alphabet_bits^ = field
	}
}

Clever For Loop

Jeroen showed me a neat way to traverse the alphabet bitfield easier and shortcut once the bitfield is empty (early).

Original:

for i in 0..<u32(ALPHABET_SIZE) {
    mask := u32(1 << i)
    
    if alphabet_bits & mask == mask {
        fmt.eprintf("found bit %v = %v\n", i, rune(i + 'a'))
    }
}

Idea - shift the copied bits and test the first bit:

for i in 0..<u32(ALPHABET_SIZE) {
    if alphabet_bits & 1 == 1 {
        fmt.eprintf("found bit %v = %v\n", i, rune(i + 'a'))
    }
    alphabet_bits >>= 1
}

Valid Usage - with index - ends early due to bits > 0:

bits := alphabet_bits
for i := 0; bits > 0; i += 1 {
    if bits & 1 == 1 {
        fmt.eprintf("found bit %v = %v\n", i, rune(i + 'a'))
    }
    bits >>= 1
}

We check for the SHORTCUT_VALUE early but you could also clear the upper bits like this:

ALPHABET_MASK :: u32(1 << 26) - 1
bits := alphabet_bits & ALPHABET_MASK
for i := 0; bits > 0; i += 1 {
    if bits & 1 == 1 {
        fmt.eprintf("found bit %v = %v\n", i, rune(i + 'a'))
    }
    bits >>= 1
}

Update 1 - Source Code

SIZE in 8262660B 8069KB 7MB for CTrie
SIZE in 346683B 338KB 0MB for compressed

-> https://pastebin.com/rbj1weKq

-> ebook print output ebook.txt