JavaScript using objects as keys

If you ever wondered how associative arrays work or way are they called hashes in Ruby? The answer is hash tables, and if you look into the wikipedia article http://en.wikipedia.org/wiki/Hash_table, you can understand how they work and how simple they are. I will try to be quick and concise in the explanation:

Introduction to Hash Tables

A hashing function is a function which purpose is to create an ‘unique’ identificator of any number of bytes. The resulting identificator, called hash, is only few bytes and for different sizes and contents of data it is always roughly the same size. But in the computing world nothing is infinite nor completely unique, so if a resulting hash is small in size, there is more of a chance that two completely different data, when passed through a hashing function have the same hash. That is called hash collision.

Lets say that now you want to store values under some string names and you want to be able to get the values by passing their string name. You will take two arrays and in one put string names, which are called keys, and in other values. So the array number ID of a key will be the same from the corresponding value. To find a correct value labeled under a key, you must go one by one through the array of keys and check if each is equal to the desired one. Now the problem arises, string comparison is high level and therefore slow, the characters are compared one by one in their numerical form. It would be very neat if they were numbers and finding them would be easy cause instead of for loop over for loop we would have only one. Here come the hashes.

You might be familiar with SHA1 and MD5 checksums, hash functions that are used to verify the correctness of the data or to hide passwords. They provide very large 32 character hexadecimal values and the values have a big dispersion, which means that the collisions are unlikely to occur often. They are one directional algorithms, suitable for hiding password but not prone to brute force… But we would need something smaller and simpler for the noted use, like Cyclic Redundancy Check http://en.wikipedia.org/wiki/Cyclic_redundancy_check, a 32 bit version of it. It is a simple, fast algorithm on the binary level, easy to implement, thus used in hardware signal correctness checks or file consistency checks. This algorithm will produce a  unique set of bits for each string passed that can be interpreted into a decimal integer.

If we hash all the keys  in the key array, and hash the key used for searching, we would find the equal key that holds correspondent value much faster. But CRC32 doesn’t create values dispersed as SHA1, so collisions are to occur more often. That means that we are never sure which two different values will have a hash collision and we have to keep arrays of keys with same hash labeled under that hash, these arrays are called buckets. Which means that when we find the keys labeled under that hash, we have to check them for equality the traditional way. With all these requirements completed we have one functional hash table.

Problem with JavaScript hash tables

Hash tables are implemented in most modern programming languages and the same goes for JavaScript. In JavaScript, keeping values under keys in an object uses hash table to accomplish that task. Everything you use as a key, which is not a string or a number will be converted into the string and used as a key. And now we have the problem which is the sole reason of this post. If you convert an object to string using the .toString() prototype you will get “[object Object]” string a s a result. Try, for instance, this for example:

a = {};
 b = {a:1, b:3};
 a[b] = 1;
 console.log(a); // {[object Object]:1}

That means you are not natively able to use object as a key, nor an array. And we really want that, and here is why:

Imagine you have a really really big array of really really big objects, and that you have to test some objects if they appear in that big array. Normally you would do a deep object comparison between an object you are searching with and each object in the array. That would take a lot of time, especially if object’s contain large strings.
And hashing each object in the large object array during the initialization, and hashing each object used for searching whenever you search would speed up seeking time. First the correct hash would be found, and that hash would lead us to the correct bucket, hopefully containing only one element. Whatever is the case of number of the elements we have to check if object used for search is equal to one of the objects in the bucket, cause of the hash collision. And that’s how we can reduce seeking time to hopefully only one deep object comparison and searching through an array of integers.

Solution

Now when you think about it you can use JSON.stringify() to convert object or an array to string and then use them as a key, but then the keys become too big. Also in JavaScript objects are unordered so {a:1, b:2} != {b:2, a:1}, which means that before doing json stringification you have to recursively alphabetically order the properties inside objects.

The suggested solution won’t exactly be the full Hash Table implementation, it only demonstrates it (we won’t use masking) and it will lean to native hash table which is most likely  fully optimized thus fast like hell. We will only reduce big data to hashes and then use these hashes as keys in native hash table implementation, to escape key limitation size, if there is one, (you can never be too sure) cause maybe the keys are held in active memory and to have a more elegant solution without data redundancy.
The sole purpose of it is to be a value-value map. You put some values inside, you check if these values exist in the list providing some external object/array/string as a seek query and you can also remove stored data. I’ve took the liberty of calling the class HashCache.

a = {'a':1,
 'b':{'c':[1,2,[3,45],4,5],
 'd':{'q':1, 'b':{q':1, 'b':8},'c':4},
 'u':'lol'},
 'e':2};

 b = {'a':1, 
 'b':{'c':[2,3,[1]],
 'd':{'q':3,'b':{'b':3}}},
 'e':2};

 c = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.";
var hc = new HashCache([{a:3, b:2, c:5}, {a:15, b:2, c:'foo'}]); //init
 hc.put({a:1, b:1});
 hc.put({b:1, a:1});
 hc.put(true);
 hc.put('true');
 hc.put(a);
 hc.put(c);
 hc.put(d);
 console.log(hc.exists('true'));
 console.log(hc.exists(a));
 console.log(hc.exists(c));
console.log(hc.exists({b:1, a:1}));
 hc.remove(a);
 console.log(hc.exists(c));

The provided solution is using a sligntly modified (for performance) JavaScript implementation of a CRC32 hashing algorithm written by Andrea Ercolino  and can be found here: http://noteslog.com/post/crc32-for-javascript/
Other external functions are orderProperties (self explanatory), and functions described in older posts: JavaScript deep object comparison and JavaScipt data type detection.

Test

The (not so detailed)  tests run on on Intel i5 m450 2.40GHz processor in Google Chrome browser 27.0.1453.93 on Linux 2.6.32-5-amd64. After running the tests on randomly generated JSON array of 10000 objects using  mockJSON generator on two approaches the results were as following (in milliseconds):

Objects used had this structure:

{ 
    "id" : 1490,
    "married" : true,
    "name" : "Larry Smith",
    "sons" : null,
    "daughters" : [ 
        { 
        "age" : 25,
        "name" : "Melissa"
        },
        { 
        "age" : 11,
        "name" : "Melissa"
        }
    ]
 }

For the first approach, that is just a native hash table:

 init: 249.97000000439584
 put: 0.18500001169741154
 seek: 0.5419999943114817
 seek_unexistent: 0.11500000255182385
 remove: 0.1360000460408628

For the second approach is HashCache:

 init: 322.40800000727177
 put: 0.1949999714270234
 seek: 0.3780000498518348
 seek_unexistent: 0.11099997209385037
 remove: 0.21299999207258224

Next test was with the  74000 generated objects, and now the difference is more obvious:

First approach:

 init: 1254.568999982439
 put: 0.22200000239536166
 seek: 0.7060000207275152
 seek_unexistent: 0.16300001880154014
 remove: 0.13800000306218863

Second approach:

 init: 1727.5210000225343
 put: 0.25099999038502574
 seek: 0.3349999897181988
 seek_unexistent: 0.11399999493733048
 remove: 0.2430000458844006

And for the fun of it, average linear search time of a randomized 74 000 object array is around 150ms using the deep object comparison.

What have we learned?

From the tests it is obvious that on a larger scale with HashCache we got slightly faster seek time, and remember that seek here hashes the object and then finds it by hash and then uses deep compare. And because of the same reason init time is longer.

But the native hash table does the job as expected. And when we have in mind that key values in JavaScript are not limited (except memory limitations),  and there is no reason you couldn’t and shouldn’t use it to achieve desirable results with object properties ordering and JSON stringification before putting the object value as a key.

We also learned that CRC32 is one badass fast motherfucker with a nice dispersion rate. I got only one collision in 74000 objects, and that made me very happy. I read somewhere that 32 bit CRC has collision probability 1:2^32, but roughly 1 in 100 000 is cool too.

Result

Well I ran the tests again several times, and I got the average seek time on 74000 objects for native hash map is 0.60 and for HashCache is roughly 0.30! That means 50% faster.

Here is my theory why HashCache is faster than the native hash map:
Native hash converts all keys to string no matter the type of data, and with the value it keeps not the original data in it’s original passed form but the string. So when it finds the data by hash, it compares two strings to see if it is the correct key (because of hash collisions you have to do that). String comparison is slow, that’s why HashCache has better seek time, because it preserves the original data, it is type aware and when it comes to objects or arrays it uses deep compare. You can store true and “true”, and they will be different keys.

The conclusion is that using integers as keys in native associative arrays (hash tables) in JavaScript is around 50% faster (for some reason? random access and no hashing?) than using stings!

Constructive feedback and criticism is always welcome! Please don’t constrain yourselves. :)

About these ads

One comment

  1. Interesting post, thanks for explaining understandably. Also in the 2nd to last paragraph stings should be strings.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 79 other followers

%d bloggers like this: