2) Let Sig C represent the unique signature of block C.
Enter “rolling hash signatures” (see Wikipedia for more detailed explanation).
Be warned, if you Google/Bing for rolling hash, you’ll probably get some rather different results to what you were expecting.
The size and more importantly the offset locations of all blocks stay consistent.
Determining that some of the blocks are unmodified is easy, and we simply upload the new version of the second block.
🙂 The way rolling hash signatures are generated is essential for this process to be quick enough to be practical.
There are 2 ways of generating the signature: Firstly, you can read N bytes from a file, perform some calculation on the array of bytes and end up with your signature. The other option (and this is the magic) is that if you have already generated a signature for bytes 0 to 3 (for example) you can simply generate the signature for bytes 1 to 4 (ie shifting the byte array by 1) by performing a simple calculation based off the old signature.
The calculations required to find the new offsets are huuuuuge, well potentially huge, well “quite large” would cover it.
For each block that exists in the Azure blob we need to search at every byte offset in the new file.
To put it simply, if the file is 100M in size, and we’re searching for a block that is 10M in size, then the number of comparisons required is (approx) 100 million – 10 million = 90 million.
For example: In the above diagram, we want to determine if block C (that already exists in the cloud) also exists in the updated version of the file. 1) Let Size C represent the size (in bytes) of block C.
What I hope is obvious is that a LOT of signature generation needs to happen as well as lots of comparisons.