55

I'm working with a number of data lists that are keyed by document name. The document names, while very descriptive, are quite cumbersome if I need to view them on (up to 256 bytes is a lot of real estate) and I'd love to be able to create a smaller keyfield that's readily reproducible in case I need to do a VLOOKUP from another workseet or workbook.

I'm thinking a hash from the title that'd be unique and reproducible for each title would be most appropriate. Is there a function available, or am I looking at developing my own algorithm?

Any thoughts or ideas on this or another strategy?

Andrea
  • 1,536
dwwilson66
  • 1,849

11 Answers11

46

You don't need to write your own function - others already did that for you.
For example I collected and compared five VBA hash functions on this stackoverflow answer

Personally I use this VBA function

  • its called with =BASE64SHA1(A1) in Excel after you copied the macro to a VBA module
  • requires .NET since it uses the library "Microsoft MSXML" (with late binding)

Public Function BASE64SHA1(ByVal sTextToHash As String)
Dim asc As Object
Dim enc As Object
Dim TextToHash() As Byte
Dim SharedSecretKey() As Byte
Dim bytes() As Byte
Const cutoff As Integer = 5

Set asc = CreateObject("System.Text.UTF8Encoding")
Set enc = CreateObject("System.Security.Cryptography.HMACSHA1")

TextToHash = asc.GetBytes_4(sTextToHash)
SharedSecretKey = asc.GetBytes_4(sTextToHash)
enc.Key = SharedSecretKey

bytes = enc.ComputeHash_2((TextToHash))
BASE64SHA1 = EncodeBase64(bytes)
BASE64SHA1 = Left(BASE64SHA1, cutoff)

Set asc = Nothing
Set enc = Nothing

End Function

Private Function EncodeBase64(ByRef arrData() As Byte) As String

Dim objXML As Object
Dim objNode As Object

Set objXML = CreateObject("MSXML2.DOMDocument")
Set objNode = objXML.createElement("b64")

objNode.DataType = "bin.base64"
objNode.nodeTypedValue = arrData
EncodeBase64 = objNode.text

Set objNode = Nothing
Set objXML = Nothing

End Function

Customizing the hash length

  • the hash initially is a 28 characters long unicode string (case sensitive + special chars)
  • You customize the hash length with this line: Const cutoff As Integer = 5
  • 4 digits hash = 36 collisions in 6895 lines = 0.5 % collision rate
  • 5 digits hash = 0 collisions in 6895 lines = 0 % collision rate

There are also hash functions (all three CRC16 functions) which doesn't require .NET and doesn't use external libraries. But the hash is longer and produces more collisions.

You could also just download this example workbook and play around with all 5 hash implementations. As you see there is a good comparison on the first sheet

phuclv
  • 30,396
  • 15
  • 136
  • 260
nixda
  • 27,634
18

I don't care very much about collisions, but needed a weak pseudorandomizer of rows based on a variable-length string field. Here's one insane solution that worked well:

=MOD(MOD(MOD(MOD(MOD(IF(LEN(Z2)>=1,CODE(MID(Z2,1,1))+10,31),1009)*IF(LEN(Z2)>=3,CODE(MID(Z2,3,1))+10,41),1009)*IF(LEN(Z2)>=5,CODE(MID(Z2,5,1))+10,59),1009)*IF(LEN(Z2)>=7,CODE(MID(Z2,7,1))+10,26),1009)*IF(LEN(Z2)>=9,CODE(MID(Z2,9,1))+10,53),1009)

Where Z2 is the cell containing the string you want to hash.

"MOD"s are there to prevent overflowing to scientific notation. 1009 is a prime, could use anything X so that X*255 < max_int_size. 10 is arbitrary; use anything. "Else" values are arbitrary (digits of pi here!); use anything. Location of characters (1,3,5,7,9) are arbitrary; use anything.

Sled
  • 412
12

Here's FNV-1a in 32-bits as a single excel formula, where cell A1 has the string you want to hash:

=LET(
  Y, LAMBDA(G,a,b,m,res,
    IF(0<b,
      G(
        G,
        MOD(a, m) * 2,
        TRUNC(b / 2),
        m,
        IF(MOD(b, 2) = 1,
          MOD(res + MOD(a, m), m),
          res)),
      res)),
  mulmod, LAMBDA(a,b,m, Y(Y,a,b,m,0)),
  p, 16777619    +N("FNV_prime for 32 bits"),
  o, 2166136261  +N("FNV_offset_basis for 32 bits"),
  m, POWER(2,32) +N("modulus for 32 bits"),
  s, A1,
  IF(ISBLANK(s),
    0,
    REDUCE(o,
      ROW(INDIRECT("1:"&LEN(s))),
      LAMBDA(acc,i,mulmod(p, BITXOR(acc, CODE(MID(s,i,1))), m)))))

I protect against arithmetic overflow in the recursive mulmod. a * 2 won't overflow because m == MAX_UNIT == 4294967295 while excel uses doubles for math and they're safe up to 9,007,199,254,740,993

Wrap the REDUCE(...) term in DEC2HEX() for typical hex output.

Examples:

"BAD" => 2775452120 or A56E09D8 (in hex) "DAB" => 772135228 or 2E05D93C "ab" => 1294271946 or 4D2505CA "AB" => 752165258 or 2CD5218A "" => 0 "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum." => 2080740573 or 7C0594DD

9

For a reasonably small list you can create a scrambler (poor man's hash function) using built-in Excel functions.

E.g.

 =CODE(A2)*LEN(A2) + CODE(MID(A2,$A$1,$B$1))*LEN(MID(A2,$A$1,$B$1))

Here A1 and B1 hold a random start letter and string length.

A little fiddling and checking and in most cases you can get a workable unique ID quite quickly.

How it Works: The formula uses the first letter of the string and a fixed letter taken from mid-string and uses LEN() as a 'fanning function' to reduce the chance of collisions.

CAVEAT: this is not a hash, but when you need to get something done quickly, and can inspect the results to see that there are no collisions, it works quite well.

Edit: If your strings should have variable lengths (e.g. full names) but are pulled from a database record with fixed width fields, you will want to do it like this:

 =CODE(TRIM(C8))*LEN(TRIM(C8))
       +CODE(MID(TRIM(C8),$A$1,1))*LEN(MID(TRIM(C8),$A$1,$B$1))

so that the lengths are a meaningful scrambler.

Assad Ebrahim
  • 2,110
  • 2
  • 23
  • 30
5

In recent versions of Excel (March 2022 and later), the new array formulas make it possible to create hash functions without VBA.

Here is the formula for Bernstein's djb2 hash function (see e.g. http://www.cse.yorku.ca/~oz/hash.html):

hash_djb2 = LAMBDA(v,
    MAP(
        v,
        LAMBDA(x,
            LET(
                y, VALUETOTEXT(x, 0),
                l, LEN(y),
                REDUCE(
                    5381,
                    SEQUENCE(l),
                    LAMBDA(a, j,
                        LET(
                            z, CODE(MID(y, j, 1)),
                            MOD(a * 33 + z, 2 ^ 32)
                        )
                    )
                )
            )
        )
    )
);

The output is an integer smaller than 2^32 (~4e9). It can be further shortened to 8 characters using DEC2HEX, or to 6 characters with a Base64 implementation.

AndreA
  • 183
3

I am using this which gives pretty good results with preventing clashing without needing to run a script each time. I needed a value between 0 - 1.

=ABS(COS((CODE(MID(A2,ROUNDUP(LEN(A2)/9,0),1))*(CODE(MID(A2,ROUNDUP(LEN(A2)/5,0),1))+100)/CODE(MID(A2,ROUNDUP(LEN(A2)/3,0),1))*(CODE(MID(A2,ROUNDUP(LEN(A2)*8/9,0),1))+25)/CODE(MID(A2,ROUNDUP(LEN(A2)*6/9,0),1))*(CODE(MID(A2,ROUNDUP(LEN(A2)*4/9,0),1))-25))/LEN(A2)+CODE(A2)))

It picks letters from across the string, takes the value of each of those letters, adds a value (to prevent same letters in different places giving same results), multiplies/divides each and runs a COS function over the total.

2

You can try this. Run a Pseudo# on two columns:

=+IF(AND(ISBLANK(D3),ISBLANK(E3)),"",CODE(TRIM(D3&E3))*LEN(TRIM(D3&E3))+CODE(MID(TRIM(D3&E3),$A$1*LEN(D3&E3),1))INT(LEN(TRIM(D3&E3))$B$1))

Where A1 and B1 store random seeds entered manually: 0

Mogget
  • 1,353
1

I've implemented a portable version of the FNV1a 32 bit hash function for Microsoft Excel/LibreOffice Calc/Google Spreadsheets in two flavors: formula-based and macro-based (this one doesn't work for Google sheets).

Take a look here if you like: https://github.com/leogama/spreadsheet-hash

leogama
  • 623
1

Here is a function that simply sums up the code() values of all characters in Cell E2:

=IFERROR(SUM(1*CODE(MID(E2;ROW(OFFSET(A1;0;0;LEN(E2);1));1)));0)

IT is an array function so be sure to press CTRL-shift-Enter after entering it. I guess form there you can go and make it to have fewer collisions.

Exguest
  • 11
1

Here is an enhancement on @AndreA's answer:

  • Return correct value for empty string (should return offset basis, not zero)
  • Do not depend on the machine's encoding settings. Rather it hashes the utf32-BE encoding of the string.
  • Get rid of INDIRECT and use SEQUENCE instead

Tip: Define a named reference to use it as a function!

=LAMBDA(s,
  LET(
    Y,
      LAMBDA(G,a,b,m,res,
        IF(b > 0,
          G(G, MOD(a, m)*2, TRUNC(b/2), m, IF(MOD(b, 2) = 1, MOD(res + MOD(a, m), m), res)),
          res
        )
      ),
    mulmod,
      LAMBDA(a, b, m, Y(Y, a, b, m, 0)),
    utf32be_byte,
      LAMBDA(s,i,
        BITAND(
          BITRSHIFT(
            UNICODE(MID(s, TRUNC(i/4) + 1, 1)),
            8*(3 - MOD(i, 4))
          ),
          255
        )
      ),
    p,
      16777619    +N("FNV_prime for 32 bits"),
    o,
      2166136261  +N("FNV_offset_basis for 32 bits"),
    m,
      POWER(2,32) +N("modulus for 32 bits"),
IF(s=&quot;&quot;,
  o,
  REDUCE(
    o,
    SEQUENCE(4*LEN(s), 1, 0, 1),
    LAMBDA(acc, i, mulmod(p, BITXOR(acc, utf32be_byte(s, i)), m))
  )
)

) )

0

To my knowledge there is no hash function build into Excel - you'd need to build one as a User Defined Function in VBA.

However, please note that for your purpose I don't think using a hash is required or really advantageous! VLOOKUP will work just as well on 256 bytes as it'll be on a smaller hash. Sure, it might be a tiny bit slower - bit that is for sure so small that it is immeasurable. And then adding the hash values is more effort for you - and for Excel...

Peter Albert
  • 3,032