Should hexadecimal addresses of a dataset be cleaned?

Question

I am working on fraud detection on blockchains. More specifically, I fetched a large number of transactions that took place on the blockchain, labeled them as spam / non-spam using an appropriate API, and now I want to train a model (SVM, etc.) to detect fraud.

My question is about the preparation of the data. The fields I have are: hash, nonce, transaction_index, from_address, to_address, ...

The fields "from/to_address" are hexadecimal fields like 0x5e14d30d2155c0cdd65044d7e0f296373f3e92f65ebd

My question is: how should I format this data? Should I delete this field? (I do not think so, since it is very relevant to the problem at hand.) I can't find an appropriate encoding either.

score 0 · Answer 1 · answered Apr 24 '22 at 21:00

It is fine to leave the "from/to_address" in the model. It would be useful to choose an algorithm that learns to weight the feature appropriately.

In its current hexadecimal form, the address would be treated as a string by most machine learning libraries. It might be useful to use feature hashing to encode it into numerical values that most machine learning algorithms can work with.
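As a rough illustration of the feature hashing idea, here is a minimal sketch using only the Python standard library. The bucket count, the helper names `hash_bucket` and `hash_transaction`, and the `to_address` value are assumptions made for the example, not part of the question. It is an unsigned variant of the hashing trick; libraries such as scikit-learn ship a ready-made `FeatureHasher` that would normally be used instead.

```python
import hashlib

N_BUCKETS = 1024  # assumed size of the hashed feature space; tune for your data


def hash_bucket(field: str, address: str, n_buckets: int = N_BUCKETS) -> int:
    """Map a (field, address) pair to a stable column index in [0, n_buckets)."""
    # Lowercase the address so that differently cased spellings of the
    # same hex string land in the same bucket.
    token = f"{field}={address.lower()}".encode("utf-8")
    digest = hashlib.sha256(token).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_transaction(tx: dict, n_buckets: int = N_BUCKETS) -> dict:
    """Return a sparse row {column_index: value} for the two address fields."""
    row = {}
    for field in ("from_address", "to_address"):
        idx = hash_bucket(field, tx[field], n_buckets)
        # Colliding features simply add up in the same column.
        row[idx] = row.get(idx, 0.0) + 1.0
    return row


# Example transaction: from_address is the value quoted in the question,
# to_address is a made-up placeholder.
tx = {
    "from_address": "0x5e14d30d2155c0cdd65044d7e0f296373f3e92f65ebd",
    "to_address": "0x00000000000000000000000000000000000000ab",
}
row = hash_transaction(tx)
```

Each transaction becomes a sparse row with at most two non-zero columns, which can be stacked into a sparse matrix and passed to an SVM or a linear model. Different addresses may share a bucket, so a larger `N_BUCKETS` trades memory for fewer collisions.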
