
I am working on fraud detection on blockchains. Specifically, I fetched a large number of blockchain transactions, labeled them as spam or non-spam using an appropriate API, and now want to train a model (SVM, etc.) to detect fraud.

My question is about data preparation. The fields I have are: hash, nonce, transaction_index, from_address, to_address, ...

The fields "from/to_address" are hexadecimal fields like 0x5e14d30d2155c0cdd65044d7e0f296373f3e92f65ebd

How should I format this data? Should I delete these fields? (I do not think so, since they are very relevant to the problem at hand.) I can't find an appropriate encoding either.

Namrouch

1 Answer


It is fine to keep the "from/to_address" fields in the model. It would be useful to choose an algorithm that learns how much weight to give each feature.

In their current hexadecimal format, the addresses are strings, which most machine learning algorithms cannot use directly. Feature hashing can encode them into numerical values that most machine learning algorithms accept.
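
As a rough illustration of feature hashing, here is a minimal sketch using only the Python standard library. The bucket count (`N_BUCKETS = 32`), the helper names `hash_address` and `encode_transaction`, the choice of MD5 as the hash, and the second address in the usage lines are all assumptions made for the example, not part of the question. In practice a much larger bucket count would likely be needed to limit collisions, and a library implementation such as scikit-learn's `FeatureHasher` may be more convenient.

```python
import hashlib

N_BUCKETS = 32  # assumed vector size; real data likely needs far more buckets


def hash_address(address, prefix, n_buckets=N_BUCKETS):
    """Map an address string to a (bucket index, sign) pair with a stable hash."""
    # Prefixing with the field name keeps the same address distinct
    # when it appears as sender versus receiver.
    token = f"{prefix}={address.lower()}".encode("utf-8")
    digest = hashlib.md5(token).digest()
    index = int.from_bytes(digest[:8], "big") % n_buckets
    # A signed update lets colliding entries partially cancel out.
    sign = 1 if digest[8] & 1 else -1
    return index, sign


def encode_transaction(from_address, to_address, n_buckets=N_BUCKETS):
    """Return a fixed-length numeric vector for one transaction's addresses."""
    vector = [0.0] * n_buckets
    for prefix, address in (("from", from_address), ("to", to_address)):
        index, sign = hash_address(address, prefix, n_buckets)
        vector[index] += sign
    return vector


# Usage: the first address is the one from the question; the second is made up.
features = encode_transaction(
    "0x5e14d30d2155c0cdd65044d7e0f296373f3e92f65ebd",
    "0x00000000000000000000000000000000deadbeef",
)
print(len(features))
```

The resulting vector has a fixed length regardless of how many distinct addresses exist, so it can be concatenated with the other numeric fields before training. The trade-off is that unrelated addresses can land in the same bucket.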

Brian Spiering