
I was assigned the task of analyzing our application's server logs, which contain exception logs, database logs, event logs, etc. I am new to machine learning; we use Spark with Elasticsearch and Spark's MLlib (or PredictionIO). An example of the desired result: based on the exception logs gathered, predict which user is most likely to cause the next exception and in which feature (plus a bunch of other things to track in order to optimize the application).

I have successfully ingested data from Elasticsearch into Spark, created DataFrames, and mapped the data I need. What I would like to know is how to approach the machine learning aspect of my implementation. I've been through articles and papers that talk about data preprocessing, training models, creating labels, and then generating predictions.

The questions I have are:

  • How do I approach transforming the existing log data into numerical vectors that can be used as training datasets?

  • What algorithms should I use to train on my dataset? (With the limited knowledge I've gathered over the past couple of days, I was thinking about implementing linear regression; please suggest which approach would be best.)

Just looking for suggestions on how to approach this problem.

Thank You.

elric

1 Answer


I don't think you necessarily need to convert the individual log entries into vectors for use in an algorithm. I would guess that what you are interested in is sequences of log entries: series of events, ordered in time, where each series makes up a 'case'. Here the relationship between the collected log entries is what matters.

If this is the case then you might want to consider using Process Mining techniques. These allow you to build models of your process (the use of your application) and determine patterns of process steps, along with errors and rework steps.
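As a rough illustration of the idea (not tied to any particular process mining tool), here is a minimal Python sketch that groups log entries into cases, orders each case by time, and counts which step directly follows which. The record layout (`case_id`, `timestamp`, `activity`), the function names, and the sample records are all made up for illustration; your own logs will need their own mapping, and a real process mining tool does considerably more than this.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (case_id, timestamp, activity).
# In practice these would come from your parsed log entries.
events = [
    ("session-1", 3, "error"),
    ("session-1", 1, "login"),
    ("session-1", 2, "search"),
    ("session-2", 1, "login"),
    ("session-2", 2, "search"),
    ("session-2", 3, "checkout"),
]


def build_traces(events):
    """Group events by case and order each case by timestamp."""
    cases = defaultdict(list)
    for case_id, timestamp, activity in events:
        cases[case_id].append((timestamp, activity))
    return {
        case_id: [activity for _, activity in sorted(entries)]
        for case_id, entries in cases.items()
    }


def directly_follows(traces):
    """Count how often one activity is immediately followed by another."""
    counts = Counter()
    for trace in traces.values():
        # Pair each activity with its successor within the same case.
        counts.update(zip(trace, trace[1:]))
    return counts


traces = build_traces(events)
dfg = directly_follows(traces)

for (source, target), count in sorted(dfg.items()):
    print(f"{source} -> {target}: {count}")
```

The resulting counts show which steps tend to precede an error, which is the kind of pattern a process model makes visible.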

There is a good introductory course on Coursera. There are also mature commercial packages, such as 'Disco', to help you with the analysis and visualisation.

Oliver