4

I am working on a use case and am unsure of the best way to proceed. To analyze the behavior of users of a web-based music application, we retain every song each user has played since 2009. We store this information in flat files, one per day, each listing the songs played that day. Each file contains 50M lines, and we have 19M users. Our entire song catalogue consists of 35M tracks.

The format of these files is as follows:

id-user | country | id-artist | id-track

Question: I would like to represent each user by the songs he or she has played; this profile would be used by the production site. Does anyone have a suggestion for what would be the best way to process the whole chain?

DaL
user17241

2 Answers

3

The first question is: what do you want to see in the user profile?

  • Top-10 tracks and top-10 artists per user?
  • How many tracks/artists a user listens to in a day on average (perhaps over the last month)?

Maybe you also want some general information about the whole user base:

  • Which artist/track is the most popular among users from different countries (top-N of them)?
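As a rough illustration of the aggregates listed above, here is a minimal Python sketch that counts plays per user and per country from lines in the `id-user | country | id-artist | id-track` format. The function names (`parse_line`, `top_n`) and the sample records are made up for the example, and keeping every counter in memory is only realistic for a sample, not for the full history.

```python
from collections import Counter, defaultdict


def parse_line(line):
    """Split one 'id-user | country | id-artist | id-track' record into fields."""
    user, country, artist, track = (field.strip() for field in line.split("|"))
    return user, country, artist, track


def top_n(lines, n=10):
    """Return the top-n tracks per user and the top-n artists per country."""
    tracks_by_user = defaultdict(Counter)
    artists_by_country = defaultdict(Counter)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        user, country, artist, track = parse_line(line)
        tracks_by_user[user][track] += 1
        artists_by_country[country][artist] += 1
    user_top = {u: c.most_common(n) for u, c in tracks_by_user.items()}
    country_top = {k: c.most_common(n) for k, c in artists_by_country.items()}
    return user_top, country_top


# Hypothetical sample records, not real data.
sample = [
    "u1 | FR | a1 | t1",
    "u1 | FR | a1 | t1",
    "u1 | FR | a2 | t2",
    "u2 | US | a2 | t2",
]
user_top, country_top = top_n(sample, n=2)
```

In practice `lines` could be an open file handle for one daily file, so the file is streamed rather than loaded at once.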

The second point: you are storing millions of records and want to aggregate them. That is not a job for text files; use a database. Create a table with the columns id-user | country | id-artist | id-track. Then create additional tables holding the aggregates from the first point, update them regularly, and display them on the front end.
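A minimal sketch of that layout, using Python's built-in `sqlite3` module purely for illustration: the table and column names (`plays`, `user_track_counts`, `play_count`) and the sample rows are assumptions, and a production system at this volume would likely need a different database engine and an incremental refresh rather than rebuilding the aggregate table.

```python
import sqlite3

# In-memory database for the example; a real deployment would use a server or file.
conn = sqlite3.connect(":memory:")

# Raw table mirroring the flat-file format.
conn.execute(
    "CREATE TABLE plays (id_user TEXT, country TEXT, id_artist TEXT, id_track TEXT)"
)

# Hypothetical sample rows, not real data.
rows = [
    ("u1", "FR", "a1", "t1"),
    ("u1", "FR", "a1", "t1"),
    ("u1", "FR", "a2", "t2"),
    ("u2", "US", "a2", "t2"),
]
conn.executemany("INSERT INTO plays VALUES (?, ?, ?, ?)", rows)

# Aggregate table: one row per (user, track) with its play count.
conn.execute(
    """
    CREATE TABLE user_track_counts AS
    SELECT id_user, id_track, COUNT(*) AS play_count
    FROM plays
    GROUP BY id_user, id_track
    """
)
conn.execute("CREATE INDEX idx_user ON user_track_counts (id_user)")


def top_tracks(conn, user, n=10):
    """Read a user's top-n tracks from the pre-computed aggregate table."""
    cur = conn.execute(
        "SELECT id_track, play_count FROM user_track_counts "
        "WHERE id_user = ? ORDER BY play_count DESC, id_track LIMIT ?",
        (user, n),
    )
    return cur.fetchall()
```

The front end would then call something like `top_tracks(conn, "u1")` and never touch the raw `plays` table.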

IgorS
2

You can download QlikView, which is free as in beer. It lets you do interactive data discovery through a graphical interface similar to Excel, and it also features a powerful scripting language for data loading and transformation. Huge flat files are no problem at all. It is an in-memory technology, so you will need a computer with a lot of RAM. The advantage is that it can load billions of records into a star schema and still let you explore the data ad hoc, with responses in a second or less, without writing and rewriting SQL. I always use it to screen data and to run descriptive statistics and visual exploration on it. From a data science standpoint, it is a very advanced column-based data engine integrated with a large library of descriptive statistics functions and an interactive graphical UI. You will be surprised by what is possible with your data.

Diego