
Given a two-dimensional data set where each point is labeled $0$ or $1$, I want to implement a sparse classifier with an $L_p$ penalty, $0 < p \leq 1$.

I have been reading about logistic regression and regularization. Here is a concrete example of what I have been working on: let $\left((x^{(i)},y^{(i)})\right)_{i\in \{1,\dots, m\}}$ be my data set with $y^{(i)}\in \{0,1\}$ and $x^{(i)}\in \mathbb{R}^2$. The cost function I minimized is

$$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\, \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\, \log (1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2,$$

where $h_\theta(x) = \frac{1}{1+e^{-\theta^{T}x}}$. I thought that this would be a good introduction to sparsity.
Currently I use a neural network, and I was wondering whether I am heading in the right direction in understanding sparse methods.
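
For concreteness, here is a minimal NumPy sketch of the cost above (assuming, as is conventional, that the intercept $\theta_0$ is excluded from the penalty, which matches the sum starting at $j=1$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    # X: (m, n) design matrix with a leading column of ones,
    # y: (m,) labels in {0, 1}, theta: (n,) parameter vector.
    m = X.shape[0]
    h = sigmoid(X @ theta)
    log_loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    reg = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 not penalized
    return log_loss + reg
```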

That leaves me with the question:

What is the definition of sparse classifiers? What would be an example?

A.Dumas

2 Answers


I'll provide an example model using linear regression; however, the idea translates to classification in a straightforward manner.

Sparse linear regression is used when we have a model $y = X\beta + \epsilon$, where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, $\beta \in \mathbb{R}^p$ and $\epsilon \in \mathbb{R}^n$, with $n \ll p$. If we expect only a small number of columns in $X$ to actually contribute to $y$, then we can impose a penalty on $\beta$ such that "non-important" columns $X_i$ have their corresponding $\beta_i = 0$. We can formally write this as $$\arg \min_{\beta} \|y - X\beta \|^2_2 + \lambda\|\beta\|_0$$ where $\|\cdot\|_0$ is the $\ell_0$ "norm" that counts the number of non-zero entries. Unfortunately, fitting this model exactly is computationally intractable. We can approximate this objective by using the $\ell_1$ norm instead; that is, we solve $$\arg \min_{\beta} \|y - X\beta \|^2_2 + \lambda\|\beta\|_1.$$ This model is known as LASSO in the context of linear regression and can be fit by a variety of methods in relatively little time.
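
For example, here is a minimal sketch of fitting a LASSO model with scikit-learn on synthetic data (the `alpha` parameter plays the role of $\lambda$; the specific values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                                 # n << p regime
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]    # only 5 columns contribute to y
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1)                       # L1 penalty drives most beta_i to 0
lasso.fit(X, y)
print("non-zero coefficients:", np.count_nonzero(lasso.coef_))
```

Increasing `alpha` produces a sparser fit; decreasing it recovers ordinary least squares in the limit.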

All of this hinges on $n \ll p$. If your data are such that $n \approx p$ or $n > p$, I'm not sure sparsity will help much, since you should have enough data to guide inference toward the true $\beta$ values (provided the other assumptions hold: homoskedasticity, independence, etc.). The key takeaway is that you have a large number of predictors and you suspect only a small number of them actually characterize $y$.

Nicholas Mancuso

Sparsity means that only a small subset of the input variables influences the classification, so a sparse classifier's job is to find that small subset. An example would be an $L_1$-norm-based SVM.
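
For instance, a minimal sketch of an $L_1$-penalized linear SVM using scikit-learn's `LinearSVC` (synthetic data; `C` controls the regularization strength):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)    # only features 0 and 1 matter

# penalty='l1' drives most coefficients to exactly zero;
# dual=False is required when using the L1 penalty.
svm = LinearSVC(penalty='l1', dual=False, C=0.1)
svm.fit(X, y)
print("selected features:", np.flatnonzero(svm.coef_))
```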

Mladja