I'm considering a support vector regression model with prediction $$ \hat{y}(\mathbf{x}_\star)=\boldsymbol{\theta}^{\top} \boldsymbol{\phi}(\mathbf{x}_\star),$$ where $\boldsymbol{\theta}$ is the vector of coefficients to learn and $\boldsymbol{\phi}(\mathbf{x}_\star)$ is a transformation of the input $\mathbf{x}_\star$. The optimisation problem is $$ \widehat{\boldsymbol{\theta}}=\arg \min _{\boldsymbol{\theta}} \frac{1}{n} \sum_{i=1}^n \max \{0,|y_i-\underbrace{\boldsymbol{\theta}^{\top} \boldsymbol{\phi}\left(\mathbf{x}_i\right)}_{\hat{y}\left(\mathbf{x}_i\right)}|-\epsilon\}+\lambda\|\boldsymbol{\theta}\|_2^2, $$ where the error term is the $\epsilon$-insensitive loss and there is $\ell_2$ regularisation. The parameter $\epsilon$ defines an $\epsilon$-tube around the prediction within which errors incur no penalty.
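To make the objective concrete, here is a minimal NumPy sketch of evaluating this primal cost; the function name, argument names, and default values are my own illustrative choices, not from any library.

```python
import numpy as np

def svr_primal_objective(theta, Phi, y, eps=0.1, lam=1.0):
    """Primal SVR objective: mean epsilon-insensitive loss + L2 penalty.

    Phi is the n x d design matrix whose rows are phi(x_i);
    eps is the tube half-width, lam the regularisation strength.
    """
    residuals = y - Phi @ theta                      # y_i - theta^T phi(x_i)
    loss = np.maximum(0.0, np.abs(residuals) - eps)  # zero inside the eps-tube
    return loss.mean() + lam * np.dot(theta, theta)  # + lambda * ||theta||_2^2
```

Note that points lying inside the tube contribute exactly zero to the loss, which is what makes the solution sparse in the dual.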
Deriving the dual problem via Lagrange multipliers, we find that the solution is given by $$\hat{y}\left(\mathbf{x}_{\star}\right)=\hat{\boldsymbol{\alpha}}^{\top} \underbrace{\Phi(\mathbf{X}) \boldsymbol{\phi}\left(\mathbf{x}_{\star}\right)}_{K\left(\mathbf{X}, \mathbf{x}_{\star}\right)},$$ where $\hat{\boldsymbol{\alpha}}$ is the solution to the optimisation problem $$\hat{\boldsymbol{\alpha}}=\arg \min _{\boldsymbol{\alpha}} \frac{1}{2} \boldsymbol{\alpha}^{\top} \boldsymbol{K}(\mathbf{X}, \mathbf{X}) \boldsymbol{\alpha}-\boldsymbol{\alpha}^{\top} \mathbf{y}+\epsilon\|\boldsymbol{\alpha}\|_1$$ subject to $$ \left|\alpha_i\right| \leq \frac{1}{2 n \lambda}. $$ Here $\boldsymbol{K}(\mathbf{X}, \mathbf{X})$ is the Gram matrix, $$\boldsymbol{K}(\mathbf{X}, \mathbf{X})=\begin{bmatrix} \kappa\left(\mathbf{x}_1, \mathbf{x}_1\right) & \kappa\left(\mathbf{x}_1, \mathbf{x}_2\right) & \ldots & \kappa\left(\mathbf{x}_1, \mathbf{x}_n\right) \\ \kappa\left(\mathbf{x}_2, \mathbf{x}_1\right) & \kappa\left(\mathbf{x}_2, \mathbf{x}_2\right) & \ldots & \kappa\left(\mathbf{x}_2, \mathbf{x}_n\right) \\ \vdots & & \ddots & \vdots \\ \kappa\left(\mathbf{x}_n, \mathbf{x}_1\right) & \kappa\left(\mathbf{x}_n, \mathbf{x}_2\right) & \ldots & \kappa\left(\mathbf{x}_n, \mathbf{x}_n\right) \end{bmatrix},$$ where the kernel is $$ \kappa\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\boldsymbol{\phi}(\mathbf{x})^{\top} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right).$$ In support vector classification the $\epsilon\|\boldsymbol{\alpha}\|_1$ term is absent and the optimisation is a quadratic program. Can the optimisation above be put into quadratic form? Intuitively, I feel it can't. So what numerical algorithms exist to solve problems like this?
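For reference, here is a small NumPy sketch of the dual quantities above: building the Gram matrix from a kernel function, evaluating the dual objective for a given $\boldsymbol{\alpha}$, and forming the kernel prediction. All function and variable names are illustrative; this only evaluates the objective, it does not solve the constrained problem.

```python
import numpy as np

def gram_matrix(X, kappa):
    """Gram matrix K(X, X) with entries K_ij = kappa(x_i, x_j)."""
    n = X.shape[0]
    return np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])

def dual_objective(alpha, K, y, eps):
    """Dual cost: 1/2 a^T K a - a^T y + eps * ||a||_1."""
    return 0.5 * alpha @ K @ alpha - alpha @ y + eps * np.abs(alpha).sum()

def is_feasible(alpha, n, lam):
    """Box constraint |alpha_i| <= 1 / (2 n lambda)."""
    return np.all(np.abs(alpha) <= 1.0 / (2 * n * lam))

def predict(alpha, X, x_star, kappa):
    """Prediction alpha^T K(X, x_star) at a new input x_star."""
    k_star = np.array([kappa(x_i, x_star) for x_i in X])
    return alpha @ k_star
```

Plugging these into a generic solver (e.g. a projected-gradient loop over the box constraint) is straightforward, though the $\epsilon\|\boldsymbol{\alpha}\|_1$ term makes the objective non-smooth at $\alpha_i = 0$.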
