{"title": "Sparse Metric Learning via Smooth Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2214, "page_last": 2222, "abstract": "In this paper we study the problem of learning a low-dimensional (sparse) distance matrix. We propose a novel metric learning model which can simultaneously conduct dimension reduction and learn a distance matrix. The sparse representation involves a mixed-norm regularization which is non-convex. We then show that it can be equivalently formulated as a convex saddle (min-max) problem. From this saddle representation, we develop an efficient smooth optimization approach for sparse metric learning although the learning model is based on a non-differential loss function. This smooth optimization approach has an optimal convergence rate of $O(1 /\\ell^2)$ for smooth problems where $\\ell$ is the iteration number. Finally, we run experiments to validate the effectiveness and efficiency of our sparse metric learning model on various datasets.", "full_text": "Sparse Metric Learning via Smooth Optimization\n\nYiming Ying\u2020, Kaizhu Huang\u2021, and Colin Campbell\u2020\n\n\u2020Department of Engineering Mathematics, University of Bristol,\n\nThe Chinese Academy of Sciences, 100190 Beijing, China\n\n\u2021National Laboratory of Pattern Recognition, Institute of Automation,\n\nBristol BS8 1TR, United Kingdom\n\nAbstract\n\nIn this paper we study the problem of learning a low-rank (sparse) distance ma-\ntrix. We propose a novel metric learning model which can simultaneously con-\nduct dimension reduction and learn a distance matrix. The sparse representation\ninvolves a mixed-norm regularization which is non-convex. We then show that\nit can be equivalently formulated as a convex saddle (min-max) problem. 
From this saddle representation, we develop an efficient smooth optimization approach [17] for sparse metric learning, although the learning model is based on a non-differentiable loss function. Finally, we run experiments to validate the effectiveness and efficiency of our sparse metric learning model on various datasets.

1 Introduction

For many machine learning algorithms, the choice of a distance metric has a direct impact on their success, so choosing a good distance metric remains a challenging problem. There has been much work attempting to exploit a distance metric in many learning settings, e.g. [8, 9, 10, 12, 20, 22, 23, 25]. These methods have shown that a good distance metric can significantly improve the performance of, for example, k-nearest neighbor classification and k-means clustering. A good choice of distance metric generally preserves the distance structure of the data: in the transformed space, the distance between examples exhibiting similarity should be relatively smaller than between examples exhibiting dissimilarity. For supervised classification, the label information indicates whether a pair of samples is in the same class (similar) or in different classes (dissimilar). In semi-supervised clustering, the side information conveys whether a pair of samples are similar or dissimilar to each other. Since it is very common that the presented data is contaminated by noise, especially for high-dimensional datasets, a good distance metric should also be minimally influenced by noise. In this case, a low-rank distance matrix would produce better generalization performance than its non-sparse counterparts and provide a much faster and more efficient distance calculation for test samples.
Hence, a good distance metric should also pursue dimension reduction during the learning process.

In this paper we present a novel approach to learn a low-rank (sparse) distance matrix. We first propose in Section 2 a novel metric learning model for estimating the linear transformation (equivalently, the distance matrix) that combines and retains the advantages of existing methods [8, 9, 12, 20, 22, 23, 25]. Our method can simultaneously conduct dimension reduction and learn a low-rank distance matrix. The sparse representation is realized by a mixed-norm regularization used in various learning settings [1, 18, 21]. We then show that this non-convex mixed-norm regularization framework is equivalent to a convex saddle (min-max) problem. Based on this equivalent representation, we develop, in Section 3, Nesterov's smooth optimization approach [16, 17] for sparse metric learning using smoothing approximation techniques, although the learning model is based on a non-differentiable loss function. In Section 4, we demonstrate the effectiveness and efficiency of our sparse metric learning model with experiments on various datasets.

2 Sparse Distance Matrix Learning Model

We begin by introducing necessary notation. Let N_n = {1, 2, . . . , n} for any n ∈ N. The space of symmetric d × d matrices is denoted by S^d. If S ∈ S^d is positive semi-definite, we write it as S ⪰ 0. The cone of positive semi-definite matrices is denoted by S^d_+, and O^d denotes the set of d × d orthonormal matrices. For any X, Y ∈ R^{d×q}, ⟨X, Y⟩ := Tr(X^⊤Y), where Tr(·) denotes the trace of a matrix. The standard Euclidean norm is denoted by ‖·‖.
Denote by z := {(x_i, y_i) : i ∈ N_n} a training set of n labeled examples with inputs x_i = (x_i^1, . . . , x_i^d) ∈ R^d and class labels y_i (not necessarily binary), and let x_ij = x_i − x_j.

Let P = (P_ℓk)_{ℓ,k∈N_d} ∈ R^{d×d} be a transformation matrix. Denote by x̂_i = P x_i, for any i ∈ N_n, the transformed examples and by x̂ = {x̂_i : i ∈ N_n} the transformed data. The linear transformation matrix P induces a distance matrix M = P^⊤P which defines a distance between x_i and x_j given by

    d_M(x_i, x_j) = (x_i − x_j)^⊤ M (x_i − x_j).

Our sparse metric learning model is based on two principal hypotheses: 1) a good choice of distance matrix M should preserve the distance structure, i.e. the distance between similar examples should be relatively smaller than between dissimilar examples; 2) a good distance matrix should also be able to effectively remove noise, leading to dimension reduction.

For the first hypothesis, the distance structure in the transformed space can be specified, for example, by the following constraints: ‖P(x_j − x_k)‖² ≥ ‖P(x_i − x_j)‖² + 1, ∀(x_i, x_j) ∈ S and (x_j, x_k) ∈ D, where S denotes the similarity pairs and D denotes the dissimilarity pairs based on the label information. Equivalently,

    ‖x̂_j − x̂_k‖² ≥ ‖x̂_i − x̂_j‖² + 1, ∀(x_i, x_j) ∈ S and (x_j, x_k) ∈ D.    (1)

For the second hypothesis, we use a sparse regularization to give a sparse solution. This regularization ranges from element-sparsity for variable selection to a low-rank matrix for dimension reduction [1, 2, 3, 13, 21]. In particular, for any ℓ ∈ N_d, denote the ℓ-th row vector of P by P_ℓ and ‖P_ℓ‖ = (Σ_{k∈N_d} P_ℓk²)^{1/2}.
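The distance d_M and the triplet constraint (1) are straightforward to express in code. The following is a minimal sketch with made-up toy points (not the training procedure of this paper, which is derived in the remainder of this section):

```python
import numpy as np

def mahalanobis_sq(M, xi, xj):
    """Squared distance d_M(xi, xj) = (xi - xj)^T M (xi - xj)."""
    d = xi - xj
    return float(d @ M @ d)

def triplet_hinge(M, xi, xj, xk):
    """Hinge penalty (1 + d_M(xi, xj) - d_M(xk, xj))_+ for a triplet with
    (xi, xj) similar and (xj, xk) dissimilar, as in constraint (1)."""
    return max(0.0, 1.0 + mahalanobis_sq(M, xi, xj) - mahalanobis_sq(M, xk, xj))

# Toy example: with M = I, d_M is the squared Euclidean distance.
xi, xj, xk = np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([3.0, 0.0])
M = np.eye(2)
assert abs(mahalanobis_sq(M, xi, xj) - 0.01) < 1e-12
# The dissimilar point xk is more than a unit margin away, so no penalty:
assert triplet_hinge(M, xi, xj, xk) == 0.0
```

A violated constraint contributes a positive hinge value, which is exactly the slack ξ_τ that appears in the optimization problems below.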
If ‖P_ℓ‖ = 0 then the ℓ-th variable in the transformed space becomes zero, i.e. x̂_i^ℓ = P_ℓ x_i = 0, which means that ‖P_ℓ‖ = 0 has the effect of eliminating the ℓ-th variable. Motivated by this observation, a direct way would be to enforce an L1-norm across the vector (‖P_1‖, . . . , ‖P_d‖), i.e. Σ_{ℓ∈N_d} ‖P_ℓ‖. This L1-regularization yields row-vector (feature) sparsity of x̂, which plays the role of feature selection. Let W = P^⊤P = (W_1, . . . , W_d); we can easily show that

    W_ℓ ≡ 0 ⟺ P_ℓ ≡ 0.

Motivated by this observation, instead of L1-regularization over the vector (‖P_1‖, . . . , ‖P_d‖) we can enforce L1-norm regularization across the vector (‖W_1‖, . . . , ‖W_d‖). However, a low-dimensional projected space x̂ does not mean that its row-vectors (features) should be sparse. Ideally, we expect that the principal components of x̂ can be sparse. Hence, we introduce an extra orthonormal transformation U ∈ O^d and let x̂_i = P U x_i. Denote a set of triplets T by

    T = {τ = (i, j, k) : i, j, k ∈ N_n, (x_i, x_j) ∈ S and (x_j, x_k) ∈ D}.    (2)

By introducing slack variables ξ in constraints (1), we propose the following sparse (low-rank) distance matrix learning formulation:

    min_{U∈O^d} min_{W∈S^d_+}  Σ_τ ξ_τ + γ‖W‖²_(2,1)
    s.t.  1 + x_ij^⊤ U^⊤ W U x_ij ≤ x_kj^⊤ U^⊤ W U x_kj + ξ_τ,
          ξ_τ ≥ 0, ∀τ = (i, j, k) ∈ T, and W ∈ S^d_+,    (3)

where ‖W‖_(2,1) = Σ_ℓ (Σ_k w_kℓ²)^{1/2} denotes the (2, 1)-norm of W.
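The effect of the (2, 1)-norm regularizer in (3) can be seen numerically: zero columns of W cost nothing, so the penalty pushes whole columns (and, by symmetry, rows) of W to zero, which is exactly low rank. A small made-up illustration (the column convention follows the definition above):

```python
import numpy as np

def norm21(W):
    """Mixed (2,1)-norm of W: sum over columns l of (sum_k W_{kl}^2)^{1/2}."""
    return float(np.sum(np.linalg.norm(W, axis=0)))

# A rank-1 PSD matrix supported on the first two coordinates: the (2,1)-norm
# "sees" only the two non-zero columns, so the zero columns are free.
p = np.array([1.0, 2.0, 0.0, 0.0])
W = np.outer(p, p)
assert np.linalg.matrix_rank(W) == 1
assert abs(norm21(W) - (np.linalg.norm(W[:, 0]) + np.linalg.norm(W[:, 1]))) < 1e-12

# A dense rank-1 PSD matrix with the same trace has a larger (2,1)-norm here:
q = np.ones(4) * np.sqrt(5.0) / 2.0   # outer(q, q) also has trace 5
assert norm21(np.outer(q, q)) > norm21(W)
```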
A similar mixed (2, 1)-norm regularization was used in [1, 18] for multi-task learning and multi-class classification to learn the sparse representation shared across different tasks or classes.

2.1 Equivalent Saddle Representation

We now turn our attention to an equivalent saddle (min-max) representation of the sparse metric learning problem (3), which is essential for developing optimization algorithms in the next section. To this end, we need the following lemma, which develops and extends a similar version in multi-task learning [1, 2] to the case of learning a positive semi-definite distance matrix.

Lemma 1. Problem (3) is equivalent to the following convex optimization problem

    min_{M⪰0}  Σ_{τ=(i,j,k)∈T} (1 + x_ij^⊤ M x_ij − x_kj^⊤ M x_kj)_+ + γ(Tr(M))².    (4)

Proof. Let M = U W U^⊤ in equation (3), so that W = U^⊤ M U. Hence, (3) is reduced to the following:

    min_{U∈O^d} min_{M∈S^d_+}  Σ_τ ξ_τ + γ‖U^⊤ M U‖²_(2,1)
    s.t.  1 + x_ij^⊤ M x_ij ≤ x_kj^⊤ M x_kj + ξ_τ,
          ξ_τ ≥ 0, ∀τ = (i, j, k) ∈ T, and M ∈ S^d_+.    (5)

Now, for any fixed M in equation (5), by the eigen-decomposition of M there exists Ũ ∈ O^d such that M = Ũ^⊤ λ(M) Ũ. Here, the diagonal matrix λ(M) = diag(λ_1, λ_2, . . . , λ_d), where λ_i is the i-th eigenvalue of M. Let V = ŨU ∈ O^d; then min_{U∈O^d} ‖U^⊤ M U‖_(2,1) = min_{U∈O^d} ‖(ŨU)^⊤ λ(M) ŨU‖_(2,1) = min_{V∈O^d} ‖V^⊤ λ(M) V‖_(2,1).
Observe that

    ‖V^⊤ λ(M) V‖_(2,1) = Σ_i (Σ_j (Σ_k V_ki λ_k V_kj)²)^{1/2}
        = Σ_i (Σ_j Σ_{k,k′} V_ki V_k′i λ_k V_kj λ_k′ V_k′j)^{1/2} = Σ_i (Σ_k λ_k² V_ki²)^{1/2},    (6)

where, in the last equality, we use the fact that V ∈ O^d, i.e. Σ_j V_kj V_k′j = δ_kk′. Applying the Cauchy–Schwarz inequality implies that Σ_k λ_k V_ki² ≤ (Σ_k λ_k² V_ki²)^{1/2} (Σ_k V_ki²)^{1/2} = (Σ_k λ_k² V_ki²)^{1/2}. Putting this back into (6) yields ‖V^⊤ λ(M) V‖_(2,1) ≥ Σ_i Σ_k λ_k V_ki² = Σ_k λ_k = Tr(M), where we use the fact that V ∈ O^d again. However, if we select V to be the identity matrix I_d, then ‖V^⊤ λ(M) V‖_(2,1) = Tr(M). Hence, min_{U∈O^d} ‖U^⊤ M U‖_(2,1) = min_{V∈O^d} ‖V^⊤ λ(M) V‖_(2,1) = Tr(M). Putting this back into equation (5), the result follows.

From the above lemma, we are ready to present an equivalent saddle (min-max) representation of problem (3). First, let Q_1 = {u_τ : τ ∈ T, 0 ≤ u_τ ≤ 1} and Q_2 = {M ∈ S^d_+ : Tr(M) ≤ √(T/γ)}, where T is the cardinality of the triplet set T, i.e. T = #{τ ∈ T}.

Theorem 1. Problem (4) is equivalent to the following saddle representation

    − min_{u∈Q_1} max_{M∈Q_2} { ⟨Σ_{τ=(i,j,k)∈T} u_τ (x_jk x_jk^⊤ − x_ij x_ij^⊤), M⟩ − γ(Tr(M))² − Σ_{τ∈T} u_τ }.    (7)

Proof.
Suppose that M* is an optimal solution of problem (4). By its definition, there holds γ(Tr(M*))² ≤ Σ_{τ∈T} (1 + x_ij^⊤ M x_ij − x_kj^⊤ M x_kj)_+ + γ(Tr(M))² for any M ⪰ 0. Letting M = 0 yields that Tr(M*) ≤ √(T/γ). Hence, problem (4) is identical to

    min_{M∈Q_2}  Σ_{τ=(i,j,k)∈T} (1 + x_ij^⊤ M x_ij − x_kj^⊤ M x_kj)_+ + γ(Tr(M))².    (8)

Observe that s_+ = max{0, s} = max_α {sα : 0 ≤ α ≤ 1}. Consequently, the above equation can be written as min_{M∈Q_2} max_{0≤u≤1} { Σ_{τ∈T} u_τ (1 + x_ij^⊤ M x_ij − x_kj^⊤ M x_kj) } + γ(Tr(M))². By the min-max theorem (e.g. [5]), the above problem is equivalent to − min_{u∈Q_1} max_{M∈Q_2} { Σ_{τ∈T} u_τ (−x_ij^⊤ M x_ij + x_kj^⊤ M x_kj) − γ(Tr(M))² − Σ_{τ∈T} u_τ }. Combining this with the fact that x_kj^⊤ M x_kj − x_ij^⊤ M x_ij = ⟨x_jk x_jk^⊤ − x_ij x_ij^⊤, M⟩ completes the proof of the theorem.

2.2 Related Work

There is a considerable amount of work on metric learning. In [9], an information-theoretic approach to metric learning (ITML) is developed which equivalently transforms the metric learning problem to that of learning an optimal Gaussian distribution with respect to a relative entropy. The method of Relevant Component Analysis (RCA) [7] attempts to find a distance metric which can minimize the covariance matrix imposed by the equivalence constraints. In [25], a distance metric for k-means clustering is learned to shrink the average distance within the similar set while simultaneously enlarging the average distance within the dissimilar set. All the above methods generally do not yield sparse solutions and only work within their special settings.
Maximally Collapsing Metric Learning (MCML) [11] tries to map all points in the same class to a single location in the feature space via a stochastic selection rule. There are many other metric learning approaches in either the unsupervised or the supervised learning setting; see [26] for a detailed review. We particularly mention the following work which is more closely related to our sparse metric learning model (3).

• Large Margin Nearest Neighbor (LMNN) [23, 24]: LMNN aims to learn a large margin nearest neighbor classifier by exploiting nearest neighbor samples as side information in the training set. Specifically, let N_k(x) denote the k-nearest neighbors of sample x and define the similar set S = {(x_i, x_j) : x_i ∈ N(x_j), y_i = y_j} and D = {(x_j, x_k) : x_k ∈ N(x_j), y_k ≠ y_j}. Then, recalling that the triplet set T is given by equation (2), the LMNN framework can be rewritten as follows:

    min_{M⪰0}  Σ_{τ=(i,j,k)∈T} (1 + x_ij^⊤ M x_ij − x_kj^⊤ M x_kj)_+ + γ Tr(CM),    (9)

where the covariance matrix C over the similar set S is defined by C = Σ_{(x_i,x_j)∈S} (x_i − x_j)(x_i − x_j)^⊤. From the above reformulation, we see that LMNN also involves a sparse regularization term Tr(CM). However, the sparsity of CM does not imply the sparsity of M; see the discussion in the experimental section. Large Margin Component Analysis (LMCA) [22] is designed for conducting classification and dimensionality reduction simultaneously. However, LMCA controls the sparsity by directly specifying the dimensionality of the transformation matrix, and it is an extended version of LMNN.
In practice, this low dimensionality is tuned by ad hoc methods such as cross-validation.

• Sparse Metric Learning via Linear Programming (SMLlp) [20]: the spirit of this approach is closer to our method; there, the following sparse framework was proposed:

    min_{M⪰0}  Σ_{τ=(i,j,k)∈T} (1 + x_ij^⊤ M x_ij − x_kj^⊤ M x_kj)_+ + γ Σ_{ℓ,k∈N_d} |M_ℓk|.    (10)

However, the above 1-norm term Σ_{ℓ,k∈N_d} |M_ℓk| can only enforce the element sparsity of M. The learned sparse model would not generate an appropriate low-rank principal matrix M for metric learning. In order to solve the above optimization problem, [20] further proposed to restrict M to the space of diagonally dominant matrices: a small subspace of the positive semi-definite cone. Such a restriction would only result in a sub-optimal solution, although the final optimization is an efficient linear programming problem.

3 Smooth Optimization Algorithms

Nesterov [17, 16] developed an efficient smooth optimization method for solving convex programming problems of the form min_{x∈Q} f(x), where Q is a bounded closed convex set in a finite-dimensional real vector space E. This smooth optimization usually requires f to be differentiable with Lipschitz continuous gradient, and it has an optimal convergence rate of O(1/t²) for smooth problems, where t is the iteration number. Unfortunately, we cannot directly apply the smooth optimization method to problem (4), since the hinge loss there is not continuously differentiable. Below we show how the smooth approximation method [17] can be applied through the saddle representation (7).

3.1 Nesterov's Smooth Approximation Approach

We briefly review Nesterov's approach [17] in the setting of a general min-max problem using smoothing techniques.
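Before the general treatment, it may help to see the smoothing idea in one dimension. The hinge s_+ = max_{0≤α≤1} αs is non-smooth in s; subtracting a small strongly convex term in the multiplier α yields a smooth surrogate that is uniformly μ-close. A scalar sketch (the quadratic prox on α here is an illustrative choice; the construction used for the matrix variable in Section 3.2 employs the squared Frobenius norm):

```python
import numpy as np

def smoothed_hinge(s, mu):
    """mu-smoothing of s_+ = max_{0<=a<=1} a*s via a quadratic prox on a:
    phi_mu(s) = max_{0<=a<=1} (a*s - mu*a^2).  The maximizer has the closed
    form a*(s) = clip(s / (2*mu), 0, 1), which makes phi_mu differentiable."""
    a = np.clip(s / (2.0 * mu), 0.0, 1.0)
    return a * s - mu * a ** 2

mu = 0.01
s = np.linspace(-2.0, 2.0, 401)
hinge = np.maximum(s, 0.0)
# The smooth surrogate never exceeds the hinge and is uniformly mu-close:
assert np.all(smoothed_hinge(s, mu) <= hinge + 1e-12)
assert np.max(hinge - smoothed_hinge(s, mu)) <= mu + 1e-12
```

Driving μ to zero recovers the hinge; a smaller μ means a closer approximation but a larger Lipschitz constant for the gradient, which is exactly the trade-off quantified below.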
To this end, we introduce some useful notation. Let Q_1 (resp. Q_2) be non-empty convex compact sets in finite-dimensional real vector spaces E_1 (resp. E_2) endowed with norm ‖·‖_1 (resp. ‖·‖_2). Let E_2* be the dual space of E_2, with standard norm defined, for any s ∈ E_2*, by ‖s‖_2* = max{⟨s, x⟩_2 : ‖x‖_2 = 1}, where the scalar product ⟨s, x⟩_2 denotes the value of s at x. Let A : E_1 → E_2* be a linear operator. Its adjoint operator A* : E_2 → E_1* is defined, for any x ∈ E_2 and u ∈ E_1, by ⟨Au, x⟩_2 = ⟨A*x, u⟩_1.

Smooth Optimization Algorithm for Sparse Metric Learning (SMLsm)
1. Let ε > 0, t = 0; initialize u^(0) ∈ Q_1, M^(−1) = 0, and let L = (1/2μ) Σ_{τ∈T} ‖X_τ‖_2².
2. Compute M_μ(u^(t)) and ∇φ_μ(u^(t)) = (−1 + ⟨X_τ, M_μ(u^(t))⟩ : τ ∈ T), and let M^(t) = (t/(t+2)) M^(t−1) + (2/(t+2)) M_μ(u^(t)).
3. Compute z^(t) = argmin_{z∈Q_1} { (L/2)‖u^(t) − z‖² + ∇φ_μ(u^(t))^⊤(z − u^(t)) }.
4. Compute v^(t) = argmin_{v∈Q_1} { (L/2)‖u^(0) − v‖² + Σ_{i=0}^{t} ((i+1)/2)(φ_μ(u^(i)) + ∇φ_μ(u^(i))^⊤(v − u^(i))) }.
5. Set u^(t+1) = (2/(t+3)) v^(t) + ((t+1)/(t+3)) z^(t).
6. Set t ← t + 1. Go to step 2 until the stopping criterion is less than ε.

Table 1: Pseudo-code of the first-order Nesterov method
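The accelerated scheme of Table 1 can be sketched generically in a few lines. The following schematic implements steps 3-5 on the box Q_1 = [0, 1]^T, assuming a gradient oracle for the smoothed objective; step 2 (computing M_μ(u) and the resulting gradient) is abstracted away, so this is a sketch of the outer loop rather than a full implementation of SMLsm:

```python
import numpy as np

def smlsm_outer_loop(grad, L, u0, n_iter=200):
    """Skeleton of the accelerated scheme of Table 1 on Q1 = [0, 1]^T.
    `grad` maps u to the gradient of the smoothed objective phi_mu; the
    z/v/u updates follow steps 3-5, using the closed-form projections onto
    the box [0, 1] (clipping), as noted in Section 3.2."""
    u = u0.copy()
    g_acc = np.zeros_like(u0)           # accumulates (i+1)/2 * grad(u_i)
    for t in range(n_iter):
        g = grad(u)
        z = np.clip(u - g / L, 0.0, 1.0)                  # step 3
        g_acc += (t + 1) / 2.0 * g
        v = np.clip(u0 - g_acc / L, 0.0, 1.0)             # step 4
        u = 2.0 / (t + 3) * v + (t + 1.0) / (t + 3) * z   # step 5
    return z

# Toy smooth objective phi(u) = ||u - c||^2 / 2, whose constrained minimizer
# over the box is clip(c, 0, 1); the gradient oracle is u -> u - c.
c = np.array([0.3, 1.7, -0.2])
sol = smlsm_outer_loop(lambda u: u - c, L=1.0, u0=np.zeros(3))
assert np.allclose(sol, np.clip(c, 0.0, 1.0), atol=1e-3)
```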
The norm of such an operator is defined by ‖A‖_{1,2} = max_{x,u} {⟨Au, x⟩_2 : ‖x‖_2 = 1, ‖u‖_1 = 1}.

Now, the min-max problem considered in [17, Section 2] has the following special structure:

    min_{u∈Q_1} { φ(u) = φ̂(u) + max{⟨Au, x⟩_2 − f̂(x) : x ∈ Q_2} }.    (11)

Here, φ̂(u) is assumed to be continuously differentiable and convex with Lipschitz continuous gradient, and f̂(x) is convex and differentiable. The above min-max problem is usually not smooth, and Nesterov [17] proposed a smoothing approximation approach to solve it:

    min_{u∈Q_1} { φ_μ(u) = φ̂(u) + max{⟨Au, x⟩_2 − f̂(x) − μ d_2(x) : x ∈ Q_2} }.    (12)

Here, d_2(·) is a continuous proxy-function, strongly convex on Q_2 with some convexity parameter σ_2 > 0, and μ > 0 is a small smoothing parameter. Let x_0 = argmin_{x∈Q_2} d_2(x). Without loss of generality, assume d_2(x_0) = 0. The strong convexity of d_2(·) with parameter σ_2 means that d_2(x) ≥ (1/2) σ_2 ‖x − x_0‖_2². Since d_2(·) is strongly convex, the solution x_μ(u) of the maximization problem φ̂_μ(u) := max{⟨Au, x⟩_2 − f̂(x) − μ d_2(x) : x ∈ Q_2} is unique, and φ̂_μ is differentiable; see [6, Theorem 4.1]. Indeed, it was established in [17, Theorem 1] that the gradient of φ̂_μ is given by

    ∇φ̂_μ(u) = A* x_μ(u),    (13)
and that it has Lipschitz constant L = ‖A‖²_{1,2} / (μσ_2), i.e. ‖A* x_μ(u_1) − A* x_μ(u_2)‖_1* ≤ (‖A‖²_{1,2} / (μσ_2)) ‖u_1 − u_2‖_1. Hence, the proxy-function d_2 can be regarded as a generalized Moreau–Yosida regularization term which smooths out the objective function.

As mentioned above, the function φ_μ in problem (12) is differentiable with Lipschitz continuous gradient. Hence, we can apply the optimal smooth optimization scheme [17, Section 3] to the smooth approximate problem (12). The optimal scheme needs another proxy-function d(u) associated with Q_1. Assume that d(u_0) = min_{u∈Q_1} d(u) = 0 and that it has convexity parameter σ, i.e. d(u) ≥ (1/2) σ ‖u − u_0‖_1². For this special problem (12), the primal solution u* ∈ Q_1 and the dual solution x* ∈ Q_2 can be obtained simultaneously; see [17, Theorem 3]. Below, we apply this general scheme to solve the min-max representation (7) of the sparse metric learning problem (3), and hence solve the original problem (4).

3.2 Smooth Optimization Approach for Sparse Metric Learning

We now turn our attention to developing a smooth optimization approach for problem (4). Our main idea is to connect the saddle representation (7) in Theorem 1 with the special formulation (11). To this end, first let E_1 = R^T with the standard Euclidean norm ‖·‖_1 = ‖·‖ and E_2 = S^d with the Frobenius norm defined, for any S ∈ S^d, by ‖S‖_2² = Σ_{i,j∈N_d} S_ij². Secondly, the closed convex sets are respectively given by Q_1 = {u = (u_τ : τ ∈ T) ∈ [0, 1]^T} and Q_2 = {M ∈ S^d_+ : Tr(M) ≤ √(T/γ)}. Then, define the proxy-function d_2(M) = ‖M‖_2². Consequently, the proxy-function d_2(·) is strongly convex on Q_2 with convexity parameter σ_2 = 2. Finally, for any τ = (i, j, k) ∈ T, let
Finally, for any \u03c4 = (i, j, k) \u2208 T , let\n\n(cid:80)\n\n(cid:112)\n\ni,j\u2208Nd\n\n2 =\n\n5\n\n\fij. In addition, we replace the variable x by M and (cid:98)\u03c6(u) = \u2212(cid:80)\n\nX\u03c4 = xjkx(cid:62)\n\u03c4\u2208T u\u03c4 in\n(12), \u02c6f(M) = \u03b3(Tr(M))2. Finally, de\ufb01ne the linear operator A : RT \u2192 (S d)\u2217, for any u \u2208 RT , by\n(14)\n\njk \u2212 xijx(cid:62)\n\n(cid:88)\n\nAu =\n\nu\u03c4 X\u03c4 .\n\n\u03c4\u2208T\n\nWith the above preparations, the saddle representation (7) exactly matches the special structure (11)\nwhich can be approximated by problem (12) with \u00b5 suf\ufb01ciently small. The norm of the linear\noperator A can be estimated as follows.\nLemma 2. Let the linear operator A be de\ufb01ned as above, then (cid:107)A(cid:107)1,2 \u2264\nfor any M \u2208 S d, (cid:107)M(cid:107)2 denotes the Frobenius norm of M.\nProof. For any u \u2208 Q1 and M \u2208 S d, we have that\nTr\n\n(cid:162) \u2264(cid:161)(cid:80)\n\n\u03c4\u2208T (cid:107)X\u03c4(cid:107)2\n\n(cid:162)(cid:107)M(cid:107)2\n(cid:162) 1\n(cid:161)(cid:80)\n(cid:161)(cid:80)\n\u03c4\u2208T u\u03c4(cid:107)X\u03c4(cid:107)2\n\u03c4\u2208T (cid:107)X\u03c4(cid:107)2\n\u03c4\u2208T u2\n\u03c4\n\n2 = (cid:107)M(cid:107)2(cid:107)u(cid:107)1\n\n(cid:161)(cid:161)(cid:80)\n\n(cid:179)(cid:80)\n\n\u2264 (cid:107)M(cid:107)2\n\n\u03c4\u2208T u\u03c4 X\u03c4\n\n2 where,\n\n(cid:180) 1\n\n(cid:162) 1\n\n(cid:162)\n\nM\n\n2\n\n2\n\n2\n\n(cid:162) 1\n\n(cid:161)(cid:80)\n(cid:180)\n\u03c4\u2208T (cid:107)X\u03c4(cid:107)2\n\u03c4\u2208T u\u03c4 X\u03c4 )M\n\n2\n\n2 .\n\n:\n\n(cid:169)\n\n(cid:179)\n\n(cid:80)\n\nTr\n\n(\n\nCombining the above inequality with the de\ufb01nition that (cid:107)A(cid:107)1,2 = max\n(cid:107)u(cid:107)1 = 1,(cid:107)M(cid:107)2 = 1\n\nyields the desired result.\n\n(cid:170)\n\n\u03c4\u2208T\n\nWe now can adapt the smooth optimization [17, Section 3 and Theorem 3] to solve the smooth\napproximation formulation (12) for metric 
learning. To this end, let the proxy-function d on Q_1 be the standard Euclidean one, i.e. for some u^(0) ∈ Q_1 ⊆ R^T, d(u) = ‖u − u^(0)‖². The smooth optimization pseudo-code for problem (7) (equivalently, problem (4)) is outlined in Table 1. One can stop the algorithm by monitoring the relative change of the objective function or the change in the duality gap.

The efficiency of Nesterov's smooth optimization largely depends on Steps 2, 3, and 4 in Table 1. Steps 3 and 4 can be solved straightforwardly: z^(t) = min(max(0, u^(t) − ∇φ_μ(u^(t))/L), 1) and v^(t) = min(max(0, u^(0) − Σ_{i=0}^{t} (i + 1) ∇φ_μ(u^(i)) / (2L)), 1). The solution M_μ(u) in Step 2 involves the following problem:

    M_μ(u) = argmax { ⟨Σ_{τ∈T} u_τ X_τ, M⟩ − γ(Tr(M))² − μ‖M‖_2² : M ∈ Q_2 }.    (15)

The next lemma shows that it can be efficiently solved by quadratic programming (QP).

Lemma 3. Problem (15) is equivalent to the following:

    s* = argmax { Σ_{i∈N_d} λ_i s_i − γ(Σ_{i∈N_d} s_i)² − μ Σ_{i∈N_d} s_i² : Σ_{i∈N_d} s_i ≤ √(T/γ), and s_i ≥ 0 ∀i ∈ N_d },    (16)

where λ = (λ_1, . . . , λ_d) are the eigenvalues of Σ_{τ∈T} u_τ X_τ. Moreover, if we denote the eigen-decomposition Σ_{τ∈T} u_τ X_τ = U diag(λ) U^⊤ with some U ∈ O^d, then the optimal solution of problem (15) is given by M_μ(u) = U diag(s*) U^⊤.

Proof. We know from Von Neumann's inequality (see [14] or [4, Page 10]) that, for all X, Y ∈ S^d, Tr(XY) ≤ Σ_i λ_i(X) λ_i(Y), where λ_i(X) and λ_i(Y) are the eigenvalues of X and Y in non-decreasing order, respectively.
The equality is attained whenever X = U diag(λ(X)) U^⊤ and Y = U diag(λ(Y)) U^⊤ for some U ∈ O^d. The desired result follows by applying the above inequality with X = Σ_{τ∈T} u_τ X_τ and Y = M.

It was shown in [17, Theorem 3] that the iteration complexity is of O(1/ε) for finding an ε-optimal solution if we choose μ = O(ε). This is usually much better than the standard sub-gradient descent method, whose iteration complexity is typically O(1/ε²). As listed in Table 1, the complexity of each iteration mainly depends on the eigen-decomposition of Σ_{τ∈T} u_τ X_τ and on the quadratic programming problem (15), which has complexity O(d³). Hence, the overall complexity of the smooth optimization approach for sparse metric learning is of the order O(d³/ε) for finding an ε-optimal solution. As a final remark, the Lipschitz constant given by L = (1/2μ) Σ_τ ‖X_τ‖_2² could be too loose in reality. One can use the line search scheme [15] to further accelerate the algorithm.

4 Experiments

In this section we compare our proposed method with four other methods: (1) the LMNN method [23]; (2) Sparse Metric Learning via Linear Programming (SMLlp) [20]; (3) the information-theoretic approach to metric learning (ITML) [9]; and (4) the Euclidean-distance-based k-Nearest Neighbor (KNN) method (called Euc for brevity). We also implemented the iterative sub-gradient descent algorithm [24] to solve the proposed framework (4) (called SMLgd), in order to evaluate the efficiency of the proposed smooth optimization algorithm SMLsm.
We use all these methods to learn a distance metric, and a KNN classifier is then used to examine the performance of the different learned metrics.

The comparison is done on four benchmark data sets: Wine, Iris, Balance Scale, and Ionosphere, which were obtained from the UCI machine learning repository. We randomly partitioned each data set into a training set and a test set using a ratio of 0.85. We then trained each approach on the training set and performed evaluation on the test set. We repeated the above process 10 times and report the averaged result as the final performance. All the approaches except the Euclidean distance need a triplet set T to be defined before training. Following [20], we randomly generated 1500 triplets for SMLsm, SMLgd, SMLlp, and LMNN. The number of nearest neighbors was tuned via cross-validation for all the methods in the range {1, 3, 5, 7}. The trade-off parameter for SMLsm, SMLgd, SMLlp, and LMNN was also tuned via cross-validation from {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 10⁰, 10¹, 10²}.

The first part of our evaluation focuses on testing the learning accuracy. The results can be seen in Figure 1(a)-(d) for the four data sets. Clearly, the proposed SMLsm demonstrates the best performance. Specifically, SMLsm outperforms the other four methods on Wine and Iris, while it ranks second on Balance Scale and Ionosphere, with slightly lower accuracy than the best method. SMLgd showed different results from SMLsm due to the different optimization methods, which we discuss shortly with Figure 1(i)-(l). We also report the dimension reduction in Figure 1(e)-(h). It is observed that our model outputs the sparsest metric. This validates the advantages of our approach: our method directly learns both an accurate and a sparse distance metric simultaneously.
In contrast, other methods only touch on this topic marginally. SMLlp is not optimal, since it exploits a one-norm regularization term and also relaxes the learning problem; LMNN aims to learn a metric with a large-margin regularization term, which is not directly related to the sparsity of the distance matrix; ITML and Euc do not generate a sparse metric at all. Finally, in order to examine the efficiency of the proposed smooth optimization algorithm, we plot the convergence graphs of SMLsm versus those of SMLgd in Figure 1(i)-(l). As observed, SMLsm converged much faster than SMLgd on all the data sets. SMLgd sometimes oscillated and may incur a long tail due to the non-smooth nature of the hinge loss. For some data sets it converged especially slowly, as can be observed in Figures 1(k) and 1(l).

5 Conclusion

In this paper we proposed a novel regularization framework for learning a sparse (low-rank) distance matrix. This model is realized by a mixed-norm regularization term over the distance matrix which is non-convex. Using its special structure, it was shown to be equivalent to a convex min-max (saddle) representation involving a trace-norm regularization. Starting from the saddle representation, we developed an efficient Nesterov first-order optimization approach [16, 17] for our metric learning model. Experimental results on various datasets show that our sparse metric learning framework outperforms other state-of-the-art methods, with higher accuracy and significantly smaller dimensionality. In future work, we plan to apply our model to large-scale datasets with higher-dimensional features and to use the line search scheme [15] to further accelerate the algorithm.

Acknowledgements

The second author is partially supported by the Excellent SKL Project of NSFC (No. 60723005), China.
The first and third authors are supported by EPSRC grant EP/E027296/1.

Figure 1: Performance comparison among different methods. Subfigures (a)-(d) present the average error rates; (e)-(h) plot the average dimensionality used by the different methods; (i)-(l) give the convergence graphs for the sub-gradient algorithm and the proposed smooth optimization algorithm.

[Figure panels omitted in this extraction; the values shown in the bar charts are:]

Average error rate (%):
            SMLsm  SMLgd  SMLlp  ITML   LMNN   EUC
Wine        0.74   1.95   1.48   2.48   1.48   3.70
Iris        1.30   1.40   2.17   1.74   1.74   3.04
Balance     8.19   9.21   18.62  6.81   8.09   20.11
Ionosphere  9.06   8.87   10.20  12.45  10.19  15.28

Average dimensionality:
            SMLsm  SMLgd  SMLlp  ITML   LMNN   EUC
Wine        5.7    9.3    12.1   13     10.7   13
Iris        1      1      4      4      2      4
Balance     3.3    3.7    4      4      3.3    4
Ionosphere  4.9    11     15.1   33     9.3    33

Panels (i)-(l) plot the normalized objective value against the epoch number for SMLgd and SMLsm on Wine, Iris, Balance, and Ionosphere.

References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. NIPS, 2007.
[2] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. NIPS, 2008.
[3] F. R. Bach. Consistency of trace norm minimization. J. of Machine Learning Research, 9:1019–1048, 2008.
[4] J. M. Borwein and A. S. Lewis.
Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2005.
[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[6] J. F. Bonnans and A. Shapiro. Optimization problems with perturbation: A guided tour. SIAM Review, 40:202–227, 1998.
[7] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. J. of Machine Learning Research, 6:937–965, 2005.
[8] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. CVPR, 2005.
[9] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. ICML, 2007.
[10] G. M. Fung, O. L. Mangasarian, and A. J. Smola. Minimal kernel classifiers. J. of Machine Learning Research, 3:303–321, 2002.
[11] A. Globerson and S. Roweis. Metric learning by collapsing classes. NIPS, 2005.
[12] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. NIPS, 2004.
[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag New York, 2003.
[14] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
[15] A. Nemirovski. Efficient methods in convex programming. Lecture Notes, 1994.
[16] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.
[17] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103:127–152, 2005.
[18] G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing. In press, 2009.
[19] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. ICML, 2005.
[20] R. Rosales and G. Fung.
Learning sparse metrics via linear programming. KDD, 2006.
[21] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. NIPS, 2005.
[22] L. Torresani and K. Lee. Large margin component analysis. NIPS, 2007.
[23] K. Q. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbour classification. NIPS, 2006.
[24] K. Q. Weinberger and L. K. Saul. Fast solvers and efficient implementations for distance metric learning. ICML, 2008.
[25] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning with application to clustering with side information. NIPS, 2002.
[26] L. Yang and R. Jin. Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 2007.