Very nice paper! The main idea: if the distance between two embedded nodes has to be large then one should not waste energy in order to make this distance a bit smaller. Trying to make large distances a bit smaller is actually what Laplacian embedding is doing with its quadratic objective. One should rather focus on small distances and give up on the pairs of embedded nodes that are far away. ### Embedding complex networks: This method could be applied to social networks (or complex networks in general): social networks are known to have "weak ties" (roughly speaking, weak ties are links between people that do not have many friends in common) that may perturb laplacian embedding. ### Asymptotic complexity and scalability in practice: The suggested algorithm to solve the optimization does not seem very scalable. The asymptotic running time complexity is not given. ### Concerns about the optimization in dimension higher than 1. The example in 2-D. In equations (9-11), what if we change the constraints into: $||x||^2+||y||^2=1$, $e^Tx=0$ and $e^Ty=0$? In n-D, denoting $r_i$ the vector associated to node $i$, that would give $\sum_ir_i^2=1$ and $\sum_ir_i=0$. This seems to be a more relevant optimization to me: minimize the distance between nodes in 2-D without setting constraints for each dimension, but rather global constraints. If the objective is the one of the laplacian embedding, is it true that, in the optimal solution, $y$ will be proportional to $x$? ($x$ and $y$ will both be proportional to the eigenvector associated with the smallest eigenvalue of the Laplacian?) If so, is it still true if the objective is another one? Like the Cauchy embedding objective for instance? ### Others: - How to tune the parameter $\sigma$?
> - How to tune the parameter $σ$? I don't see where $σ$ is defined, did I miss something?
Note that $\sum_i r_i = 0$ means that vector $r$ is orthogonal to vector $v_0 = (1,1,\ldots\,1,1,1)$. The latter corresponds to the smallest eigen-value $\lambda_0 = 0$ of [Laplacian matrix](https://en.wikipedia.org/wiki/Laplacian_matrix). Adding constraint $r \perp v_0$ to the optimization problem drives us to seek the second smallest eigenpair. You may also read very related and well-written [lecture](https://ocw.mit.edu/courses/mathematics/18-409-topics-in-theoretical-computer-science-an-algorithmists-toolkit-fall-2009/lecture-notes/MIT18_409F09_scribe3.pdf) about Rayleigh quotients, min-max theorem (Courant–Fischer–Wey). Gives a good background in the field of eigen-based solutions for optimization problems. In 2016, during the preparation of [my talk](http://www.complexnetworks.fr/temporal-density-of-complex-networks-and-ego-community-dynamics/) titled "Temporal density of complex networks and ego-community dynamics" ([slides](https://kirgizov.link/talks/temporal-density-4-july-2016.pdf), french). I have read stuff about graph optimization problems. Below I present my personal (perhaps a little bit random selection) of related papers and books * [Four Cheeger-type Inequalities for Graph Partitioning Algorithms](https://www.math.ucsd.edu/~fan/wp/heaticcm.pdf), Fan Chung, 2007 * [Normalized cuts and image segmentation](https://people.eecs.berkeley.edu/~malik/papers/SM-ncut.pdf), Jianbo Shi and Jitendra Malik, 2000 * [Eigenvalues of graphs](https://web.cs.elte.hu/~lovasz/eigenvals-x.pdf), László Lovász, 2007 * [Spectres de graphes](https://www-fourier.ujf-grenoble.fr/~ycolver/All-Articles/98a.pdf) de Yves Colin de Verdière, 1998
## Great overview of graph embedding methods ### The encoder, decoder and loss function formalism is very well designed! ### Similarity matrix: In Table 1, maybe there are methods dedicated to the case where the similarity matrix is required to be a positive semidefinite (SDP) matrix. Indeed, (i) there are plenty of such relevant SDP matrices (such as "$D+A$", "$A*A$" or "the cosine similarity" which are all SDP) and (ii) the optimal embedding minimizing the L2 loss function is easy to find using the eigenvectors associated with the largest eigenvalues. It thus makes this kind of method very handy. This kind of [kernel PCA](https://en.wikipedia.org/wiki/Kernel_principal_component_analysis)-like method does not work if the similarity matrix (somehow the kernel here) is not SDP. ### Shallow embedding VS autoencoder-based embedding: Limitations of "shallow embedding methods" are highlighted on page 9 and it seems that autoencoder-based methods answer some of them. But this part is not totally clear to me. From what I understand, the major difference between the two kinds of methods is that: - for the methods referred to as "shallow embedding methods", if you want to reconstruct the entry $s(i,j)$ of the input similarity matrix, then you need the learned feature vectors of both nodes $i$ and $j$ and then the reconstruction of $s(i,j)$ is often simple, e.g. the scalar product between the two feature vectors (cf. equation (5)) or sometimes a softmax (cf. equation (10)) is added on top of the scalar product; - while in the case of autoencoder-based embedding methods, you only need the learned feature vector of a single node $i$ to reconstruct the full column $s_i$ of the input similarity matrix and the reconstruction is more elaborated. We can notice that no parameters are shared between nodes in the decoder for the shallow methods, while some parameters can be shared in the case of autoencoder-based methods. There thus can be fewer parameters in autoencoder-based methods and thus it could be in principle faster and have some regularisation. In addition, if a previously unseen node arrives, an autoencoder-based method could in principle still give a feature vector for that node, while it does not seem to be the case for shallow methods (they are referred to as "inherently transductive" methods). "Shallow embedding also fails to leverage node attributes during encoding". I think that it is possible to adapt both shallow methods and autoencoder-based methods to take into account this additional information on node attributes. I didn't get this point. As noted in the survey, it seems that current autoencoder-based methods (SDNE and DNGR) are not scalable and are inherently transductive: "the input dimension to the autoencoder is fixed at $|V|$, which can be extremely costly and even intractable for graphs with millions of nodes. In addition, the structure and size of the autoencoder is fixed, so SDNE and DNGR are strictly transductive and cannot cope with evolving graphs, nor can they generalize across graphs." The scalability issue has also been noted in [VERSE](https://papers-gamma.link/paper/48/): "Works such as [12, 48] investigate deep learning approaches for graph embeddings. Their results amount to complex models that require elaborate parameter tuning and computationally expensive optimization, leading to time and space complexities unsuitable for large graph" where [12] and [48] are SDNE and DNGR. ### Neighborhood aggregation and convolutional encoders: This part seems important. In particular, Algorithm 1 seems important as it is referred to in subsequent parts. However, I found this section very hard to understand. ### Incorporating task-specific supervision: Even though this section is short, I find it very interesting. ### Typos: - "they found the the more complex aggregators" - In equation (24) $h_j^{k-1}$ - End of page 20. What is $\Epsilon$ in $O(|\Epsilon|)$? - "(e.g. , using..." - "[8] J. Bruna, W. Zaremba, and Y. Szlam, A.and LeCun. Spectral networks and locally connected networks on graphs..."