## Great overview of graph embedding methods
### The encoder, decoder and loss function formalism is very well designed!
### Similarity matrix:
Regarding Table 1: maybe there are methods dedicated to the case where the similarity matrix is required to be positive semidefinite (PSD).
Indeed, (i) there are plenty of relevant PSD matrices (such as "$D+A$", "$A*A$" or the cosine similarity matrix, which are all PSD) and (ii) the embedding minimizing the L2 loss function is then easy to find: it is given by the eigenvectors associated with the largest eigenvalues, scaled by the square roots of those eigenvalues. This makes this kind of method very handy.
This kind of [kernel PCA](https://en.wikipedia.org/wiki/Kernel_principal_component_analysis)-like method does not work if the similarity matrix (which plays the role of the kernel here) is not PSD.
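
To make point (ii) concrete, here is a minimal sketch in NumPy, assuming a small dense symmetric positive semidefinite similarity matrix. The function name `psd_embedding` and the toy path graph are mine, not from the survey; the similarity used is the "$D+A$" matrix mentioned above.

```python
import numpy as np

def psd_embedding(S, d):
    """Return Z (n x d) minimizing ||Z Z^T - S||_F for a symmetric PSD matrix S."""
    eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]         # indices of the d largest eigenvalues
    lam = np.clip(eigvals[top], 0.0, None)      # guard against tiny negative round-off
    return eigvecs[:, top] * np.sqrt(lam)       # scale each eigenvector by sqrt(eigenvalue)

# Toy example (hypothetical): path graph on 4 nodes, similarity S = D + A.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
S = D + A

Z = psd_embedding(S, d=2)       # 2-dimensional embedding
full = psd_embedding(S, d=4)    # full rank: reconstructs S up to round-off
```

The inner product `Z[i] @ Z[j]` is then the reconstruction of $s(i,j)$, and the reconstruction error is governed by the eigenvalues that were dropped. For large sparse graphs one would presumably use an iterative eigensolver rather than a dense decomposition.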
### Shallow embedding VS autoencoder-based embedding:
Limitations of "shallow embedding methods" are highlighted on page 9 and it seems that autoencoder-based methods answer some of them. But this part is not totally clear to me.
From what I understand, the major difference between the two kinds of methods is that:
- for the methods referred to as "shallow embedding methods", if you want to reconstruct the entry $s(i,j)$ of the input similarity matrix, then you need the learned feature vectors of both nodes $i$ and $j$ and then the reconstruction of $s(i,j)$ is often simple, e.g. the scalar product between the two feature vectors (cf. equation (5)) or sometimes a softmax (cf. equation (10)) is added on top of the scalar product;
- while in the case of autoencoder-based embedding methods, you only need the learned feature vector of a single node $i$ to reconstruct the full column $s_i$ of the input similarity matrix, and the reconstruction is more elaborate.
Note that in shallow methods the decoder shares no parameters between nodes, while in autoencoder-based methods some parameters can be shared. Autoencoder-based methods can thus have fewer parameters, which could in principle make them faster and provide some regularisation.
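
As a rough illustration of how I read this difference, here is a sketch with untrained random weights. All names and sizes are hypothetical, and the single-layer encoder/decoder is a simplification of mine, not the SDNE or DNGR architecture. It only shows which inputs each kind of decoder needs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 2                          # hypothetical sizes: 6 nodes, 2-dim embeddings

# Toy symmetric similarity matrix standing in for s(i, j).
S = rng.random((n, n))
S = (S + S.T) / 2

# Shallow method: one free vector per node (a lookup table).
# Reconstructing s(i, j) needs the vectors of BOTH nodes i and j.
Z = rng.normal(size=(n, d))

def shallow_decode(i, j):
    return Z[i] @ Z[j]               # scalar product, as in equation (5)

# Autoencoder-style method: weights shared by all nodes.
# The code of a SINGLE node i is decoded into the full column s_i.
W_enc = rng.normal(size=(d, n))
W_dec = rng.normal(size=(n, d))

def encode(s_i):
    return np.tanh(W_enc @ s_i)      # compress the column s_i into d numbers

def decode(z_i):
    return W_dec @ z_i               # reconstruct all n entries at once

z0 = encode(S[:, 0])
s0_hat = decode(z0)                  # estimate of the whole column s_0
s01_hat = shallow_decode(0, 1)       # estimate of the single entry s(0, 1)
```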
In addition, if a previously unseen node arrives, an autoencoder-based method could in principle still give a feature vector for that node, while it does not seem to be the case for shallow methods (they are referred to as "inherently transductive" methods).
"Shallow embedding also fails to leverage node attributes during encoding". I think that it is possible to adapt both shallow methods and autoencoder-based methods to take into account this additional information on node attributes. I didn't get this point.
As noted in the survey, it seems that current autoencoder-based methods (SDNE and DNGR) are not scalable and are inherently transductive: "the input dimension to the autoencoder is fixed at $|V|$, which can be extremely costly and even intractable for graphs with millions of nodes. In addition, the structure and size of the autoencoder is fixed, so SDNE and DNGR are strictly transductive and cannot cope with evolving graphs, nor can they generalize across graphs."
The scalability issue has also been noted in [VERSE](https://papers-gamma.link/paper/48/):
"Works such as [12, 48] investigate deep learning approaches for graph embeddings. Their results amount to complex models that require elaborate parameter tuning and computationally expensive optimization, leading to time and space complexities unsuitable for large graph" where [12] and [48] are SDNE and DNGR.
### Neighborhood aggregation and convolutional encoders:
This part seems important, in particular Algorithm 1, which is referred to in subsequent parts. However, I found this section very hard to understand.
### Incorporating task-specific supervision:
Even though this section is short, I find it very interesting.
### Typos:
- "they found the the more complex aggregators"
- In equation (24) $h_j^{k-1}$
- End of page 20. What is $\Epsilon$ in $O(|\Epsilon|)$?
- "(e.g. , using..."
- "[8] J. Bruna, W. Zaremba, and Y. Szlam, A.and LeCun. Spectral networks and locally connected networks on graphs..."


## Great SDP relaxation of max-k-cut and max-bisection: very inspiring!
### Complicated analysis:
The analysis proving the approximation guarantee is quite complicated though, much more so than that of [the Goemans-Williamson algorithm](https://en.wikipedia.org/wiki/Semidefinite_programming#Example_3) for max-cut.
Is a simpler analysis possible? Or is there another relaxation that leads to a simpler analysis with similar or better approximation guarantees?
### Implementing the algorithms in a scalable way:
How can we implement such an algorithm in a scalable way, say, for a sparse graph with 1M nodes and 100M edges?
For Goemans-Williamson, here is an attempt: https://github.com/maxdan94/spingraphSDP
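
To make the question more concrete for max-cut, here is one direction I can imagine: replace the full SDP by a low-rank version (one unit vector of small dimension `k` per node, in the spirit of Burer-Monteiro), optimise it by projected gradient steps over the edge list, then round with a random hyperplane as in Goemans-Williamson. This is a sketch under my own assumptions, not the method of the linked repository, and it comes with no guarantee of reaching the SDP optimum. The function name, the parameters `k`, `iters` and `step`, and the toy graph are all hypothetical.

```python
import numpy as np

def low_rank_maxcut(edges, n, k=8, iters=100, step=0.1, seed=0):
    """Heuristic max-cut: low-rank relaxation + random-hyperplane rounding.

    edges: (m, 2) integer array, each undirected edge listed once.
    Returns (side, cut), where side is a boolean array of length n.
    """
    rng = np.random.default_rng(seed)
    src, dst = edges[:, 0], edges[:, 1]

    # One unit vector per node, in dimension k instead of n.
    V = rng.normal(size=(n, k))
    V /= np.linalg.norm(V, axis=1, keepdims=True)

    for _ in range(iters):
        # G[i] = sum of the vectors of i's neighbours: the gradient of the
        # sum over edges of <v_i, v_j>, which we want to make small.
        G = np.zeros_like(V)
        np.add.at(G, src, V[dst])
        np.add.at(G, dst, V[src])
        V = V - step * G
        # Project every row back onto the unit sphere.
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        V /= np.maximum(norms, 1e-12)

    # Rounding: split the nodes with a random hyperplane through the origin.
    r = rng.normal(size=k)
    side = V @ r >= 0
    cut = int(np.sum(side[src] != side[dst]))
    return side, cut

# Toy usage on a 4-cycle (hypothetical input).
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 0]])
side, cut = low_rank_maxcut(edges, n=4)
```

Each iteration touches every edge once and costs $O((|V| + |E|)k)$ time. Whether this is good enough for 100M edges, and what remains of the approximation guarantee, is exactly what I am asking.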

## Comments: