VERSE: Versatile Graph Embeddings from Similarity Measures
Uploaded by: Open Reading Group
Upload date: 2018-05-02 15:19:41
Edited at: 2018-05-14 20:48:22
Edited by: Sergey Kirgizov

Comments:

### Nice validation of the obtained embedding:

The obtained embedding seems useful for: (i) node clustering, (ii) link prediction and (iii) node classification.

### Concerns about the scalability of Louvain:

The Louvain algorithm can process Orkut in a few seconds on a commodity machine, contrary to what Table 2 suggests. The obtained modularity is 0.68, as reported in Table 3 (the table presenting the datasets). This value of 0.68 is much higher than the modularity obtained via the embeddings and k-means.

### Table 2 reporting the average and worst case running time is confusing:

In Table 2, $\Theta$ is used to denote the average running time and $\mathcal{O}$ is used to denote the worst-case running time. This is very confusing, since a worst-case running time can be stated in either big-oh or big-theta notation. Definitions are given here: https://en.wikipedia.org/wiki/Big_O_notation

It is also not clear how these asymptotic complexities were obtained; in particular, the asymptotic time and memory complexities of the suggested algorithm are not proven.
### VERSE Embedding Model

"We define the unnormalized distance between two nodes $u, v$ in the embedding space as the dot product of their embeddings $W_u \cdot W_v^{\top}$. The similarity distribution in the embedded space is then normalized with softmax."

What is the intuition behind the softmax?
Hey!

> ### Concerns about the scalability of Louvain:
>
> The Louvain algorithm can process Orkut in a few seconds on a commodity machine, contrary to what Table 2 suggests. The obtained modularity is 0.68, as reported in Table 3 (the table presenting the datasets). This value of 0.68 is much higher than the modularity obtained via the embeddings and k-means.

In the paper, we used the python-louvain package from pip: https://github.com/taynaud/python-louvain It took significant time (~30 minutes) on the YouTube network; it is probably less scalable than a more optimized implementation one could use. In any case, this experiment should tell more about the type of structures we find by clustering the embeddings with k-means than provide a competitor for direct modularity optimization. Perhaps an interesting research direction would be something like embedding-guided modularity optimization. ;)

> ### Table 2 reporting the average and worst case running time is confusing:
>
> In Table 2, $\Theta$ is used to denote the average running time and $\mathcal{O}$ is used to denote the worst-case running time. This is very confusing, since a worst-case running time can be stated in either big-oh or big-theta notation.
> Definitions are given here: https://en.wikipedia.org/wiki/Big_O_notation
>
> It is also not clear how these asymptotic complexities were obtained; in particular, the asymptotic time and memory complexities of the suggested algorithm are not proven.

Let me try to explain the complexities in a different way. The "average case" complexities are given for a "normal" sparse scale-free network with relatively low average degree. The "worst case" complexities are actual worst-case analyses. For sampled VERSE, we only need to store the embedding (one $n \times d$ matrix) and the graph. The runtime complexity is determined by the number of samples per node, which we did not change in the experiments. We had an experiment varying this number, but unfortunately it had to be cut due to the space constraints. Hope that helps.

> "We define the unnormalized distance between two nodes $u, v$ in the embedding space as the dot product of their embeddings $W_u \cdot W_v^{\top}$. The similarity distribution in the embedded space is then normalized with softmax."
>
> What is the intuition behind the softmax?

We want to compare similarity distributions, as in "probability distributions". In order to do that, the distributions must be normalized. Softmax normalization has some nice computational and optimization properties. By using it, we explicitly assume that the outputs of the network are log-probabilities of the target classes (output nodes).

Best,
Anton
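To make the normalization concrete, here is a minimal numpy sketch of the softmax step described above; the embedding matrix `W` and its values are purely illustrative, not a trained VERSE output:

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))  # toy embeddings: n = 5 nodes, d = 3 dimensions

u = 0
scores = W @ W[u]          # unnormalized similarities W_u · W_v^T for every node v
sim_u = softmax(scores)    # a proper probability distribution over nodes
print(sim_u, sim_u.sum())  # sums to 1, so it can be compared to e.g. a PPR row
```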
Stranger at 2018-05-14 18:23:43
Edited by Sergey Kirgizov at 2018-05-14 20:33:17
Hi Anton, currently it is not possible for anonymous users to edit their comments (but we are working on this!). So I allowed myself to correct some glitches in the math notation inside your post. Have a nice day!
Thanks for your answer, Anton.

The original version of Louvain in C++, as well as newer C++ versions, can process all these graphs within a few seconds. The code is available here: https://sites.google.com/site/findcommunities/

The python version is not intended to be scalable.
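For readers who want to reproduce the comparison, a minimal sketch of the python-louvain workflow mentioned above (the karate-club graph stands in for Orkut/YouTube; timings will of course differ between this package and the C++ code):

```python
import networkx as nx
import community as community_louvain  # pip install python-louvain

# Stand-in graph; the experiments discussed here use Orkut and YouTube instead.
G = nx.karate_club_graph()

partition = community_louvain.best_partition(G)  # maps node -> community id
Q = community_louvain.modularity(partition, G)   # modularity of that partition
print(f"{len(set(partition.values()))} communities, modularity = {Q:.3f}")
```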
> Thanks for your answer, Anton.
>
> The original version of Louvain in C++, as well as newer C++ versions, can process all these graphs within a few seconds. The code is available here: https://sites.google.com/site/findcommunities/
>
> The python version is not intended to be scalable.

Just a few quick remarks:

- One should not use the implementation from sites.google.com but rather the one from SourceForge (there is a large red warning on the sites.google.com home page). Unfortunately, the paper is 10 years old and the URL cannot be changed.
- It is necessary to use the quality threshold option to obtain a good "time vs accuracy" compromise when dealing with large graphs. A value of .005 works well: it really speeds up the computation with nearly no quality loss. With such a choice, the Orkut computation takes a few hundred seconds.
- Since Louvain is not deterministic, a very precise and long computation (with an extremely low threshold) can be beaten in terms of modularity by a very fast computation (with a high threshold). In any case, a set of experiments should always be performed, and average/percentile/max/... values should be displayed instead of one raw value.
### Embeddings to find communities:

I would like to point to a very different, but related approach: http://web.stanford.edu/~montanar/SDPgraph/
Maximimi at 2018-05-25 16:55:52
Edited by Maximimi at 2018-05-25 20:24:04
### Concerns about the justification of using KL divergence:

"As an optimization objective, we aim to minimize the Kullback-Leibler (KL) divergence from the given similarity distribution to that of the embedded space"

"We illustrate the usefulness of this objective using a small similarity matrix. Figure 2 shows (a) the Personalized PageRank matrix, (b) the reconstruction of the same matrix by VERSE, and (c) the reconstruction of the same matrix using SVD. It is visible that the nonlinear minimization of KL divergence between distributions preserves most of the information in the original matrix, while the linear SVD-based reconstruction fails to differentiate some nodes"

I think that this small example shows that the overall VERSE method is better than SVD. But it does not specifically show why using the KL divergence as an objective to minimize is more relevant than using another objective. Is it relevant to use another dissimilarity between distributions?

Also, is the result of the optimization optimal? Or can we get stuck in a local minimum? In the latter case, does it make sense to run the optimization from scratch several times and take the result that minimizes the objective?
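To make the objective under discussion concrete, here is a small numpy sketch of the per-node KL term, summed over all nodes; the matrix `P` (standing in for e.g. a PPR matrix) and the embeddings `W` are toy placeholders:

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def kl_objective(P, W, eps=1e-12):
    """Sum over nodes u of KL( P[u] || softmax(W @ W[u]) )."""
    Q = softmax_rows(W @ W.T)  # embedded similarity distributions, one row per node
    return np.sum(P * (np.log(P + eps) - np.log(Q + eps)))

rng = np.random.default_rng(0)
P = softmax_rows(rng.random((5, 5)))  # stand-in row-normalized similarity matrix
W = rng.normal(size=(5, 3))
print(kl_objective(P, W))  # the quantity that SGD would drive down
```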
Maximimi at 2018-05-25 17:53:51
Edited by Maximimi at 2018-05-25 21:19:27
### Softmax regression:

http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/
### Similarity matrix:

In the paper, "similarity matrix" is used to refer to the input matrix. This might not be appropriate, as one may think that the matrix needs to be positive semidefinite, which is not the case.
### Link to word2vec:

Here is an alternative approach: given a matrix related to the input graph in which each row sums to one (e.g. the matrix where $A_{ij}=\frac{1}{d_i}$ whenever $ij$ is an edge, and $0$ otherwise), use word2vec where the "context" of a node is its corresponding row. I came up with this after reading this nice tutorial: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

I think that the approach suggested in the paper is very similar to the one I suggest above. What are the differences?
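A tiny numpy sketch of the suggested row-stochastic matrix, built from a toy adjacency matrix (purely illustrative):

```python
import numpy as np

# Toy undirected adjacency matrix on 4 nodes.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

deg = A.sum(axis=1)
P = A / deg[:, None]  # P[i, j] = 1/d_i for every neighbor j of i
print(P.sum(axis=1))  # each row sums to 1, i.e. a "context" distribution per node
```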
Hello all, I will try to give answers to some of the questions that were raised.

> I think that this small example shows that the overall VERSE method is better than SVD. But it does not specifically show why using the KL divergence as an objective to minimize is more relevant than using another objective.

Well, one can view VERSE as a matrix decomposition with a KL objective function for a specific type of matrices. Then the only differences are the optimization (for SVD we have a "closed-form" solution; for VERSE, we use SGD) and the objective function ($KL$ or $L_2$). So either the "closed-form" SVD result is far from the optimum (it is not), or the problem is in the objective.

> Is it relevant to use another dissimilarity between distributions?

Sure! We picked the most efficient one that allows for connections with other embedding methods, but surely you could do something like gradient-based EMD optimization, assuming your similarity function is smooth (Lipschitz continuous).

> Also, is the result of the optimization optimal? Or can we get stuck in a local minimum? In the latter case, does it make sense to run the optimization from scratch several times and take the result that minimizes the objective?

I believe there are many local optima for the model, but the geometry of the space at almost all of them is disentangled (= the embeddings are good). In fact, there are infinitely many solutions (https://arxiv.org/abs/1704.08059). I would assume that adding weight regularization would help to converge to a more stable solution.

> I think that the approach suggested in the paper is very similar to the one I suggest above. What are the differences?

Essentially, you have (re-)invented LINE (https://arxiv.org/abs/1503.03578). For this work, we are more interested in the whole distribution of similarities, as in PageRank. We show that it is not enough to just consider neighbor nodes as "context" if you want good embedding quality. In fact, we suggest that it is useful to abstract completely from the word2vec notion of context, and just use similarities instead.

After we agree on the notions, a reasonable question is "is it still word2vec when you force a user to select the similarity (context distribution)?". The answer is, of course, yes and no. :) We show in the paper that, even though thinking about similarities eases the understanding of different models, the negative sampling used in LINE/node2vec/... does not optimize for the similarity! This is essentially because negative sampling converges to something different from the true estimator for multiclass logistic regression (aka softmax regression).

Best,
Anton
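To illustrate the "closed-form $L_2$" point in Anton's first answer, a toy sketch of the rank-$d$ truncated SVD, which is the exact minimizer of the Frobenius ($L_2$) reconstruction error; the matrix `P` here is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((6, 6))
P /= P.sum(axis=1, keepdims=True)  # rows sum to 1, like a PPR similarity matrix

d = 2
U, s, Vt = np.linalg.svd(P)
P_svd = (U[:, :d] * s[:d]) @ Vt[:d]  # best rank-d approximation in L2, closed form

print("rank-2 SVD Frobenius error:", np.linalg.norm(P - P_svd))
# A KL-based fit, as in VERSE, has no closed form and is trained with SGD instead.
```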
Thanks for your answers and insight, Anton.

> > I think that the approach suggested in the paper is very similar to the one I suggest above. What are the differences?
>
> Essentially, you have (re-)invented LINE (https://arxiv.org/abs/1503.03578).

If you see the matrix $W$ used in LINE as the similarity matrix you use in VERSE, then yes, I think that what I have suggested is similar to LINE. So VERSE can be seen as using LINE on the input similarity matrix? Is that correct?
> We show in the paper that, even though thinking about similarities eases the understanding of different models, the negative sampling used in LINE/node2vec/... does not optimize for the similarity! This is essentially because negative sampling converges to something different from the true estimator for multiclass logistic regression (aka softmax regression).

I had missed that. You mean that in LINE, "negative sampling" is used for the optimization, while in VERSE, "Noise Contrastive Estimation" is used?
> Perhaps, an interesting research direction would be something like embedding-guided modularity optimization.

Maybe using the modularity matrix (https://en.wikipedia.org/wiki/Modularity_(networks)#Matrix_formulation) as the similarity matrix in VERSE? This matrix can have negative entries, though...
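For reference, a small numpy sketch of the modularity matrix $B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$ on a toy graph, which makes the negative-entries caveat visible:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

k = A.sum(axis=1)  # node degrees
m = A.sum() / 2    # number of edges
B = A - np.outer(k, k) / (2 * m)  # modularity matrix
print(B)  # note the negative entries: its rows are not probability distributions
```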
Hi!

> If you see the matrix $W$ used in LINE as the similarity matrix you use in VERSE, then yes, I think that what I have suggested is similar to LINE. So VERSE can be seen as using LINE on the input similarity matrix? Is that correct?

VERSE can be seen as going beyond this. ;) If we just plug the similarities into the LINE training objective, we observe (Fig. 3, NS line) that the objective does not recover the similarities at all! This is very surprising, and suggests that current embedding methods such as node2vec or LINE do not optimize the actual objective they claim to optimize. When I discovered that, I thought it was just a bug in the code, so I made a version of VERSE that optimizes the random-walk-based similarity from DeepWalk. It turned out that DeepWalk does not preserve the similarity ranking of these walks.

> I had missed that. You mean that in LINE, "negative sampling" is used for the optimization, while in VERSE, "Noise Contrastive Estimation" is used?

Yes! To overcome the aforementioned problem, we have proposed an NCE-based sampling solution. It seems to work much better than plain negative sampling, and we show that it is close to the full neural factorization of the matrix (which is only possible for the VERSE model)!

> > > Perhaps, an interesting research direction would be something like embedding-guided modularity optimization.
> >
> > Maybe using the modularity matrix (https://en.wikipedia.org/wiki/Modularity_(networks)#Matrix_formulation) as the similarity matrix in VERSE? This matrix can have negative entries, though...

Yes, the modularity matrix would probably not suit the algorithm well, as it is not probabilistic per node. Maybe some reformulation would help; I am not aware of any to this day.
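To spell out the distinction Anton draws, here is a schematic numpy sketch of the two per-sample losses; it is a simplification for illustration, not the paper's exact algorithm, and the scores and noise probabilities are placeholders. NCE shifts each logit by the log noise probability, which is what lets its optimum recover the softmax-normalized similarities, while plain negative sampling converges to something else:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(score_pos, scores_neg):
    # word2vec/LINE-style negative sampling: plain logistic terms on raw scores.
    return -np.log(sigmoid(score_pos)) - np.sum(np.log(sigmoid(-scores_neg)))

def nce_loss(score_pos, scores_neg, log_q_pos, log_q_neg, k):
    # NCE: classify "data vs noise" with logits corrected by log(k * Q(node)),
    # where Q is the noise distribution and k the number of noise samples.
    p_pos = sigmoid(score_pos - np.log(k) - log_q_pos)
    p_neg = sigmoid(scores_neg - np.log(k) - log_q_neg)
    return -np.log(p_pos) - np.sum(np.log(1.0 - p_neg))

rng = np.random.default_rng(0)
s_pos, s_neg = 1.2, rng.normal(size=3)  # placeholder dot-product scores
log_q = np.log(np.full(3, 0.25))        # placeholder noise probabilities
print(neg_sampling_loss(s_pos, s_neg))
print(nce_loss(s_pos, s_neg, np.log(0.25), log_q, k=3))
```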
