☆

1

### Style
I like the Levin style, he writes a bit provocative phrases that looks, in the same time, incredibly truthful (at least for the author himself). Consider for example what he says about the invention of positional numeral systems and Quantum computers:
> Archimedes made a great discovery that digital representation of numbers is exponentially
> more efficient than analog ones (sand pile sizes). Many subsequent analog devices yielded
> unimpressive results. It is not clear why QCs [quantum computers] should be an exception.
### One-way function and the axiom of choice
One-way functions and the axiom of choice deals with practical
or conceptual (im-)possibility to solve [inverse problems](https://en.wikipedia.org/wiki/Inverse_problem) arising in Computer Science and Mathematics.
### Kolmogorov complexity and One-way functions
Two primes $p$ and $q$ have almost the same [informational content](https://en.wikipedia.org/wiki/Algorithmic_information_theory) as their product $pq$.
However, the [$p, q \mapsto pq$](https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Arithmetic_functions) is much more easy to do than
[$pq \mapsto p,q$](https://en.wikipedia.org/wiki/Integer_factorization#Prime_decomposition).

☆

1

## Interesting paper!
Code is coming soon: http://spcl.inf.ethz.ch/Research/Performance/LogGraph/
The method can reduce the size needed to store the input graph by 35% compared to the adjacency array format, while not making computations slower.
Several compression frameworks are suggested in the paper. All of them use fixed length compression codes contrarily the WebGraph framework.
One of the compression frameworks leads to a space of $\sum_{v\in V} \log(\widehat{N_v})\cdot d_v+\log(\log(\widehat{N_v}))$ bits (cf. section 3.2, $\widehat{N_v}$ is the maximum ID of the neighbors of node $v$) as it uses $\log(\widehat{N_v})$ bits to store the $d_v$ neighbors of each node $v$ and $\log(\log(\widehat{N_v}))$ bits to store the length of the codes used for each node $v$. Note that an additional cost is needed in order to have a direct access to the neighbors of each node (this is done using succinct bit vectors (it seems to lead to a total number of bits close to $n\cdot \log(m)$)). The nodes can be renumbered such that the number of bits is minimized.
### Approaches the compression ratio of WebGraph
I read that the suggested compression method "approaches the compression ratio of the established WebGraph compression library". What does it mean exactly? The Webgraph compression library can compress web graphs using less than 2 bits per link. The suggested compression framework seems to be far from that compression ratio.
So I think that what is meant is that on graphs other than web graphs the suggested method approaches the compression ratio of the Webgraph library. Am I correct?
### Java VS C++
I read "We use the WebGraph original tuned Java implementation for gathering the data on compression ratios but, as Log(Graph) is developed in C++, we use a proof-of-a-concept C++ implementation of WG schemes for a fair C++ based performance analysis.".
I do not understand. This is because the original Java implementation is slower than a C++ implementation? Which C++ implementation was used for WG?
### Cost function in section 3.6
The quantity to minimize should be $\sum_{v\in V} \log(\widehat{N_v})\cdot d_v$ (cf. section 3.2, $\widehat{N_v}$ is the maximum ID of the neighbors of node $v$), that is intuitively minimizing $\widehat{N_v}$ for nodes with high $d_v$.
It is not clear to me why this other counter-intuitive objective is used: $\sum_{v\in V} \widehat{N_v}\cdot \frac{1}{d_v}$.
### Webgraph decompression impacts running time
I agree that referencing nodes can impact the running time (especially if there are chains of references). However, it is possible to use the WebGraph framework without using referencing, that is only using gap encoding with variable length instantaneous codes.
I think that doing that does not impact significantly the running time, does it?
### Typos and minors
- Section 4.2.: "There are exactly $Q$ bit vectors of length with n ones and the storage lower bound is $Q$ bits." -> the storage lower bound is $\log(Q)$ bits.

Some people discuss the paper (and troll the paper authors) on hackernews:
https://news.ycombinator.com/item?id=18081978

☆

1

The method uses random projections of the column vectors of the matrix $S=(\alpha_0 I+\alpha_1 A+\alpha_2 A^2+...+\alpha_q A^q)$. It is shown that such a random projection is a heuristics to minimize a relevant cost function (equation (2)) with some guarantees (Theorem 1).
As only the products of a set of random vectors $R$ with the matrix $S$ is needed, explicitly forming the (potentially dense) matrix $S$ is avoided: only the vectors $A^p\cdot R$ are computed. The running time is $O(n\cdot d^2+ m\cdot q\cdot d)$, where $m$ is the number of edges, $n$ is the number of nodes, $d$ is the dimensions of the embedding (and the number of random vectors in $R$) and $q$ is the order of the embedding (cf formula of $S$). The operations involved are very simple leading to a very fast practical algorithm.
It is shown in the experiments that these simple ideas lead to the most scalable embedding method and that the method is still accurate for some machine learning tasks such as classification, link prediction, and network reconstruction.
The executable in MATLAB is publicly available here: https://github.com/ZW-ZHANG/RandNE
An alternative open source implementation in C is available here: https://github.com/maxdan94/RandNE
### Concerns regarding Theorem 3
"Theorem 3. The time complexity of Algorithm 3 is linear with respect to the number of changed nodes and the number of changed edges respectively."
In the proof, the running time proven is in $O(M'\cdot R(q))$ (where $M'$ is the number of new or disappeared edges), the "$R(q)$" is the $q$-step neighborhood of an edge. We can have $R(q)=O(N)$ for some graphs especially in real-world graphs where the average distance between two nodes is very small (say less than 4) https://research.fb.com/three-and-a-half-degrees-of-separation/. The theorem is thus not correct without stating further assumptions.
In addition, in the experiments, the running time of the suggested algorithm for dynamic graphs (Algorithm 3) is not compared to the static algorithm (Algorithm 1) re-run from scratch after each update. It is not clear to me if the gain in time is significant.
### I do not understand 5, C, 1
"Network Reconstruction: The experimental setting is similar to moderate-scale networks (Section V-B2), i.e. we rank pairs of nodes according to their inner product similarities and reconstruct the network. We report the AUC scores in Table V."
There are approximately $\approx 10^{16}$ such pairs of nodes. This number of pairs of nodes is too large, I do not understand how the pairs can be ranked and the AUC computed.
### Parametters tuning
The method needs many parameters: $d$, $q$, $a_0$, $a_1$, ..., $a_q$. Whyle $d$ and $q$ can be arbitrary fixed to say $d=128$ and $q=3$ (as done in the paper). It is not clear what default values would fit for the $a_k$'s.
In the paper, the $a_k$'s are found using grid search. Note that the range of values and the steps are not specified in addition the optimal values are not given. This hurts reproducibility.
What if grid search is not possible (e.g. no train set)? Which default value would make sense?
### Grid search and unfair comparison
In the experiments, the suggested embedding is shown to lead to better results for some tasks (such as network reconstruction, classification and link prediction) compared to other embedding methods. However, the suggested method has parameters ($a_0$, $a_1$, $a_2$ and $a_3$ since $q$ is set to $3$). These parameters are tuned using grid search, while some of the parameters of the other methods seem to be kept to default. This may not be a very fair comparison.
### Real billion-scale networks?
I read "We also report the exact running time in Table
VII. It shows that RandNE can learn all the node embeddings
of WeChat within 7 hours with 16 normal servers, which is
promising for real billion-scale networks."
I do not understand what is meant. WeChat has 250M nodes and 4.8G edges. The number of edges is thus in the order of billions. "Real billion-scale networks" means more than 1G nodes and not edges?
Another graph with 1G edges is available here: http://snap.stanford.edu/data/com-Friendster.html
In addition, contrary to WeChat, nodes have some labels that could be used to test the relevance of the embedding.
Graphs with billion nodes are available here: http://law.di.unimi.it/datasets.php
Unfortunately, they are directed...
Another one is here: https://sparse.tamu.edu/Sybrandt/MOLIERE_2016 undirected, but weighted.
A twitter graph with 25G edges can be obtained upon request: https://hal.inria.fr/hal-00948889/en/
###Typos and minors:
- "as an theoretical guarantee"
- "an one-vs-all logistic regression"
- This relevant WWW2018 paper is not cited and the associated method is not used for comparisons: https://papers-gamma.link/paper/48

## Comments: