Very well written paper proposing an interesting algorithm to solve (lossless and lossy) graph summarization problems (cf problem definition). The algorithm is parallelizable in a shared memory environment and can be distributed using mapreduce. The theoretical analysis is sound and shows that the algorithm has a linear time complexity in practical scenarios. The experimental evaluation shows that the algorithm achieves a better trade-of between time and quality than existing approaches. In particular, the algorithm is faster than the greedy heuristic while leading to similar results. The proposed algorithm can be seen as a scalable implementation of the greedy heuristic: using shingle hashing of nodes allows to prune the search space and consider only some relevant pairs of nodes instead of considering all pairs of nodes (or all pairs of nodes sharing at least one neighbor). ### No theoretical approximation guarantees: The graph summarization problem is well defined and has an optimal solution. The proposed algorithm does not seem to have any theoretical approximation guarantees. According to the experimental evaluation, the quality of the output is similar to the one of the (straightforward) greedy heuristic, but we do not know how far from optimality it is. Is there any algorithm for the problem with some theoretical approximation guarantees? ### Absolute performance: While the performance relatively to other algorithms is good, the absolute performance of the algorithm is somehow disappointing. Figure 3: we see that the size of the output graphs is larger than 40% the size of the input graphs (in terms of (super) edges) for all graphs (and in many cases larger than 70%) except for the two web graphs (it is known that web graphs can be compressed efficiently (cf the webgraph framework)) and surprisingly a protein interaction graph (only 10% of edges are kept).
Great paper! ### Input graph in main memory It seems that to compute the suggested ordering (later used to compress the input graph), the graph needs to be stored in the main memory (as an adjacency list). However, if the graph already fits in the main memory of the machine, then compressing it is less interesting. Some experiments are carried on huge graphs that do not fit in the main memory of the machine if not compressed. The trick is that the web graphs are already compressed (maybe with the lexicographic url ordering) to compute the suggested ordering, while the considered social networks are actually not that large and fit in the main memory of a commodity machine. Footnote 21: "It is possible in principle to avoid keeping the graph in the main memory, but the cost becomes $O(n \log n)$.". How can I do that? ### Heuristic to minimize the average gap cost For social networks, it is shown that the compression is highly correlated to the average gap cost (average log gaps) if the "intervalisation" of the BV framework is turned off. The authors note that the suggested ordering is excellent at minimizing this average gap cost. And that even though it does not seem to minimize it directly. Can a heuristic that is explicitly designed to minimize this average gap cost lead to a better compression? ### Typos: - ref lacking: "label propagation [RAK07, ?]" - "until it is possible to do so" -> "until it is not possible to do so" - ref lacking: "Absolute Pott Model (APM) [?]" - "tecniques" - "Some simple experiments not reported here shows that the same happen" -> "Some simple experiments not reported here show that the same happens"