Thursday, December 23, 2010

Community Detection

Of all topics in general network theory, my favorite is community detection. For some reason I can't stop reading about new algorithms, analyzing them and trying them on my data.

So what is community detection? Every network can be subdivided into sub-units which are groups of closely connected nodes (or individuals). In animal populations, communities can be groups of animals, couples or even solitary individuals. As usual, the categorization is dependent on the definition of a connection in the network. In the past researchers were using their own intuition or very basic calculations in order to decide which individual belongs to each group. While in some species its simple and clear to make this decision, in others, like the hyrax, it's not trivial. Some hyraxes are clearly part of a group, while others are occasionally seen out of the group, or even with another group. In these cases, it's not easy to decide to what group they belong. This decision has many consequences, as any analysis at the group level, or comparison of individuals from different groups, will depend on the definition of group members.

Community detection algorithms try to solve this problem by taking into account the connections (edges) each node has with all other nodes. There are many algorithms out there, reviewed by Fortunato. I'm not going to exhaustively review them - I would like to mention 3 algorithms I found to do an accurate job, that are relatively easy to use. One thing that is important to the mathematicians who invent most of these algorithms is their complexity, or the time it will take to run. For animal researches this is usually not a real problem, as the numbers of individuals/relations are small. Therefore, I don't care if an algorithm is heavy on resources.

The first algorithm, and one of the most famous in the field, is called Girvan-Newman, and was published in PNAS in 2002. This algorithm computes the betweenness centrality of each edge, and then removes the edge with the highest betweenness, assuming that it connects communities and cannot be inside a community. In the next step the betweenness of all edges is recalculated, and again the edge with the highest value is removed. When the algorithm stops the result is a tree describing the levels of connectedness. Later, a stop mechanism was developed, based on a measure called modularity, to try and decide when should the algorithm stops if the "right" number of communities is unknown. Girvan-Newman can be calculated using UCINet. Although it performs very well, there is some critique on the use of modularity, and newer algorithms were developed.

The second algorithm is called CPM, or clique percolation method (published in Nature in 2005). This algorithm works in a different way, trying to connect more and more cliques of closely related nodes. In contrast to the Girvan-Newman algorithm, this one allows for community overlap, so that one node can be part of more than one community. It works by starting with cliques of a given size k, and then adding to them the members of other cliques that share k-1 nodes, until no more cliques can be added. One disadvantage is that you have to define k, or test which k gives the best results, which brings us back to the question of what is the best result. Another drawback is its inability to find communities of size 2, as cliques contain 3 or more members. There are versions of this algorithm that work on directed or weighted networks. A software named CFinder implements this algorithm.
A result of CPM
The last algorithm for today is called Link Communities and was published in Nature in 2010. This one also allows for community overlap. Its idea is that many nodes belong to more than one community (for example, a person might be part of his family, his workmates, his friends at the bar, etc), but the links between individuals usually represent only one kind of connection. The algorithm gets complex in finding the right links to join, and produces very accurate results. It has no resolution problem, meaning you can find any number of communities you wish, but calculates the best answer in terms of partition density. Rob Spencer from Scaled Innovation created a beautiful implementation, which shows the communities of each node, and the partition density function:

All of these algorithms performed well on my hyrax data, usually accurately identifying the "right" members of each group.
I am sure that the last word was not said yet, and new algorithms are about to be developed. It will be interesting to see if and when this field gets saturated, and what algorithms will "win" and become standard.

No comments:

Post a Comment