Node similarity as a basic principle behind connectivity in complex networks

How are people linked in a highly connected society? Since a power-law (scale-free) node-degree distribution can be observed in many networks, the power-law might be seen as a universal characteristic of networks. But this study of communication in the Flickr online social network reveals that power-law node-degree distributions are restricted to sparsely connected networks. More densely connected networks, by contrast, show an increasing divergence from power-law. This work shows that this observation is consistent with the classic idea from the social sciences that similarity is the driving factor behind communication in social networks. The strong relation between communication strength and node similarity is confirmed by analyzing the Flickr network. It is also shown that node similarity as a network formation model can reproduce the characteristics of different network densities and hence can be used as a model for describing the topological transition from weakly to strongly connected societies.


I INTRODUCTION
In an increasingly interconnected world, it is of great interest to understand the topology of a highly connected society; this is important, for example, for predicting the spread of epidemic diseases [Pastor-Satorras and Vespignani, 2001]. A basic measure for describing network topologies is the distribution of the number of links per network node. Many real networks show a node-degree distribution that approximately follows a power-law, a right-skewed heavy-tailed distribution also known as a scale-free distribution [Barabási and Albert, 1999]. But other real networks show a truncated power-law or even an exponentially shaped node-degree distribution [Amaral et al., 2000, Newman, 2001, Jeong et al., 2001]. To investigate network topologies it is essential to understand the basic principles behind connectivity or, more precisely, the process of network formation. Most current models focus on reproducing a power-law (scale-free) network topology. The most popular is a network growth model based on the idea of preferential attachment: new nodes prefer to link to existing highly connected nodes [de Solla Price, 1976, Barabási and Albert, 1999]. But a high node-degree may be the result rather than the cause of connectivity, as shown by other models of network formation, including the node copying model [Kleinberg et al., 1999] and the fitness model [Caldarelli et al., 2002, Servedio et al., 2004]. Even though most models reproduce a power-law distribution quite well, they do not explain the frequently observed divergences from power-law.

Figure 1: For each pair in a randomly chosen set of 10,000 Flickr® users, the number of identically used keywords (tags) is set into relation with the pair-wise communication strength. The histogram shows the mean communication strength of all pairs within intervals of 100 keywords. This clearly confirms that a higher number of identical keywords is strongly related to higher communication.
The social sciences have a long history of explaining social communication and interaction, and a large body of literature from this field suggests that similarity (homophily) is the major factor for connectivity in social networks, as reviewed, for example, by McPherson et al. [2001]. People tend to associate with those sharing similar interests, tastes, beliefs, social backgrounds, and also similar popularity. This is often expressed by the adage 'Birds of a feather flock together'. Recent analysis of mobile phone data further confirms that communication is strongly related to geographic distance. There is a higher chance of people calling each other if they live closer to each other (similar location). The total amount of communication between two cities depends on their distance and population size, which can be well described by a gravitation model [Lambiotte et al., 2008, Krings et al., 2009]. In biology, interactions between proteins or other molecules require an exact fit or complementarity of their complex surfaces, which can be treated as synonymous with similarity in the context of connectivity. For communication and interaction, space and time are often the dominant factors. 'To be in the right place at the right time' often works as the basic principle for getting connected, but besides fitting in space and time, additional properties are important: for instance, similar surfaces of molecules or similar interests of people. In mobile phone networks, it can be shown that factors other than geographic distance influence communication, e.g., language [Lambiotte et al., 2008]. Such additional factors become even more important in virtual communities, in which geographic distance does not matter and written communication does not require the networked partner to be present at the same time. In information networks, location and time are also not the dominant factors. In general, articles are linked because of similar topics, scientific citations have a strong relevance to the author's work, and websites are mostly linked to websites of similar content [Flake et al., 2002].
Online social networks are an ideal source for investigating complex networks because of the often huge number of users, their link and communication profiles, and the availability of additional metadata such as tags (keywords). Several recent studies confirm the impact of similarity on links in online social networks by analyzing tag (keyword) metadata between users [Marlow et al., 2006, Aiello et al., 2010]. But most studies focus on an unweighted contact (declared friends) network structure. By contrast, this study analyzes communication strength between users. This provides a more precise description of user interactions in terms of weighted links or contact intensities, which is useful for analyzing the transition from sparsely connected to densely connected networks. In the first step, by analyzing the Flickr® online social network, this study shows that communication strength is directly related to tag (keyword) similarity. In the second step, the Flickr® network is used to analyze different network densities. It turns out that more densely connected networks show an increasing divergence from the power-law distribution. This characteristic can be reproduced by a network formation model based on similarity, as shown by the Euclidean distance model proposed in this work.

II SIMILARITY IN THE FLICKR NETWORK
Flickr® is an online photo sharing community. Here, we analyze how users interact and communicate by commenting on photos of other users. Data were collected in 2009 by using the application programming interface (API) to the Flickr® database at https://www.flickr.com/services/api/ . The number of comments from one user A to another user B is used to define the strength of communication, and hence gives the weight of the link between A and B. In this study, similarity between two users is based on the keywords (tags) people use to describe their photos. The similarity is defined as the number of identical tags of two users: the size of the intersection of the tag sets of users A and B taken over all of their photos. People who use the same keywords are assumed to have similar photographic interests which, in turn, may lead to communication. Setting the number of identical keywords into relation with the number of comments between two individuals, as shown in Figure 1, reveals a clear dependency between similarity and communication strength. The intensity of communication between two individuals is strongly related to the number of identically used keywords, thereby confirming empirically that communication strength depends on similar interests of individuals.
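The two quantities defined above, and the binning used for Figure 1, can be illustrated with a minimal Python sketch. It is not the code used in this study. The input layout (`user_tags`, `comments`) and the function names are hypothetical, and summing the comments in both directions to obtain one pair-wise strength is an assumption, since the text defines strength per direction.

```python
from collections import defaultdict


def tag_similarity(tags_a, tags_b):
    """Number of identical tags: size of the intersection of two tag sets."""
    return len(set(tags_a) & set(tags_b))


def mean_strength_by_similarity(user_tags, comments, bin_width=100):
    """Mean comment count per user pair, grouped by number of shared tags.

    user_tags: dict user -> iterable of tags (hypothetical input layout)
    comments:  dict (commenter, owner) -> number of comments (hypothetical)
    Returns a dict mapping the lower edge of each similarity bin to the
    mean communication strength of all pairs falling into that bin.
    """
    users = sorted(user_tags)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            sim = tag_similarity(user_tags[a], user_tags[b])
            # Assumption: pair strength = comments in both directions, summed.
            strength = comments.get((a, b), 0) + comments.get((b, a), 0)
            bin_start = (sim // bin_width) * bin_width
            totals[bin_start] += strength
            counts[bin_start] += 1
    return {b: totals[b] / counts[b] for b in sorted(counts)}
```

With `bin_width=100` the returned means correspond to the intervals of 100 keywords described for Figure 1. The loop visits every pair of users, so it is only practical for a sample of users such as the 10,000 used there.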

III FROM SPARSE TO DENSE NETWORKS
In order to investigate how node-degree distributions depend on network density, the difference between sparsely and densely connected topologies is analyzed. Since most networks are rather sparsely connected, including the Flickr® network as a whole, a more densely connected subset of Flickr® is chosen as an example: the Flickr group 'Light Painters Society' (id:1066685@N25) with 6,036 members (nodes). By using different thresholds for the number of comments required for a link to be accepted, the overall connectivity can be varied from sparsely to densely connected networks. Figure 2 shows the in-degree distribution counting only strong links (at least 20 comments), medium-weighted links (at least 2, 3, or 6 comments), and all links, including very weak ones (at least one comment). It reveals that only a sparsely connected network shows the typical scale-free, power-law like distribution. Densely connected networks, by contrast, show a distribution that is clearly distinct from power-law.
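The thresholding step can be sketched in a few lines of Python. This is an illustration under assumptions, not the analysis code of this study: the `comments` dictionary layout and the function name are hypothetical, and ignoring comments on one's own photos is an assumption.

```python
from collections import Counter


def in_degree_distribution(comments, min_comments):
    """In-degree histogram keeping only links with >= min_comments comments.

    comments: dict (commenter, owner) -> number of comments (hypothetical)
    Returns a Counter mapping in-degree k -> number of nodes with that
    in-degree. Nodes without any accepted incoming link are not listed.
    """
    in_degree = Counter()
    for (source, target), n in comments.items():
        # Assumption: comments on one's own photos do not count as links.
        if source != target and n >= min_comments:
            in_degree[target] += 1
    return Counter(in_degree.values())
```

Calling the function with `min_comments` set to 20, 6, 3, 2, and 1 on the same data gives one distribution per threshold, moving from the sparsest to the densest network in the way described for Figure 2.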

THE NODE SIMILARITY MODEL
The observed characteristics of real networks can be reproduced by a simple similarity model based on Euclidean distance in purely random data. This is demonstrated by artificially generating a network from a 100 × 8000 matrix X of normally distributed random data, corresponding to m = 100 properties and N = 8000 network nodes. Two nodes $x_i$ and $x_j$ are defined as connected if their Euclidean distance $d = \lVert x_i - x_j \rVert$ is below a certain threshold. Increasing this threshold changes the network density from sparsely to densely connected. As shown in Figure 3, a similarity model generates the same shapes of node-degree distributions as observed in the real network (Figure 2). A MATLAB® implementation of the similarity model is available at: http://www.network-science.org/similaritymodel.html .
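For readers without MATLAB, the model can be sketched in Python with NumPy as follows. This is an independent illustration and not a port of the implementation linked above. The function name, the block-wise computation, the use of standard normal entries, and the threshold values in the example call are all assumptions. The data matrix is stored with one row per node, i.e., transposed relative to the 100 × 8000 layout described in the text.

```python
import numpy as np


def similarity_model_degrees(n_nodes, n_props, threshold, seed=0, block=1000):
    """Node degrees of a network linking nodes closer than `threshold`.

    Each node is a row of a random matrix with standard normal entries
    (assumption: zero mean, unit variance). Two nodes are linked if their
    Euclidean distance is below `threshold`.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_nodes, n_props))
    sq_norms = np.einsum("ij,ij->i", x, x)  # squared length of each row
    degrees = np.zeros(n_nodes, dtype=np.int64)
    # Process rows in blocks so the full N x N distance matrix is never stored.
    for start in range(0, n_nodes, block):
        stop = min(start + block, n_nodes)
        # Squared distances via |a|^2 + |b|^2 - 2 a.b
        sq_dist = (sq_norms[start:stop, None] + sq_norms[None, :]
                   - 2.0 * (x[start:stop] @ x.T))
        linked = sq_dist < threshold ** 2
        # A node is not linked to itself.
        linked[np.arange(stop - start), np.arange(start, stop)] = False
        degrees[start:stop] = linked.sum(axis=1)
    return degrees


if __name__ == "__main__":
    # Smaller than the N = 8000 used in the text, to keep the run short.
    for t in (12.0, 13.0, 14.0, 15.0):  # illustrative thresholds only
        k = similarity_model_degrees(2000, 100, t)
        print(f"threshold {t}: mean degree {k.mean():.1f}")
```

A histogram of the returned degrees for each threshold (for example with `np.bincount`) gives the node-degree distribution at that density. Because the same seed produces the same data, raising the threshold can only add links, never remove them.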

BENEFITS OF A SIMILARITY BASED MODEL
Besides the relation shown between similarity and connectivity strength, there are a number of other points that indicate that a similarity model is an appropriate and natural way to describe real complex networks:
a) Most network formation models are developed to reproduce only power-law distributions such as in Figure 2A. Thus, they cannot explain node-degree distributions distinct from a power-law as in Figure 2C to E. A similarity model, by contrast, naturally covers the full observed diversity from power-law to non-power-law distributions.
b) A similarity model does not depend on dynamics in network size such as an increase or decrease in the total number of network nodes. It therefore works in situations of network growth as well as shrinkage, or even for pure reorganization of links in a network of constant size. Since power-law like distributions can also be reproduced by the similarity model, the observed power-law characteristics of real networks are not necessarily a result of network growth.
c) Because similarity is usually undirected, it is a natural model for undirected networks in which connections are induced from both sides, as in social networks. But similarity also works in a directed manner when additional factors, such as time in a growth model (e.g., a citation network), enforce directed relations.
d) Similarity does not require global knowledge, such as the node-degree, about all network nodes. Similarity refers only to the local environment of people in real physical as well as in virtual communication worlds. People who live in the same place, engage in similar activities, or are members of online communities meet each other and connect according to their similar behaviors and interests. Global knowledge about all people is not necessary.
e) A similarity model explains the topological transition from sparsely to densely or even completely connected networks, which a pure power-law model does not. Completely connected networks, in which each node is connected to every other node, do not follow a power-law distribution; instead, all N nodes have the same maximum degree, k = N − 1. Thus, with increasing connectivity there must be a transition from the power-law topologies (Fig. 3A) of sparsely connected networks to the peaked distributions (Fig. 4E) of completely connected networks. A similarity model can describe such a transition from sparsely to densely connected networks, as shown in Figure 3, and, in addition, to completely connected networks (Fig. 4). For almost completely connected networks the similarity model predicts a left-skewed distribution inverse to the power-law, in which most nodes have a high degree and only a few nodes have a low degree.

Figure caption (fragment): In almost completely connected networks (D and E) the node-degree distribution appears as an inverse power-law: most nodes have a high degree whereas only a few nodes have a low degree.
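Why the Euclidean distance model produces a few highly connected nodes at low thresholds and a few weakly connected nodes at high thresholds can be made plausible with a short calculation. It is a sketch under an assumption not stated above: the entries of X are taken to be independent standard normal values, whereas the text only says they are normally distributed. For a fixed node $x_i$ and a randomly drawn other node $x_j$ with m properties:

```latex
\begin{align}
\mathbb{E}\left[\lVert x_i - x_j\rVert^2 \,\middle|\, x_i\right]
  &= \lVert x_i\rVert^2 - 2\,x_i^{\top}\,\mathbb{E}[x_j]
     + \mathbb{E}\left[\lVert x_j\rVert^2\right]
   = \lVert x_i\rVert^2 + m, \\
\mathbb{E}\left[\lVert x_i - x_j\rVert^2\right]
  &= 2m = 200 \qquad (m = 100).
\end{align}
```

Under this assumption typical distances are close to $\sqrt{2m} \approx 14.1$, so thresholds well below this value give sparse networks and thresholds well above it give almost completely connected ones. The first line also suggests that nodes with an unusually large $\lVert x_i\rVert$ are on average farther from all other nodes, which would be consistent with the few low-degree nodes that remain in almost completely connected networks.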

CONCLUSION
This work demonstrates that the frequently observed scale-free power-law distribution can be well reproduced by a model that is based purely on the idea of node similarity. Since similarity is independent of dynamics in network size such as growth or shrinkage, the observed power-law of real networks is not necessarily caused by the growth of networks. In addition, a similarity model shows that the frequently observed distributions distinct from power-law are a characteristic of more densely connected networks. This means that the differences we can observe in node-degree distributions of real networks are mainly determined by their overall link density: whereas the typical sparsely connected networks show power-law distributions, densely connected networks show non-power-law distributions. This can be further extended to almost completely connected networks, as can be found in a family or a small village in which everyone knows everyone else. While in sparsely connected power-law networks most nodes have a low number of links and only a few are highly linked, almost completely connected networks show the opposite: most nodes have a high or even maximum degree and only a few nodes have lower degrees. These less connected nodes may represent outsiders in an almost completely connected clique. Since a similarity model explains the entire topological transition from sparsely to densely connected networks, it is able to explain the transition from weakly connected to highly connected societies.

ACKNOWLEDGMENTS
Funding was provided by the German Ministry for Science and Research (BMBF) within the program Entrepreneurial Regions: Competence Centers, under code 03 ZIK 011. I would like to thank the Department of Mathematics and Informatics of Ernst-Moritz-Arndt-University for providing computational facilities and Martin J. Fraunholz for valuable discussions and critically reading the manuscript.