In this article I explain the implementation of a network of contacts I wrote for a website.

In this article I explain the implementation of a network of contacts I wrote for a website.

Introduction

A while back I had to implement a website that offered some social networking. You know, people could be *contacts*, could be *connected* if there was a path following contacts from one to the other, and *disconnected* otherwise. The set of people connected to some given user constituted his *network*. Nobody is contact of himself.

Of course we needed to be able to say whether two people were contacts or not, and to know how many contacts a given user has. But we also needed to be able to determine whether two people were connected, to be able to restrict searches to networks, and to efficiently know their sizes. There are websites out there offering this kind of stuff so the first thing I did was to Google around. I didn't find anything of help though, so this article tries to fill that gap.

Representing Contacts

Since we need to keep track of contacts there's a join table with them. As far as those
definitions is concerned it doesn't matter who invited to whom, so the relation *being contact* is *symmetric*. To simplify the SQL we specify commutativity in the database by doubling the entries in the contacts table. Thus, if *u* and *v* are contacts there are rows (*u.id*, *v.id*) and (*v.id*, *u.id*). This is not necessary technically, it is just a decission taken for convenience that may not be suitable in large sets.

With that table, knowing whether two people are contacts becomes a simple `SELECT`, and knowing how many contacts some user has becomes a simple `SELECT COUNT(*)`. We can also easily know whether two people who are not contacts are at distance 2 from each other, and obtain the ID of the intermediate contact if that's the case:

SELECT a.to_id FROM contacts a, contacts b WHERE a.from_id = #{u.id} AND b.to_id = #{v.id} AND a.to_id = b.from_id LIMIT 1

Note how the duplication of rows simplifies the SQL, we can think of following contacts in some kind of direction. You may offer tests for distance 2, and perhaps distance 3, but this quickly explodes, so that needs some care. Of course you may cache stuff and provide just approximations. LinkedIn for instance does not give the exact size of your connections at distance 2, at least that's my interpretation of the plus sign attached to their counter.

Representing Networks

The contacts relation gives naturally a graph where *u* and *v* are contacts iff there's an edge between their nodes. Since *being connected* is a *transitive* relation, it takes a moment to realize that **any two people who are connected have the same network**.

That's the **first key observation**. Suppose *w* belongs to the network of *u*, which is connected to *v*. By definition there is a path *P* that connects *w* and *u*. Since there is a path *Q* connecting *u* and *v*, the path *PQ* connects *w* and *v*, and so *w* belongs to the network of *v*.

If you visualize this in the graph, at any given moment contact networks are exactly the *connected components* of the graph of contacts. Take a moment to visualize those connected components. I called them *clusters* in the application.

Now the **second key observation** comes: it is trivial to maintain a cluster field in the users table as long as the graph *grows*. Suppose a new user *a* joins the website and becomes contact of user *u*. Right, assign to *a* the cluster of *u*. Suppose existing users *u* and *v* become contacts. Right, join the components if they do not belong already to the same cluster. That's a single SQL statement:

UPDATE USERS SET CLUSTER = '#{u.cluster}' WHERE CLUSTER = '#{v.cluster}'

And this is very important because social networks normally grow. It is rare that people break their edge, so the trade-off wins in mean. With this implementation it is trivial to know whether two people are connected, it translates to compare their cluster field. To compute the size of the network of some user translates to a simple `SELECT COUNT(*)` minus 1, and to restrict searches to the network of a given user you just put the cluster in some join.

If *u* breaks a relation with *v* then you need to pick either of them and recompute their *transitive closure*. That's the trade-off. Either they stay in the same cluster because they are still connected somehow, or either there's a split, that's expensive:

cluster_so_far = Set.new to_visit = Set.new([self.id]) while to_visit.size > 0 user_id = to_visit.entries[0] to_visit.delete(user_id) cluster_so_far << user_id contact_ids = [] rs = dbh.query("select to_id from contacts where from_id = #{user_id}") rs.each {|row| contact_ids << row[0]} rs.free to_visit += contact_ids.reject {|i| cluster_so_far.member?(i) || to_visit.member?(i)} end # Normally IN is faster than OR, and both faster than an UPDATE per identifier. cluster_so_far.to_a.each_slice(200) do |s| dbh.query("update users set cluster = '#{cluster}' where id in (#{s.join(',')})") end

Conclusions

This implementation of a network of contacts has been running in production for some time, albeit the website has just a few number of users and have no experience about its scalability. I figured this out on my own and makes sense to me, but I am sure there's room for improvement and extensions.

One topic you don't mention is how to find community structure within a social network. I've been reading about this lately, wondering if there would be some way to apply it to Advogato. Basically, a community within a site like Advogato is a collection of nodes which have more edges (or links) within the group than outside of it. The links could be Advogato's trust metric certs or the contact links you describe in your network. The traditional method of discovering this is the Clauset, Newman, Moore (CNM) algorithm. CNM has the disadvantage of being relatively slow and unusable on very large networks (400,000+ nodes). You can read details of CNM in their 2004 paper, Community structure in very large networks (PDF format). Last week, two Japanese researchers, Ken Wakita, Toshiyuki Tsurumi, released a paper describing why CNM is so slow and offering three faster CNM-based algorithms. The fastest seems suitable for networks as large as 5.5 million nodes and could process a network the size of Advogato (10k nodes) in seconds. Their paper is Finding Community Structure in Mega-scale Social Networks (PDF formats).

there's a related issue to this: tagging, and being able to search by tags. the sql to do it is ... well, let's just say that if you think you can use mysql to do it then you are still pissing your pants on your mama's knee.

i'll see if i can find the relevant links but for those who have a little more time than i right now it'll be in my diary entries dating back about one year.

**New HTML Parser**: The long-awaited libxml2 based HTML parser
code is live. It needs further work but already handles most
markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!