Who Is Doing It?
A research on Libre Software developers

Fachgebiet für Informatik und Gesellschaft
TU-Berlin

August 2001


Gregorio Robles <grex at scouts-es.org>,
Hendrik Scheider <hendrik at missing-piece.de>,
Ingo Tretkowski <it at ocit.de>
and Niels Weber <nath at snafu.de>

Version: 27.08.2001 21:30 CET

Original: http://widi.berlios.de/paper/study.html

License (GFDL)

Abstract

This paper documents the first comprehensive empirical research on the geographical distribution and personal background of Libre Software developers. While preceding studies either relied on a single tool or examined only a single project, such as the Linux kernel, we combine a multitude of sources: the application Codd, which automatically examines source code to determine its authors, and an online survey that gathered personal information never collected before. In addition, we examined a great variety of software projects by accessing an online repository and by including Linux distributions with the highest rate of development activity. As a result, we were able to dispel most of the prejudices surrounding the personal background of Libre Software developers, and to show that European developers rank very well on a global scale: they are in fact as numerous as their North American colleagues.

A preliminary note on the use of the term "Libre Software"

One of the fiercest debates fought out on the Internet among the best-known promoters of patent-free software development found its way to the authors of this paper. It is the conflict between different flavors of copylefted software, as represented by the terms "Free Software" and "Open Source". We do not want to distract from the important statements of this study by triggering polemics in this heated dispute, and deciding which term to quote first at each and every occurrence would have taken too much time.

This is why we decided to adhere to this new term: "Libre Software". As explained on The Site for Libre Software Developers [http://libre.act-europe.fr/]: "Libre Software is the European term for free software, a term coined by Richard Stallman to denote the users' freedom to run, copy, distribute, study, change and improve the software. We have chosen to use the word "Libre" over "Free" because it avoids the "free beer" confusion. As Richard Stallman puts it: "Free Software" is a matter of liberty, not price. [...] you should think of "free speech", not "free beer"." The authors of this paper fully commit to these ideas and have hence adopted the term for this paper.

2. Introduction & Research goals

This study has been conceived and executed by a research group at the Technical University of Berlin [http://www.tu-berlin.de]. It is part of the current research effort at the University's department "Informatik und Gesellschaft" (Computer Science and Society) [http://ig.cs.tu-berlin.de], headed by Prof. Dr. iur. Lutterbeck at the Computer Science Faculty.

Libre Software development follows a completely different paradigm from the one on which the traditional software development model is based. This new development method invites anybody who is sufficiently interested to participate in a project. The tools that facilitate this cooperative work have even been improved in recent years, bringing substantial productivity gains.

Nevertheless, while the process of Libre Software development has been refined in recent years, the information we have about its participants is very vague, not to say almost non-existent. For that reason, the present study investigates the personal environment of Libre Software developers.

We have to keep in mind that for a considerable period of time Libre Software applications have been able to prove their quality without any knowledge of the people dedicated to producing them. This clearly demonstrates that the information we are going to put together is not necessary to successfully create efficient software. When a developer wants to participate in a Libre Software project, nobody will ask him for his nationality, his profession or his age. If he wants to contribute in a constructive way and has the necessary knowledge or skills, or even if he is only on the way to acquiring them, he is invited to do so. We say this to point out that Libre Software development has been carried out without considering the aspects this study is all about: the personal information. That is good, and it has to remain so.

But once that is understood, we have to explain why it is important to know more about the developers. There are, as we see it, four main reasons:

Firstly, as far as the current software engineering process is concerned, the more we know about developers, the more information we have about them, the more accurate the theoretical model of that process will be. Such models, which enable precise metrics of development speed, application quality, or immanent problems during development, are indispensable if a similar approach to creating applications is to be taken by the "classical" software companies, and will also help to decrease unnecessary redundancy in or failures of Libre Software projects.

Secondly, we observed an overwhelming interest in the study's subject among the developers we questioned. Either they simply like to talk about themselves and their work when asked, or they do have an interest in knowing more about the community they are part of. This alone would be a sufficiently interesting question.

Thirdly, our preliminary results in June created considerable political interest in our study, due to the presumed impact of Libre Software on the software development market. Most attention was in fact paid to the question of the geographical distribution of developers. Yet the phenomenon should be of interest on a much larger scientific scale, for instance to sociology (what leads highly skilled people to participate in these voluntary projects?), business administration (what might be the effect on a company's productivity if its developers commit to these projects or, going further, how could companies exploit those applications for their own profit?) or economics (since, as we stated earlier, it is no longer a marginal phenomenon and numbers can be computed on a national and international scale).

And last but not least, with this study we also tried to put an end to many speculations that surround Libre Software developers. "It is all communists/anarchists", "They do not like their work", "They are teenagers, asocials, geeks": these are well-known prejudices which do not hold, as a quick glance at our results is likely to prove.

3. Research method

This study is in fact only a first step towards a closer look at developers. Though it was at first conceived to unveil the geographical distribution of Libre Software developers, the Widi survey has provided interesting information that goes beyond this question and is likely to draw attention to the topic. To simplify further research, all the tools used are Libre Software themselves. This is important, since independent research teams can thus confirm, cite and even improve the results obtained. In addition, all the sources consulted are easily accessible and thus verifiable. We are sure that we have not been able to analyze all the data; our main effort went into gathering as much as possible of the available data, which is spread over the Internet. We lack the statistical knowledge as well as the sociological or commercial experience to analyze the numbers thoroughly. We invite everyone who wishes to draw her own conclusions from our data to do so.

The most important problem we have encountered is the distributed development of most Libre Software. Developers are located all around the world and contribute to one or more projects; some of them work only on a small part of a project, and quite often they add only a few lines of code. We tried a four-way approach to get the most detailed and precise data about the developers: the SourceWell index, the Debian developer database, the source code analysis tool Codd and the Widi online survey.

For detailed information about each data source please refer to the related section.

Not only did we hope to complete the picture by accessing different sources with a different spectrum of information, but also to minimize errors by redundant data from each source.

4. The information sources considered

In the next sections we will briefly introduce all the methods used in this research, pointing out their strengths and weaknesses. The chapter is concluded by a table summarizing those characteristics.

4.1. SourceWell

SourceWell is an index of Libre Software applications. It is part of the BerliOS platform [http://www.berlios.de], which tries to promote Libre Software mainly in Germany and Europe, much as the Open Source Developer Network [http://www.osdn.net] does elsewhere.

SourceWell has a database with more than a thousand Libre Software applications. In addition to the application name, its version number, a brief description and references for downloading it, the SourceWell database contains the name and the e-mail address of the application's main author (or main maintainer), thus allowing easy geographical categorization of developers using their e-mail domains. Nevertheless, the system has some weak points.

The most important flaw is due to the system of project management, where only the maintainer of a project is listed. Although there might be a large number of contributors, all the lines of code are attributed to the maintainer, thus distorting statistics on developer activity. The best example is the Linux kernel: in the SourceWell database, several branches of Linux are listed. Although it is known and proven that thousands of developers around the globe have participated in the development of Linux, it appears as if it had been developed by three authors: Linus Torvalds, Alan Cox and Andrea Arcangeli. If Linux were not so important, possibly only Linus Torvalds would appear, as the other branches would not be listed. Yet in this respect Linux is an exceptional case, as projects, be they big or small (in terms of developer participation), are normally treated alike.

Secondly, one problem is common to all methods where conclusions are drawn from e-mail top-level domains. The class of generic top-level domains (gTLD), e.g. ".com", ".org", etc., is not country coded. The times when those domains were limited to the market of the United States of America are long gone. As 50% of all database entries contain a gTLD, this is a considerable cause of fuzziness. We will see later how we try to mitigate this problem with the help of the Codd database.
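To make the ccTLD/gTLD distinction concrete, the following is a minimal sketch of how e-mail addresses could be sorted by top-level domain. It is not the code used by SourceWell or by this study; the function names `classify_email` and `tally`, the short list of generic domains and the example addresses are assumptions made for illustration only.

```python
from collections import Counter

# Assumption: a short, illustrative list of generic TLDs; a real list would be longer.
GENERIC_TLDS = frozenset({"com", "org", "net", "edu", "gov", "mil", "int"})


def classify_email(address):
    """Return ("country", cc), ("generic", tld) or ("unknown", tld)."""
    domain = address.strip().lower().rsplit("@", 1)[-1]
    tld = domain.rsplit(".", 1)[-1]
    if tld in GENERIC_TLDS:
        return ("generic", tld)
    # Two-letter domains are treated as country codes (ccTLDs).
    if len(tld) == 2 and tld.isalpha():
        return ("country", tld)
    return ("unknown", tld)


def tally(addresses):
    """Count addresses per country code; lump gTLDs and unknowns together."""
    counts = Counter()
    for address in addresses:
        kind, tld = classify_email(address)
        counts[tld if kind == "country" else kind] += 1
    return counts


if __name__ == "__main__":
    sample = ["a@example.de", "b@example.de", "c@example.com", "d@example.es"]
    print(tally(sample))
```

In such a tally the "generic" bucket is exactly the share of entries that cannot be assigned to a country without further information.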

Finally, we must not forget that SourceWell contains only about one thousand Libre Software applications, which is little compared to an estimated fifty thousand projects.

To conclude, it might seem astonishing to the reader that, despite those disadvantages and inaccuracies, the results we obtained from SourceWell are quite similar to those of the other sources. We see in this not only a validation of the SourceWell statistics, but also a good reason for our multi-source approach.

4.2. Debian Database

Debian is one of the most popular GNU/Linux distributions, an assumption that is confirmed later by the results of the Widi survey. It is a non-profit organisation that is managed and maintained by a group of volunteers. Presently, they count more than 700 developers worldwide. But what makes Debian a valuable data source is not only the international composition of its contributors, but also the "admission process" which everyone who wishes to participate has to pass. As a consequence the personal data is highly valid, and we do not have to cope with the fuzziness of gTLDs mentioned earlier with regard to SourceWell.

This "admission process" [http://nm.debian.org/] is intended to verify that the interested person is familiar with the Debian Guidelines [http://www.debian.org/social_contract.html#guidelines], which present the philosophical basis of the group's work. The fact that this admission process is held in English only is not an impediment to the international composition of the developer team, as we will see later from the results of our Widi survey that the great majority of Libre Software developers speak English.

While high validity is the main asset of this data source, which is publicly available on db.debian.org, there are some negative aspects. One of them is the poor resolution of developer contribution: the database does not reflect how much effort can be attributed to the individual programmer. So we are obliged to treat all of them as if they maintained the same number of packages.

Furthermore, the Debian database does not contain the nationality of its members, but only the country of current residence. As the Widi survey demonstrates, this is not the same. Yet there is no way to correct this shift.

Finally, the nature of the "admission process" [Debian New Maintainers' Corner: http://www.debian.org/devel/join/newmaint] favors local network effects, even though it is open to anybody who wants to participate in the project. Here is why: each applicant has to contact two developers who are already members, an "advocate" and a "sponsor". The so-called advocate recommends the applicant to the Debian community, while the sponsor examines the code that the applicant wants to contribute. This arrangement clearly favors regions with a higher density of Debian developers, because it is easier for an applicant to find his intercessors there.

The authors of this study contacted several Debian developers to see whether they could provide more data that we could present and contrast with the other sources of this study. Interesting data would have been nationality, age, profession and so on. To their own regret, however, they replied that they did not have such data. We hope that this study shows how interesting it is to have this data.

We also tried to contact other big Libre Software projects in order to complete this research. We had a special interest in the FreeBSD, NetBSD and OpenBSD projects, intending to avoid focusing solely on the GNU/Linux world. Unfortunately, they did not have such data, although they commented that it would be interesting to have it.

4.3. Codd

Codd is an application that examines any source code and tries to assign shares of the code to the programmers who have developed it. The program was developed by Vipul Ved Prakash after a debate with Rishab Aiyer Ghosh about cooking pot markets (following the publication of an article by Ghosh on this topic in First Monday ["Cooking pot markets: an economic model for the trade in...", Rishab Aiyer Ghosh, http://www.firstmonday.dk/issues/issue3_3/ghosh/]).

Codd makes it possible to investigate large amounts of source code automatically; more than one gigabyte has been processed for this study. This amount ensures a good validity of the results obtained.

Codd's functionality can easily be divided into two steps:

i) To begin, Codd decompresses the source archives (such as rpm, tar.gz or tgz packages) into a temporary directory and performs a simple pattern-matching search on the source code to find authorship notes, which might be indicated by the word "copyright" or the copyright sign. The number of bytes of the examined file is then attributed to the entity found in proximity to the copyright notice; that might be a name, a nickname or an e-mail address.

ii) Secondly, Codd tries to identify those entities by means of a database which holds the names, e-mail addresses and aliases of hundreds of developers. In case a developer has "marked" one package with his e-mail address and another one with his nickname, the correct number of bytes, which is the sum of the two packages, will now be accounted for.
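The two steps can be sketched roughly as follows. This is not Codd's actual code: the regular expression, the function names (`find_holder`, `attribute_bytes`, `merge_aliases`) and the alias table are hypothetical, and the sketch simply assumes that the copyright holder is whatever follows the word "copyright" (plus an optional sign and year) on the same line.

```python
import re
from collections import defaultdict

# Assumption: holder = rest of the line after "Copyright", optional (C)/©, optional years.
COPYRIGHT_RE = re.compile(
    r"(?:copyright|©)[ \t]*(?:\(c\)|©)?[ \t]*"
    r"(?:\d{4}(?:[ \t]*[-,][ \t]*\d{4})*)?[ \t,]*([^\n]+)",
    re.IGNORECASE,
)


def find_holder(source_text):
    """Step i: return the first entity named in a copyright notice, or None."""
    match = COPYRIGHT_RE.search(source_text)
    if match is None:
        return None
    # Drop trailing comment markers such as "*/".
    return match.group(1).strip(" \t*/.")


def attribute_bytes(files):
    """Step i: credit each file's size in bytes to the holder found in it.

    files maps a file name to its source text.
    """
    totals = defaultdict(int)
    for text in files.values():
        holder = find_holder(text)
        if holder is not None:
            totals[holder] += len(text.encode("utf-8"))
    return dict(totals)


def merge_aliases(totals, aliases):
    """Step ii: sum the bytes of all aliases under one canonical name.

    aliases maps a lower-cased alias (nickname, e-mail) to a canonical name.
    """
    merged = defaultdict(int)
    for holder, size in totals.items():
        merged[aliases.get(holder.lower(), holder)] += size
    return dict(merged)


if __name__ == "__main__":
    files = {
        "a.c": "/* Copyright (C) 2001 Jane Doe */\nint x;\n",
        "b.py": "# Copyright 2000 jdoe\nprint()\n",
    }
    print(merge_aliases(attribute_bytes(files), {"jdoe": "Jane Doe"}))
```

In this sketch a file without any recognizable notice is simply left unattributed.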

To obtain a geographical distribution, we interpreted any ccTLD as the developer's nationality. But as we stated earlier, more than 50% of the e-mail addresses encountered have a gTLD. This required manual searches by means of popular Internet search engines to track down the nationality of the respective developers. This means that either we found her personal homepage, where she noted her origin, or we were able to match the e-mail address with an alias that has a ccTLD, which would then be interpreted as nationality.

Yet another major crux of the Codd application remains, and it is inherent in the heterogeneous nature of Libre Software development: there is no widely accepted standard for marking authorship of code in Libre Software applications. Each project, even each developer, has its own rules or conventions for noting contributions, which renders an adaptation of Codd's algorithm to the different circumstances almost impossible.

In addition, Codd does not keep track of packages it has already examined, which is in fact a difficult task, as different versions of the same package would also have to be taken into account. As a result, Codd might assign the same number of bytes several times to an author. This effect is especially unfortunate in cases where libraries are bundled with an application; those libraries are processed and accounted for each and every time they are used by other applications.
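One possible way to reduce this double counting, shown here only as a hypothetical sketch and not as something Codd does, is to remember a digest of every file already processed and to skip byte-identical copies. The function name `unique_files` and the example paths are made up for illustration.

```python
import hashlib


def unique_files(files):
    """Yield (name, data) only for file contents that have not been seen before.

    files is an iterable of (name, bytes) pairs.
    """
    seen = set()
    for name, data in files:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # identical copy, e.g. a library bundled with a second application
        seen.add(digest)
        yield name, data


if __name__ == "__main__":
    files = [
        ("app1/lib/z.c", b"int z;"),
        ("app2/lib/z.c", b"int z;"),
        ("app2/main.c", b"int main(){}"),
    ]
    print([name for name, _ in unique_files(files)])
```

This catches only exact duplicates; different versions of the same package would still be counted separately, which is the harder part of the problem described above.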

4.4. Widi

Widi is the part of the study in which we tried to reach the developers directly and asked them to fill out an online form. Compared to the other data sources this is the most vivid part. However, we were quite dependent on the number of developers taking part in our survey. To get as much attention as possible, we prepared a press release and posted it to several open-source related mailing lists and web communities. For detailed information about the press release and the sites contacted please refer to sections 4.4.2 and 4.4.3.

4.4.1 The Questionnaire

The questions that make up the Widi survey were carefully considered and intensely discussed. In the end, we came up with three different sections: personal data, professional data and computer skills.

Personal data

In this section, the fields for the participant's name that were initially included were quickly removed, as they would not have added any relevant information to the results, but instead would have compromised anonymity and our ability to publish the database at the end of the survey. An optional field for the participant's preferred nickname is a lot closer to the target group's culture of pseudonymity. We have not yet decided, however, whether nicknames will be published in the released version of the database.

The following two questions make up an interesting pair. First, obtaining the nationality of Libre Software developers is one of the main aims of our study, as presented in earlier sections. Beyond that, we also wanted to obtain data on the developers' migratory flows. As the reader will clearly see in the resulting statistics, the numbers for nationality and country of residence deviate considerably. Debian, for example, only provides one of those details: country of residence. It might be interesting to correlate both results.

In the comments posted to the Slashdot forum, one of the community's loudest voices, some readers noted that the country listing did not follow any logical order. Actually, there was a logical order, but not one we had intended: we took the table from the Free Software application "Les Visiteurs" [http://www.phpinfo.net/applis/visiteurs/]. There, the countries are listed together with, and sorted by, their respective ccTLD.

By asking for the year of birth, we wanted to quantify the impression that a rather young public attends Libre Software meetings. The sex of the participant is probably the question with the least uncertain outcome, yet one should always expect the unexpected...

The following field invited developers to reveal the domain of their e-mail address. With the data obtained we pursued two objectives: firstly, we wanted to correlate nationality with country of residence and e-mail domain, and secondly, we wanted to compare those results to the numbers obtained from the other sources of this study which are partially or totally based on the information given by the e-mail domain: SourceWell and Codd.

An interesting and at the same time vital question regarding migration and the ability to communicate is the following one, which asks for the languages the participant speaks. To what extent has English become the predominant language in the developers' community? And how many individuals are excluded by Debian's "admission process" being conducted only in English? The number of spoken languages is also an indicator of the participants' intellectual level. Following repeated demand on the survey's comment board, we added Latin and Classical Greek to the list to satisfy the community's humor, which is important to encourage participants to finish the questionnaire; yet no one ever chose either of those languages.

Professional data

This block is initiated by a question on the participant's profession. We tried to distinguish jobs in the IT-sector from the rest. Unfortunately we forgot to include options for those working in marketing, management, and product sales, but complaints reached us too late; so we offer our excuses to those who had to wedge in another category.

Technical qualification in the Libre Software community is never based on academic attainments, but exclusively on peer review of the code the developer has written. Nevertheless, the tools and technologies used and the applications produced as Libre Software are essentially the same as those of the top companies of the sector. Thus one of the prejudices concerning Libre Software developers, namely that they are "undergraduate whiz kids", was to be confirmed or refuted by this question.

Continuing our crusade against prejudices, we introduced a question to find out whether developers ever get paid for their effort. It is generally presumed that Libre Software development is done during spare time for intellectual profit only. But we also know that there are a certain number of companies which are interested in or already using Libre Software and which have personnel dedicated to its further development. And among those not being paid, how many count on the experience documented by their participation in Libre Software projects becoming a key qualification on the labor market someday, so that they will finally be financially compensated? So we decided not to leave this question with a simple Yes-or-No answer, but to have the participant explain his situation in a little more detail by offering a range of answers including "rise in salary" and "hoping to get paid in the future".

In order to learn how developers perceive the conditions for Libre Software development in their country, we asked them explicitly, adding a text field to encourage them to put their thoughts in their own words.

Next, we return once again to the professional situation of developers, since rumour has it that developers get involved in Libre Software projects because they are not satisfied with their work, probably having too little freedom to do things the way they would like to. We used two questions on this matter. The first tried to find out whether the participant felt "accepted" in his company for what he is creating, and the second asked explicitly about the participant's morale on the job.

Like in previous questions we offered more options than a simple Yes-or-No scheme, with possible answers for the former question of this pair ranging from "my company uses my applications" to "my boss does not even know what the GPL is". Yet, each sub-group of the Libre Software community is very self-aware and intervenes whenever they feel ignored. So we somehow expected the comments we received concerning this question: visitors complained that there were more license models than the GPL. That is perfectly true, but in enumerating each and every available copyleft license the sense of the question would have been veiled, while we simply wanted to express: the boss has no idea what this all is about.

We discussed at length whether or not to include the next question. When it comes to money, openness often comes to an end. We even risked losing participants who might find the survey too indiscreet. But curiosity got the better of us, and amazingly we received more comments demanding a higher scale than complaints concerning discretion.

To conclude this block, the last question asked how much time the participant dedicates to developing Libre Software.

Computer science experience and skills

The third and last section of this survey is dedicated to technical issues. But first, we wanted to know how many Libre Software projects the participant is currently involved in. We wanted to learn how scattered the effort in Libre Software development is. Generally, participation is said to be focused and short-lived, yet we know of projects with a high rate of volatility. Unfortunately, this single question is far from clarifying this context. But we had to keep in mind not to bore participants with too many questions.

The next questions dealt with the development environment. Although we used a lot of our creativity trying to think of each and every possible tool, we received many comments complaining that our lists were incomplete, that we had listed tools that were not standardized, and the like. If there is only one finding here, it is that we can confirm a strong affinity of developers for their tools.

But a detailed investigation of the technical knowledge of developers was never our aim, even if Libre Software developers like talking about it; we want to use these answers only as a very rough indicator of general skill. In fact, by asking technical questions we achieved one important side effect: the participants became enthusiastic about our study, which is vital given that we also had to rely on word-of-mouth propaganda on the Internet to reach a representative participation. We will comment on this subject again later on.

More interesting, and an important proof of the significance of the Debian distribution stated earlier, is the following question on the operating system or distribution that developers use. Following requests on the forum, we later added HP-UX and TWO. Yet we were not able to consider demands to distinguish between Mac OS X and Mac OS, as these requests reached us too late.

Besides the favored distribution, equally fierce discussions are going on among Unix users about the best desktop and the best (text) editor. While the benefit of those questions in the context of our study is rather small, it was once again all the more important to offer the participant a possibility to get personally involved with the questionnaire. Yet we might discern probable network effects in desktop distribution by correlating the results of this question with the developers' country of residence.

The last question is ambiguous, for a good reason. After the participant had spent the preceding five to ten minutes filling out the questionnaire, we hoped to get an intuitive answer on the philosophical preferences there might be on this subject, which is no less fiercely discussed. But discussions on the Internet, in newsgroups or forums, tend to drift off topic quickly, and it is possible that, when asked objectively, participants do not have a preference at all.

We did not want to clarify those terms, the differences of which might not be apparent to everybody. Many participants commented in our forum on the fact that "Open Source" was the first option in the list, which proves a strong involvement in the aforementioned discussion. Yet our statistics show that "Free Software" was nevertheless chosen more often.

4.4.2 Announcements And Press Releases

The English press-release:

"There is plenty of literature dealing with the process of developing Open Source and its economical significance. And there are also piles of text about the tools that are used in or for Open Source... but, what do we know about all those committed developers? About where they come from, what project they work for, what they know...?

To cut it short, among all speculations there is little to no empirical data. That's why a research group at the Technical University of Berlin has started Widi (Who Is Doing It?), a questionnaire aimed at the strong community of Free Software and Open Source developers to tell something about their social, cultural and professional environment. This data will complete automatic examination of gigs of source code by means of freely available tools the research group is refining.

Although Widi is basically a scientific research, politicians got interested in the last weeks. The intermediate results have already been used at a German congress session to fight against software patents in Europe. A good knowledge about what Free Software/Open Source is and who is behind it, may encourage Governments to change their opinion on this topic and to support Free Software/Open Source."

The press-releases in other languages (e.g. German, French, Spanish) are almost the same (the only changes reflect country-specific environment).

4.4.3 Problems / Possible Improvements

Two major concerns about the survey preoccupied our minds: firstly, we had to design a sensible questionnaire, which no one in the team had ever done before, and secondly, maybe even more important: "How would we ensure a high participation?" The most sophisticated questionnaire is not worth a penny if the response rate is not high enough.

We have commented on the former concern above. Thanks to a lot of criticism on the survey's forum, we were able to add or correct some options in existing questions. Most postings referred to the section of technical questions, which was of less importance, but this again shows the personal involvement of participants in this matter. Of more relevance to the aim of knowing more about the personal situation of developers are questions belonging to the first two sections of the study that are missing completely, either because we did not have the idea at all, or because looking at the results has thrown up new questions. Examples are marital status and number of children. It should not be forgotten, however, that our initial objective was to show the international distribution of Libre Software developers.

The second concern, or rather our attempt to solve the problem, indeed shows more obvious effects. To reach the target group, we came up with a press release in different flavors, depending on the nature of the forum (news site, mailing list, newsgroup, etc.), and in four languages (English, French, Spanish and German), which were meant to cover most of the western hemisphere, and did. As the server statistics showed, each announcement had a considerable effect on the page hits of the Widi survey. Unfortunately, we did not contact any Asian news sites. And while our reputation hurried ahead without our aid on the most important news site, "Slashdot" [http://www.slashdot.com] (based in the United States of America), which resulted in a peak of page hits that temporarily brought down the server, we were not able to observe a similar effect for the Asian continent. Thus, the land of the rising sun once more remains in the dark.

4.5. Summary: Strengths and weaknesses

SourceWell
  Positive
  • Point of view for OS/FS applications
  • Statistics available through web
  • Part of database related to this study released
  Negative
  • Big and small projects are treated equally
  • TLD problem
  • Only main author/maintainer or organization/company name
  • Only 1100 applications (from about 50000 projects rated)
  • Stable/unstable (applications are counted two times)
  To improve
  • Developer database (as in Codd)
  • Unify project branches (development/stable) into one

Debian Database
  Positive
  • Exact numbers
  • Big and known OS project
  • Data available to everybody (db.debian.org)
  Negative
  • All contributions are equally treated
  • Small group (650 members out of 250 thousand)
  • Residence instead of nationality
  • Requires admission: local network effects possible
  To improve
  • We need to have it updated

Codd
  Positive
  • Automatic research
  • High amount of source code researchable => very realistic
  • Gives developer contribution in bytes
  • Previous studies have used it
  • FS/OS application (everybody can confirm our results)
  Negative
  • Top-level domain problem (high rate of "unknown")
  • There is no standard way of giving authorship in source code
  • Algorithm has to be improved
  • Libraries are listed multiple times
  To improve
  • Create a developer database with name, login, e-mail addresses and country
  • Improve algorithm

Widi
  Positive
  • Extended direct information from developers
  • Social, educational and political perspective
  • The database will be released
  Negative
  • News sites => everybody?
  • Accuracy?
  • Software problems
  To improve
  • Correlate data to ensure correct insertions
  • Improve correlations to obtain more information
  • It could be possible to have a weight with the number of projects given

5. Results

After having seen the different tools and sources used in this study, and their strengths and weaknesses, it is now time for the most important part of this study: the results.

5.1. SourceWell

In the following section, we give the results obtained from the BerliOS SourceWell database. The data were taken on the 20th of August from BerliOS SourceWell [http://sourcewell.berlios.de]. On that day the BerliOS SourceWell database contained exactly 1136 applications. There are on-line statistics where you can follow the current status [http://sourcewell.berlios.de/stats.php3], as this database grows day by day.

5.1.1 Distribution by e-mail domain

As previously stated, the SourceWell part is based on the e-mail domain of the main developer or maintainer. The following pie chart gives us an idea of the distribution of top level domains. 1081 e-mail domain entries have been used for this part, as some applications have an empty author's e-mail address field.

sourcewell-email-domains

[Graph: Top domains listed]

As previously stated, there are several top level domains that we cannot assign to a certain country. If we sum up these domains as unknown, we get the next graphic. Note that the "Other" slice has decreased, as we have also counted minor domains like .cx (Christmas Island), for example, as unknown. The reason for doing so is that there are many popular free e-mail providers with that domain.
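The grouping described above can be sketched in a few lines of Python. This is only an illustration and not the code used for the study: the function names and the exact set of unassignable domains are assumptions, since the paper does not list every domain that was counted as unknown.

```python
from collections import Counter

# Assumption: top level domains that cannot be assigned to a country.
# The paper mentions .com, .org, .net, .edu and .cx; the full list
# actually used is not given.
UNASSIGNABLE_TLDS = {"com", "org", "net", "edu", "cx"}


def tld_of(email):
    """Return the lower-cased top level domain of an e-mail address, or None."""
    _, sep, domain = email.strip().rpartition("@")
    if not sep or "." not in domain:
        return None  # empty or malformed address
    tld = domain.rsplit(".", 1)[1].lower()
    return tld or None


def tld_distribution(emails):
    """Count entries per top level domain, summing unassignable ones as 'unknown'."""
    counts = Counter()
    for email in emails:
        tld = tld_of(email)
        if tld is None:
            continue  # entries without a usable address are skipped
        counts["unknown" if tld in UNASSIGNABLE_TLDS else tld] += 1
    return counts
```

Entries with an empty address field are simply skipped, which mirrors the remark that only part of the applications carry an author's e-mail address.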

One last word on the top level domains. After having a look at the database, we expect a certain positive shift for the .org and .net domains. This is due to the fact that only one e-mail address is inserted for each application, and this address has to be as generic as possible. Therefore, many projects from KDE (we have found 39 entries for kde.org), GNU (30 entries for gnu.org), GNOME (10 entries for gnome.org) and Apache (8 entries for apache.org) list a mailing list as author, which blurs our results. These four .org domains add up to 87 entries. Note that this does not mean that all 87 entries are mailing lists or generic e-mail addresses; many developers have personal addresses at these organizations' domains. We only want to point out that a shift exists for this domain, without knowing exactly how big it is.

The explanation for the .net domain is different: many developers do not give their personal e-mail address when inserting applications, but the one SourceForge provides when using its services. A quick look at the entries in our database that end in sourceforge.net shows that this happens in 54 cases.

This means that we expect the real share of the country domains (.de, .fr, .au), together with .com and .edu (which are far more common as personal domains than .org and .net), to be higher than the pie chart of any announcement system suggests. To be accurate, we have to say that the German .de domain is also slightly shifted, as 24 entries come from the European SourceForge clone that BerliOS operates. This shift is not as big as the one for .org and .net, but it means that the .de domain is probably not as negatively affected as domains like .fr, .uk, .au, etc.

sourcewell-email-domains-unknown

[Graph: Top domains listed and unknown summed up]

Many may think that .com, .org, .net and .edu are domains used mostly in the US. We can see from Widi that this assumption is far from reality. Another interesting point is the comparison of these results with the results obtained from the other parts of the study. It shows that all of them, even though they were obtained in different ways, follow a similar distribution.

5.1.2. Used licenses

BerliOS SourceWell also gives us some interesting statistics on the licenses Libre Software developers use. In the following graph, you will see the distribution of all the licenses that are listed on the Open Source site as fulfilling the Open Source Definition and that have at least one application in the database. Notice that not all of them are considered Free Software by the Free Software Foundation. For further details, see "Various Licenses and Comments about Them" [http://www.gnu.org/philosophy/license-list.html].

As you can see from the results, the GNU General Public License is by far the most used Libre Software license. This license is said to preserve freedom in the long term, in contrast to the BSD type licenses, where you have the right to do anything with the software (even make it proprietary) provided that you keep the authorship notices.

The Open Source licenses that are not seen as Free Software by the Free Software Foundation also have only a limited number of applications, mostly the applications from the companies that have published the license (Mozilla by Netscape, QPL by Trolltech, PHP License by Zend).

sourcewell-licenses

[Graph: Number of Applications per license in BerliOS SourceWell]

The interested reader will find more statistics in the BerliOS SourceWell statistics page [http://sourcewell.berlios.de/stats.php3].

5.2. Debian

We now show the results from the data taken from the Debian Developer database on the 29th of June 2001. At that time, Debian had 706 developers from 41 different countries. The first diagram for the Debian distribution shows the top countries in which Debian developers live. We can see, as other sources will also show, that nearly 66% of the developers come from the top 4 countries.

debian-distribution-top

[Graph: Debian developer distribution by countries (of residence). 706 Debian developers]

We also think it is interesting to see the distribution by continents. We split the European continent into the countries that form the European Union and the rest in order to obtain some interesting information. From the results we can see that while the United States is the country that contributes the most Debian developers, it is the European continent where most of them come from. The 16 countries that form the European Union even have more developers than North America.

It is also worthwhile to note that there are many Debian developers in Asia. The reason for this is that Debian is a very popular distribution in Japan, where we can see a big community.

debian-continents

[Graph: Debian developer distribution by continents (of residence). 706 Debian developers]

But while a big majority of the Debian developers come from a few countries, we thought that comparing the number of developers in each country with its population would give us another, very interesting point of view. The population statistics have been taken from the UN projections for the year 2000 [Population numbers from the Population Division - Department of Economic and Social Affairs, United Nations : http://www.un.org/popin/wdtrends/pop1999-00.pdf].

We are able to draw many conclusions from the resulting diagram. Firstly, it would be interesting to compare these results with ones from the proprietary software world. We would expect very different results. Secondly, it would also be worth studying why there are countries with such a high density of Debian developers. The information obtained this way could be used to achieve a wider adoption of Debian worldwide.

The results show a very high share of Debian developers in Northern European countries (Finland, Sweden, Norway, Netherlands) and in Oceania (Australia and New Zealand). It would be interesting to know the causes of such a high penetration.

It is also interesting to note that the mean for the countries of the European Union is one developer for every 1,226,000 citizens, slightly behind the U.S. This is due, for example, to countries like France, which have a large number of Debian developers in comparison to other countries, but whose per capita number of developers is very low (June 2001: 1 developer for every 2,049,000 French citizens). France is a country that generally contributes a lot to Libre Software development, but, for some reason we do not know, its involvement in Debian development does not follow that pattern.

In the following diagram, the x-axis shows the countries with the lowest number of citizens per Debian developer, while the y-axis gives the number of citizens per developer in thousands. Notice: in Germany there is one Debian developer for every 924,000 German citizens.
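The per capita figure is simply the population of a country divided by its number of developers, expressed in thousands. The following Python sketch illustrates the calculation and the ranking used for such a diagram; the function names are hypothetical and any input data would have to come from the UN population figures and the Debian database.

```python
def citizens_per_developer(population, developers):
    """Return the number of citizens (in thousands) per developer."""
    if developers <= 0:
        raise ValueError("a country needs at least one developer")
    return population / developers / 1000.0


def rank_by_density(populations, developer_counts):
    """Return (country, thousands of citizens per developer), densest first.

    Both arguments are dicts keyed by country; countries without
    developers or without a population figure are left out.
    """
    ratios = [
        (country, citizens_per_developer(populations[country], count))
        for country, count in developer_counts.items()
        if count > 0 and country in populations
    ]
    # A lower number of citizens per developer means a higher density.
    return sorted(ratios, key=lambda item: item[1])
```

Note that a country with very few inhabitants and a single developer will end up at the top of such a ranking, so small absolute numbers should be read with care.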

debian-per-capita

[Graph: Top Debian per capita countries of residence. 706 Debian developers]

We want to finish the part dedicated to Debian by showing the developer map from their web site [http://www.debian.org/devel/developers.loc]. On this map, all Debian developers are invited to enter their coordinates. The Free Software Foundation has adopted this idea and has its own site where Free Software developers are invited to give their position on the globe. Looking at the map, we can draw a quick conclusion: Debian, like the whole Libre Software community, lacks developers from Third World countries.

developers map

[Graph: Debian Developers Map http://www.debian.org/devel/developers.loc]

More on Debian can be found at "Debian Developer Centre of Mass" [http://people.debian.org/~edward/average/].

5.3. Codd

Package Size: +1067619717 bytes.

codd-nationalities

[Graph: Contributions by nationalities. Package Size: +1067619717 bytes]

codd-continents

[Graph: Contributions by continents. Package Size: +1067619717 bytes]

5.4. Widi

Widi was opened to the public on the 24th of June 2001 and closed on the 14th of August. There are four entries from before the "official" release, as we ran some tests beforehand. On the 24th of June our press release was posted on the front page of BarraPunto.com [http://www.barrapunto.com/] (the Spanish Slashdot), and we used these first entries to beta-test the system. The two major peaks were due to announcements on the German news site Heise Newsticker [http://www.heise.de] on July 6th and the U.S. American news site Slashdot [http://www.slashdot.com] on July 9th.

database entries

[Graph: database entries per day. Total entries: 5478]

Database entries per hour

database entries per hour

[Graph: database entries per hour. Total entries: 5478]

5593 developers filled out the questionnaire during that time. We had a quick look at the entries in order to eliminate the most obvious cases in which participants tried to manipulate the database with double entries. Thus, the final number of entries decreased to 5478. As all the questions were optional, the results for each item have a variable number of entries, but almost all of them are over 5000. We are very happy about this, as it shows that the questions were interesting. In addition, we found only a very small number of probable hoax entries (a woman born in Micronesia who lives in Denmark and speaks only English and Spanish, or developers born before 1901), which we left in, even though they are very improbable.
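The clean-up removed 5593 - 5478 = 115 entries. The paper does not say how double entries were recognized, so the following Python sketch only illustrates one plausible approach: an entry is dropped when all of its answers are identical to those of an earlier entry. The field names `id` and `timestamp` are assumptions and not taken from the Widi database.

```python
def drop_double_entries(entries, ignore=("id", "timestamp")):
    """Keep the first of each group of entries with identical answers.

    `entries` is a list of dicts, one per submitted questionnaire.
    Fields named in `ignore` (hypothetical bookkeeping columns) do not
    take part in the comparison.
    """
    seen = set()
    kept = []
    for entry in entries:
        # Build an order-independent, hashable fingerprint of the answers.
        key = tuple(sorted(
            (field, value) for field, value in entry.items()
            if field not in ignore
        ))
        if key in seen:
            continue  # same answers as an earlier entry
        seen.add(key)
        kept.append(entry)
    return kept
```

A check like this only catches exact repetitions; entries that were altered slightly between submissions would still need a manual look.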

You can also see the Widi statistics online [http://widi.berlios.de/stats.php3]. At the Widi web site you can "play" with the results (you can view the different results for countries with more than 15 entries, for example). If you are more curious, you may download the software and use it at your convenience, as it is published as Free Software. Widi's databases (the original one and the modified one with the multiple entries deleted) will also be released with this paper and are available for download. We think that many more interesting correlations are possible with the collected data. Feel free to use all of it.

A list of the news sites where we announced Widi is at your disposal. Together with the web server log and the timestamps in the database, it helps to find out where developers came from.

Admittedly, one of the main questions remains the validity of the data. But where possible we compared results with the other sources of this study and noticed a satisfying similarity.

In the following paragraphs you will find the results that we have gathered from the Widi survey, often referring to the previous sources and tools in order to compare them. We have also tried to find interesting correlations, yet a thorough analysis has to be carried out by more experienced statisticians.

5.4.1. Personal data

Nationality

5478 participants answered this question. As stated in the previous parts of this study, a few countries contribute a great number of developers to the Libre Software community, although people from 94 different countries have taken part. We present a diagram with the top 15 countries.

Canada's fourth position is surprising, as in the other parts of this study Canada has been far less important than the United Kingdom or the Netherlands. It is also worth noting the presence of Brazil near the top. On the other hand, we would have expected a higher Japanese participation than 4 developers! Either Japanese developers do not read Slashdot, the most important international site, or there are indeed very few of them. We regret not having posted to Japanese news sites.

nationality

[Graph: nationality. Total entries: 5391 out of 5478]

The next graph groups all countries of the European Union. If we compare these results with the ones given by the Debian database, we can see that the results have shifted towards a stronger European participation. There are two possible reasons for this. On the one hand, the Debian distribution may be biased towards developers living in North America, as the project was started there, and countries like France are not as well represented as expected. On the other hand, Widi was most certainly promoted more on European sites.

continents

[Graph: "Continents". Total entries: 5391 out of 5478]

When comparing the number of developers in each country to the population of that country, the result is surprising: Liechtenstein has the highest density, and in second place we find the Faroe Islands. The Faroe Islands are not a country listed by the UN, but as they have their own top level domain they appeared in Widi's country list. Of course, given the small populations and a single developer in each of the two countries, the result is not so surprising anymore.

Looking further, the countries with the highest density are located in three different regions: Northern Europe (Faroe Islands, Finland, Norway, Sweden, Denmark), Central Europe (Liechtenstein, Germany, Switzerland, Netherlands, Austria, Czech Republic) and Oceania (Australia and New Zealand). Only Canada does not belong to these three regions.

The diagram gives the number of citizens (in thousands) per developer.

per-capita-top

[Graph: Developers per capita - Highest rate. Total entries: 5391 out of 5478]

As many of the countries with high absolute numbers of developers do not appear in the diagram above, we include another one, which shows the number of citizens (in thousands) per Libre Software developer for the countries with the most developers.

per-capita-top-nationality

[Graph: Developers per capita - Top countries on number of developers. Total entries: 5391 out of 5478]

Finally, the diagram below shows the absolute number of entries for each country of residence.

residence country

[Graph: country of residence. Total entries: 5342 out of 5478]

Migration

The differences between nationality and country of residence are quite interesting. The following diagrams show which countries have lost manpower and, below, which of them seem to offer a more convenient environment for Libre Software developers. Overall we count 705 developers who left 78 countries and immigrated to 49 countries.

emigration

[Graph: emigration. Total entries: 5342 out of 5478]

immigration

[Graph: immigration. Total entries: 5342 out of 5478]

Finally, we compare the numbers of immigration and emigration to see which countries have actually gained (immigration > emigration) or lost (emigration > immigration) developers. It may not be at all surprising that the European Union has lost a total of 111 developers.
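The comparison can be expressed as a net balance per country: immigration minus emigration. The Python sketch below shows one way to derive it from pairs of nationality and country of residence; it is an illustration under the assumption that a developer counts as migrated whenever both values are given and differ, and the function name is hypothetical.

```python
from collections import Counter


def migration_balance(developers):
    """Return {country: immigration - emigration}.

    `developers` is an iterable of (nationality, residence) pairs.
    Pairs with a missing value, or with identical nationality and
    residence, do not count as migration.
    """
    emigration = Counter()
    immigration = Counter()
    for nationality, residence in developers:
        if not nationality or not residence or nationality == residence:
            continue
        emigration[nationality] += 1   # developer left this country
        immigration[residence] += 1    # developer moved to this country
    countries = set(emigration) | set(immigration)
    return {c: immigration[c] - emigration[c] for c in countries}
```

A positive value means the country has gained developers and a negative value means it has lost them; summed over all countries, the balances cancel out to zero.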

positive migration

[Graph: Countries with positive migration flow. Total entries: 5342 out of 5478]

negative migration

[Graph: Countries with negative migration flow. Total entries: 5342 out of 5478]

Year of birth

The following diagram shows the distribution of the year of birth, which peaks at the age of 22. The mean is located at 1974.1, so that the "average" developer is 27 years old. Obviously, the Libre Software community is not made up of teenagers.

year_of_birth

[Graph: Year of birth. Total entries: 5326 out of 5478]

To accentuate the image even more, we clustered birth dates in decades, showing developers born in the seventies are by far the biggest group. But notice also that there are more developers born in the eighties (20%) than there are born in the sixties (16%).
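Both figures follow from simple arithmetic: the average age is the survey year minus the mean year of birth (2001 - 1974.1 = 26.9, or roughly 27), and the decade of birth is the year rounded down to a multiple of ten. A minimal Python sketch, with hypothetical function names:

```python
from collections import Counter


def average_age(birth_years, survey_year=2001):
    """Return the survey year minus the mean year of birth."""
    return survey_year - sum(birth_years) / len(birth_years)


def decade_shares(birth_years):
    """Return {decade: share of developers born in that decade}."""
    # 1974 // 10 * 10 == 1970, so every year maps to the start of its decade.
    counts = Counter((year // 10) * 10 for year in birth_years)
    total = sum(counts.values())
    return {decade: count / total for decade, count in sorted(counts.items())}
```

Applied to the 5326 answers to this question, `decade_shares` would give the proportions shown in the decade diagram.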

decade_of_birth

[Graph: Years of births per decade. Total entries: 5326 out of 5478]

Gender

Well, at least, there is one prejudice which still holds...

gender

[Graph: Gender. Total entries: 5272 out of 5478]

Top Level Domain

tld-top

[Graph: Top level domain distribution. Total entries: 5185 out of 5478]

tld-pie

[Graph: Top level domain distribution - pie. Total entries: 5185 out of 5478]

We have made the next graphics to show the distribution of the gTLDs across the different countries. This can be useful for making correlations with the other data sources. Note that the total number of entries considered for each TLD is lower than in the previous diagram, because we have only considered developer entries that give both e-mail domain and nationality.

tld-pie

[Graph: .com domain distribution. Total entries: 1433 out of 5185]

tld-top

[Graph: .org domain distribution. Total entries: 696 out of 5185]

tld-top

[Graph: .net domain distribution. Total entries: 558 out of 5185]

tld-top

[Graph: .edu domain distribution. Total entries: 287 out of 5185]

We have also made a pie diagram for a ccTLD. As .de is the most frequent one, we chose it.

tld-top

[Graph: .de domain distribution. Total entries: 796 out of 5185]

Mother tongue

The language of computer science has always been English, but English with a foreign accent. Almost every second developer speaks English as her mother tongue, followed by German and French. Surprisingly, the fourth most spoken language, even ahead of Spanish, is Dutch.

mother_tongue

[Graph: Mother tongue. Total entries: 5369 out of 5478]

Spoken languages

Almost every developer speaks English, which suggests that language is no hindrance when applying as a Debian developer. It is followed by German, French and Spanish.

language_spoken

[Graph: Which languages do developers speak. Total entries: 12847 from 5478 developers]

Profession

Looking at the diagram showing developers' professions, it is interesting to see the high number of software engineers and programmers, which together add up to almost 50%. We grouped university professors and assistants under the term "University" (IT or other).

profession

[Graph: profession. Total entries: 5333 out of 5478]

As you may have already noticed from the previous diagram, we have divided the possible entries into the IT branch and all the other ones. In the next pie we see that the Libre Software movement is strongly tied to the IT sector, with almost 80% of the developers related to it. But compare this situation to architecture and imagine 20% of all houses being designed by people without specific training as architects!

profession_sector

[Graph: Profession by sectors]

Another interesting point is the role universities play in Libre Software. Historically, universities have been the starting and meeting point for many famous developments and ideas. Examples are Richard Stallman's time at MIT, the development of The GIMP (in Berkeley, California) and many development projects at German universities. When we put together students (from IT and other branches), professors, and assistants, one out of three developers is related to a university.

profession_university

[Graph: profession in relationship with universities]

The last results show that the Libre Software community is made up of a mixture of highly qualified workers (software engineers, programmers), people related to universities and those from other branches.

Qualification

Libre Software developers are very well educated people. It is interesting to see that a movement composed of so many young people has such a high level of qualification. An interesting question is how many of the high-school and A-Level graduates are currently studying and thus on their way to obtaining a university degree.

qualification

[Graph: qualification. Total entries: 4872 out of 5478]

Being paid

20% of all developers are being paid for developing Libre Software.

paid?

[Graph: Are developers being paid? Total entries: 5015 out of 5478]

Profited

Even if not being paid, many expect to profit from the experience they get from developing software. The next diagram can be read in several ways:

Firstly, it shows that the number of developers who have profited from developing Libre Software and the number who say they have not are almost equal.

Secondly, it tells us that one of every two developers who have not profited so far is optimistic about this happening in the future.

Thirdly, we can see that getting a job for having developed Libre Software is quite normal, but that it is very improbable to get a raise in salary for doing it. Maybe this is a failing of the many companies that do not see it as a great investment in better trained and more satisfied employees.

profited?

[Graph: Have developers profited professionally because of their engagement in Libre Software?. Total entries: 5292 out of 5478]

Boss

The most usual situation is a boss who knows that an employee develops Libre Software and does not care about it. But if we add up the bosses who use software developed by their employees in their companies and those who simply encourage them to develop it, the number is even higher than in the first case. One fourth of the superiors do not know that they have Libre Software developers in their team.

boss

[Graph: What does the boss think about the developer's engagement in Libre Software?. Total entries: 5194 out of 5478]

Job

Libre Software developers love their job. That is what almost one of every two says. If we add the ones who find it interesting, we get percentages close to 80%. This is surprising, as it is commonly said that one of the reasons why Libre Software exists is that unhappy developers do their own development in their spare time. And it is even more surprising if we bear in mind that we are no longer living in the euphoric situation that the IT sector enjoyed in the past years. On the contrary, nowadays we have a more pessimistic view of the economy in general and of the IT branch in particular.

job

[Graph: Do you like your job?. Total entries: 5153 out of 5478]

Opportunities

Only one fourth of the consulted developers think that there are more opportunities abroad than in their own country. Somewhat more think the situation is similar, and almost one half think it is better in their own country. The answers are far from what we expected. Our explanation is that in countries where Libre Software is established, developers feel that it is better at home than abroad, which is logical. On the other hand, in countries where the movement is taking its first steps, developers are also very optimistic: they think that in the near future there will be many opportunities surrounding this evolution, and that the situation is not as bad as one might think, or as developers from the first group of countries might think.

opportunities

[Graph: Are there enough opportunities in your country?. Total entries: 4333 out of 5478]

Income

As we can see from the diagram, and due to the large number of students in this community, the yearly income distribution is very constant. We think this is the question with the least reliable results, as we noticed that many participants stated an income far higher than is plausible.

yearly income

[Graph: Yearly income. Total entries: 4627 out of 5478]

Correlation between like your job and income

Hours per week

We can see one of the results of the Libre Software paradigm in the next graphic. Very few developers spend more than 20 hours a week developing, and most of them spend less than 10 hours.

hours per week

[Graph: Hours developing per week. Total entries: 5233 out of 5478]

Number of projects involved in

These results merely complement the previous ones. It is curious that, while the function is otherwise monotonically decreasing, there is a peak at 10 projects. If we look at the full chart on the Widi statistics web page [http://widi.berlios.de/stats.php3], we can see that this happens again at 20 projects. The explanation for this is that developers who are involved in many projects entered an approximate number.

number of projects

[Graph: Number of projects involved in. Total entries: 4843 out of 5478]

Favorite distribution / operating system

Debian is the favorite distribution for developers. It is interesting to note that Debian is the favorite one in almost all countries, while Red Hat is the second one in the US and SuSE is widely used in Germany.

It is worth noting how many have given Windows as their favorite operating system, as this platform is generally frowned upon among FS/OS developers. The number of BSD preferences is also very high in comparison to the little media interest that these systems usually receive.

distribution

[Graph: Favorite distribution. Total entries: 5317 out of 5478]

If we sum up all the GNU/Linux distributions, we will find that a great majority of developers find them as their favorite ones. All the UNIX-type systems would have almost 90% (GNU/Linux, BSD type, Solaris and HP-UX).

system

[Graph: Favorite System. Total entries: 5317 out of 5478]

Favorite Desktop

This pie speaks for itself: GNOME and KDE have conquered the UNIX desktop and have almost the same share of it (GNOME has a slight advantage). The number of developers who specify Mac and Windows as their favorite desktops is interesting, and those who say that they prefer the console are a curiosity.

desktop

[Graph: Favorite desktop. Total entries: 5261 out of 5478]

Favorite Editor

In a few words: The Emacs vs. vi war is over.

editor

[Graph: Favorite editor. Total entries: 5345 out of 5478]

Open Source or Free Software?

It is worth noticing that in French and Spanish speaking countries the term "Free Software" (logiciel libre in French or software libre in Spanish) is the preferred one. This is one of the reasons for the creation of the new term "Libre Software" that we have been using in this paper [http://libre.act-europe.fr/].

fs_or_os

[Graph: Open Source or Free Software. Total entries: 5104 out of 5478]

Languages and Tools

One of the most interesting results of this question is that there is almost no difference between countries and geographical regions. Developers in Europe have the same knowledge distribution as the ones in America and this can be compared up to country level, where we can see that the diagram is approximately the same.

tools and langs

[Graph: Tools and programming languages developers are experienced in. Total entries: 5392 out of 5478]

6. Conclusions

The first interesting result is the good position of Europe in the global context. This was shown by each source we accessed, most clearly by the Widi survey, although that might be due to a disproportionate amount of publicity on European sites. What is alarming, however, is the considerable number of developers who have left Europe, for reasons that we did not ask about, to work in the United States.

Furthermore, this paper has shown that many prejudices that surround Libre Software developers are not true. Although developers are indeed young and male on average, most of them have a good education and develop for a reasonable number of hours a week, and, surprisingly, the number of those who have profited from developing Libre Software, be it professionally or monetarily, is quite high.

7. About the authors

Gregorio Robles is a Spanish exchange student who is doing his diploma thesis at the Technical University of Berlin in contact with BerliOS. He's implemented almost the whole Widi system. He is also coauthor of the SourceWell system.

Niels Weber will be doing his diploma thesis as a continuation of this study, combined with data from other studies and data gained by other means. He was this group's main CODD hacker.

Ingo Tretkowski is a Computer Science student at the Technical University Berlin and works as a web-application developer at Siemens Business Services Germany.

Hendrik Scheider is a student of computer science at the Technical University of Berlin, with a focus on legal and ethical implications of new technologies, and has worked as a web-applications developer in Germany and France. His main tasks were the translation and design of the Widi questionnaire.

8. References

8.1. Other references related to this subject

The Site for Libre Software Developers [http://libre.act-europe.fr/]

Josh Lerner and Jean Tirole, Harvard Business School and NBER, 29.12.2000 "The Simple Economics of Open Source" http://www.people.hbs.edu/jlerner/simple.pdf

"Why Open Source Software / Free Software (OSS/FS)? Look at the Numbers!", David A. Wheeler, August 3 2001, http://www.dwheeler.com/oss_fs_why.html

Population numbers from the Population Division - Department of Economic and Social Affairs, United Nations : http://www.un.org/popin/wdtrends/pop1999-00.pdf
There are new numbers: World Population Prospects: The 2000 Revision - http://www.un.org/popin/

8.2. Software

Widi online statistics & form: http://widi.berlios.de

Widi Project Page: http://widi.berlios.de/html

Codd Project Page: http://codd.berlios.de

SourceWell Project Page: http://sourcewell.berlios.de/html

8.3. Databases

Widi databases: http://widi.berlios.de/database.php3

Dump 20th August BerliOS SourceWell: http://widi.berlios.de/database.php3

Debian developer database: http://db.debian.org

BerliOS SourceWell: http://sourcewell.berlios.de

9. Other studies on this topic

Although our study has covered as many aspects as possible, we agree that there are other methods that might be interesting when analyzing this topic. We have looked at all the studies we know of that have been done in this area until now. In the following paragraphs we describe them briefly in order to give the reader an idea of other approaches to this subject.

Free Software Foundation - View the community

The Free Software Foundation has started a project to visually represent the geographical positions of members of the Free Software community. It is based on the application used for the Debian Developer map that has been examined in this paper. A developer has to register to become a member. Once this is done, he can fill out a form giving his coordinates. Results are displayed with little delay on a world map that is available in several sizes.
[http://france.fsfeurope.org/coposys/index.en.html]

TPM - The Trinity Participation Metric

TPM - The Trinity Participation Metric is a metric for Libre Software developers' participation. This metric uses source code, mailing list messages and posted bugs to obtain its results. The three inputs are weighted according to their relative importance to achieve better results. So far, the TPM has been applied to only one Libre Software application, the GIMP.
[http://www.cse.ucsc.edu/~alison/projects/cmpe276/index.html]

Linux Study

The Linux Study is a scientific analysis of the processes involved in Linux kernel development from a social science perspective. This research was based on a questionnaire that was filled out by almost 150 Linux kernel developers.
[http://www.psychologie.uni-kiel.de/linux-study/]