Knowledge Engine by Wikipedia

Posted on February 15, 2016 by wittylama

This is a revised and updated version of an email I first wrote on the Wikimedia-l mailing list three days ago.

After sustained pressure from the Wikimedia community, the formal agreement for the Knight Foundation to provide $250,000 towards year-one of the “Knowledge Engine by Wikipedia” project was released this week. You can read it for yourself here.

This document specifically and overtly states that its purpose is to start work on a search engine in opposition to Google/Yahoo. More importantly, building such a thing is a potentially valid way of responding to current trends online. Specifically: the shift to mobile; decreased pageviews; and Wikipedia-derived information being displayed in the search results on Google. Again, the issue should not be about defining what a “knowledge engine” is, but that the project was envisaged in secret. As summarised in The Signpost:

In a November 4 email to all WMF staff, provided to the Signpost by several WMF staffers, Executive Director Lila Tretikov expressly stated that the Knowledge Engine “is NOT … a search engine”.
On February 11, Jimmy Wales stated on his talkpage: “To make this very clear: no one in top positions has proposed or is proposing that WMF should get into the general ‘searching’ or to try to ‘be google’. It’s an interesting hypothetical which has not been part of any serious strategy proposal, nor even discussed at the board level, nor proposed to the board by staff, nor a part of any grant, etc. It’s a total lie.”

In contrast to these statements, this is what is written in the actual text of the Knight Foundation grant:

“Knowledge Engine by Wikipedia will be the internet’s first transparent search engine, and the first one originated by the Wikimedia Foundation”. It will, “democratize the discovery of media, news and information – it will make the Internet’s most relevant information more accessible and openly curated, and it will create an open data engine that’s completely free of commercial interests. Today, commercial search engines dominate search engine use of the internet…” [p10]. “The project will pave the way for non-commercial information to be found and utilised by internet users” [p2]

At the bottom of page 13, the primary risk identified is “Third-party influence or interference”, namely that “Google, Yahoo or another big commercial search engine could suddenly devote resources to a similar project”. As SarahSV pointed out, if the “Knowledge Engine by Wikipedia” is only about improving the inter-connectedness of the Wikimedia sister projects by improving how internal systems work – which no one is disputing is a very useful goal – then Google/Yahoo releasing a new search engine product would not be counted as the project’s “biggest challenge”.

[mockup of a knowledge engine search result page, as presented to the Knight Foundation in April 2015]

Jimmy Wales declared that “suggestions this is some kind of broad google competitor remain completely and utterly false” and simplistic reporting from mainstream media implies that this “search engine” is effectively synonymous with what-Google.com-looks-like. However, the proofs of concept Ask Platypus and Tuvalie, and of course the only other thing that actually calls itself a “knowledge engine” – WolframAlpha, are all valid alternatives of what a search engine can be.

Let’s be clear: the very fact that the Knight Foundation approved the release of the official grant contract is a demonstration of their integrity and why they are such an important partner to the Wikimedia movement. The issue should not be about whether the “Knowledge Engine by Wikipedia” is a good idea, or what it looks like. The issue is that it has been prepared secretively, arguably without clarity even for those who did know about it in advance, and certainly without a strategic plan.

“Non commercial”

The document itself refers to “non commercial” several times, and seems to be using the term loosely. Nevertheless, it seems clear to me that any reasonable person who is not deeply immersed in copyright debates about the definition of “free” would understand the words “non commercial” in the context of this document to mean that the search engine is operated non-commercially. Now, I do acknowledge that a grant-request is by definition a “sales pitch” and you have to write your request using the terminology and focus areas of the grant-giver. However, it is my understanding that Lila specifically wanted to build this – a competitor to Google – and that this is most clearly expressed in the summary on page 10. It describes the 6 principles through which the “Knowledge Engine by Wikipedia” will “upend the commercial structure [of search engines]”. These are Public Curation, Transparency, Open Data, Privacy, No Advertising and ‘Internalisation’.

Nothing in this document talks about ways to limit the content of the search engine to only “non commercial” stuff (and if it did, we would be talking about partnering with search.CreativeCommons.org).

Lack of Strategy

Now, maybe an open-source search engine would be a good thing for the WMF to create! But that would be a major strategic decision. It would be, in effect, a new sister project to sit alongside (above?) Wikipedia, Commons, Wikidata etc. It is arguably within the Wikimedia mission statement to build something like that. That is not the problem. The problem is the secrecy. Or, as summarised by TheDJ, “Great idea… terrible management”.

The “Knowledge Engine by Wikipedia” concept appears nowhere in the current strategy consultation on Meta. As I wrote on this blog last week in Strategy and Controversy, Part 2: “Of 18 different approaches identified in the…consultation process only one of them seems directly related to [search]: ‘Explore ways to scale machine-generated, machine-verified and machine-assisted content’. It is also literally the last of the 18 topics listed”.

It seems to me extremely damaging that Lila has approached an external organisation for funding a new search engine (however you want to define it), without first having a strategic plan in place. Either the Board knew about this and didn’t see a problem, or they were incorrectly informed about the grant’s purpose. Either is very bad. And let me be very clear – this is not a case of the WMF Grants department going off by themselves. This is an executive decision by either the Board-to-Lila, or Lila by herself. The latter seems more likely given her own statement on her talkpage:

“In the staff June Metrics meeting in 2015, the ideation was beginning to form in my mind from what I was learning through various conversations with staff. I saw the Wikimedia movement as the most motivated and sincere group of beings, united in their mission to build a rocket to explore Universal Free Knowledge. The words “search” and “discovery” and “knowledge” swam around in my mind with some rocket to navigate it. However, “rocket” didn’t seem to work, but in my mind, the rocket was really just an engine, or a portal, a TARDIS, that transports people on their journey through Universal Free Knowledge.”

[The original logo of the project – a search icon inside a rocket – from a presentation dated June 30]

As pointed out by Risker back in May 2015, the Search team had already been created and seemed disproportionately large. It seems clear to me that this was done in anticipation of the “Knowledge Engine by Wikipedia” project, as it was described in this grant document. Since then, this very high initial target has been reduced, a lot. It is now defined as: “improving the existing CirrusSearch infrastructure with better relevance, multi language, multi projects search and incorporating new [external] data sources for our projects.” (as described in the Discovery Department’s FAQ response to the question – “are you building Google?”).

However, this change is not represented in the actual grant document. The deliverables for this stage of the project are improvements on existing products – but the overarching purpose of the project is most certainly not. That either means we misled the Knight Foundation at the start, or we changed our mind since then but didn’t tell them. Much more likely, in my opinion, is that the Knight Foundation knew that trying to create a non-profit search engine was high-risk or at the very least extremely ambitious. So instead they gave a smaller exploratory grant which helps to fund some genuinely useful activities (the “outcomes” of this first stage, as listed on page 3 and also page 12).

Also, let me reiterate – improving the “discoverability” of our own content across wikis/sister-projects is a very good goal. Consolidation/Integration of projects’ content is much desired (e.g. the much-longed-for ‘structured data on Commons’ project) and everything on the Discovery team’s own list of current priorities is great. However, those things are not what have been “sold” as the end result of this grant, even taking into account the adaptability inherent in agile software development projects.

[The first public appearance of the words “knowledge engine”. Slide 50 of the WMF June 2015 Metrics meeting presentation]

Cost

Page 10 of the grant text specifically says that the cost of the first stage of “Knowledge Engine by Wikipedia” is $2.4 million, and that the grant is for 1 year starting in September 2015. Page 2 says that the whole project is in 4 stages, each lasting approximately 18 months, which adds up to about 6 years. This grant of $250,000 therefore covers only about 10% of the cost of the first stage of the total project.

As SarahSV asked on Wikimedia-l (reiterated by Pine), “The document says the ‘Search Engine by Wikipedia’ budget for 2015–2016 ($2.4 million) was approved by the board [page 9]. Can you point us to which board meeting approved it and what was discussed there?” I second this question, because I’m not seeing it in the current annual plan.

There is no way that Lila approached the Knight Foundation asking to fund only 10% of the first year of a 6 year project. Instead, as revealed in The Signpost, the actual amount initially requested as a grant was $6 million over three years and negotiations have concluded at $250,000 for the first year only. The first stage is also the cheapest of the four stages (“discovery, advisory, community, extension” – described on page 2). Per the document quoted in The Signpost, the budget was expected to “increase by 20% per year as we accelerate the growth of the program” with the 2017–18 estimate at $3.5 million.

We can therefore reliably extrapolate that $12 million is the absolute minimum amount that was planned to be spent over six years. As pointed out by Doc James on the [public] WikipediaWeekly Facebook group – estimates presented to the board were in the range of tens-of-millions.
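
The figures in this section can be cross-checked with a little arithmetic. What follows is only a rough sketch, not the WMF's budget model. It assumes that the 20% increase compounds annually from the $2.4 million first-year figure, and that the $12 million floor is reached by doubling the $6 million three-year request to cover six years. Both are one possible reading of the numbers rather than anything stated in the grant, and the function name is purely illustrative.

```python
# Figures quoted in the post, in US dollars.
GRANT = 250_000                 # Knight Foundation grant for year one
FIRST_YEAR_BUDGET = 2_400_000   # stated cost of the first stage / 2015-16 budget
INITIAL_REQUEST = 6_000_000     # amount initially requested over three years
GROWTH = 0.20                   # "increase by 20% per year"


def projected_budgets(first_year, growth, years):
    """Yearly budgets under compound growth (an assumption, not from the grant)."""
    return [first_year * (1 + growth) ** i for i in range(years)]


# Share of the first-year budget covered by the grant.
grant_share = GRANT / FIRST_YEAR_BUDGET

# Six yearly budgets: index 0 is 2015-16, index 2 is 2017-18.
budgets = projected_budgets(FIRST_YEAR_BUDGET, GROWTH, 6)

# Assumed route to a six-year floor: the three-year request, doubled.
doubled_request = INITIAL_REQUEST * 2

print(f"Grant share of first-year budget: {grant_share:.1%}")
print(f"2017-18 budget under 20% growth: ${budgets[2]:,.0f}")
print(f"Six-year total under 20% growth: ${sum(budgets):,.0f}")
print(f"Three-year request doubled: ${doubled_request:,}")
```

Under these assumptions the third-year figure lands close to the $3.5 million estimate quoted in The Signpost, and the compounded six-year total comes out well above the $12 million floor.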

The Signpost also revealed that the WMF hoped to fund the difference between their initial request to the Knight Foundation (let alone the much reduced amount they actually received) with “…funding from the Wikimedia Foundation’s general fund or from additional restricted grants”. The WMF “general fund” in this sentence can only mean the revenues raised through the annual fundraiser. This makes its deliberate absence from any documents shown to donors, or itemised in the annual plan, all the more concerning. It cannot be that this project was a secret in order that “Google doesn’t find out”. That would be a misunderstanding of our mission and values, and it also underestimates their intelligence – Google would notice anyway once we started actually building “it”. So, why the secrecy‽

It is inconceivable to me that the Executive Director would privately propose to an external partner that they would undertake a six-year project to build a search engine that will have massive cost, staffing, strategic and content implications – entirely without an official WMF strategy covering that period, no indication in the current annual plan, without the awareness of the community, and unclearly communicated to the Board. I find the fact that this could have been done to be a deep breach of our values – and not wholly unrelated to the current sudden exodus of long-serving senior members of WMF staff.

[The first post in this series “Strategy and Controversy” was published on January 8. Part 2 was published on January 30.]

19 Responses to Knowledge Engine by Wikipedia

Nemo says:

February 15, 2016 at 20:44

I think it would be useful if the discussion focused on concrete matters, like the seemingly pointless power struggle by some WMF product manager to seize http://www.wikipedia.org (which has always been managed by the community via Meta-Wiki administrators). https://meta.wikimedia.org/wiki/Wikipedia.org_Portal_Improvements

Or the very useful investigation work performed for the first time in ages on the correctness of search results, in multiple languages, by some new hire https://www.mediawiki.org/wiki/User:TJones_%28WMF%29/Notes (it feels like the last time someone actually looked into search results was 2011, with http://laxstrom.name/blag/2012/02/13/exploring-the-states-of-open-source-search-stack-supporting-finnish/ ).

Or the mysterious refusal by someone in the Discovery team to work on the longest-held request by the community as regards search, i.e. the inter-project and inter-language search (https://phabricator.wikimedia.org/T109957#1909202 + https://phabricator.wikimedia.org/T3837#1909315 ), which was made technically possible by the CirrusSearch/ElasticSearch work in 2013 and then mysteriously abandoned.

In the best case, we’ll discover that all this “knowledge engine” chatter was just a fancy wrapping for long-overdue work on 10+ years old bugs and feature requests in our search engine, that will help under-served users in hundreds of languages and hundreds of sister projects. (I’m still hopeful! after all, some good work got done in these months, and the majority of the work done is good, except the http://www.wikipedia.org stuff.) In the worst case, we’ll discover that the chatter *does* match reality and WMF is, as usual, working on weird stuff while neglecting the important work with obvious benefits.
- ckoerner says:
  
  February 17, 2016 at 15:52
  
  Nemo, we met in Lyon at the hackathon last year. I know your intentions are good. Can we move this conversation to a talk page somewhere? I think your points are worth discussing on-wiki and I’d like to get more attention on them from the rest of the Discovery team.
Stas says:

February 16, 2016 at 22:33

Nemo, it seems you’ve missed something that happened lately, because the Discovery team did work on improving search, including inter-wiki and multi-language search (e.g. we’ve just got the TextCat language detection engine working and plan to do A/B tests next to see how it’s going in the field). We also worked on integrating functionality for cross-wiki search; the code is there already but we will have to figure out how to integrate results between wikis, etc. and when to enable it. Our progress can be watched at our sprint board: https://phabricator.wikimedia.org/tag/discovery-search-sprint/ (I admit, it is not the easiest to navigate) and on the team page: https://www.mediawiki.org/wiki/Wikimedia_Discovery. These topics – both multi-lingual search and inter-wiki search – are not abandoned, and are well on our radar, though it takes time to have impact. We are also working on other search aspects – such as a new and improved (faster and more accurate) completion suggester.
And we are looking into search results and how to measure user satisfaction and improve it right now. That’s another topic on our radar right now.

Also, nobody is trying to “seize” anything – the portal was moved to the same platform all other parts of Wikimedia code are already being hosted on – Gerrit. This is the open collaboration platform used for all website code for all Wiki* sites, and all participants have access to it. This was done with the support of the people actually maintaining the portal. Now the team is working on improving the design, the search and other aspects of the portal, based on actual usage data (e.g. A/B tests, surveys, etc).

BTW you are welcome to visit #wikimedia-discovery IRC anytime and ask questions if you’re interested.
- Nemo says:
  
  February 17, 2016 at 08:46
  
  As I said, good work is being done and most of the work done is good. However, the main community requests in your area of work were marked lowest priority by your PM; there is no question about this. If you think his triaging doesn’t match the reality of what you are doing, complain to him, not me.
  - Trey says:
    
    February 17, 2016 at 16:13
    
    Disclaimer: I’m part of the WMF Discovery team. (I am that “some new hire” guy who likes to look at multilingual search results.)
    
    Nemo, could you point to some source for the claim that interwiki search is the main community request for search? One of the Phab tickets you linked to ( https://phabricator.wikimedia.org/T3837 ) has in the description that it’s #73 out of 107 proposals (with 8 votes, compared to the 60-110 votes the top 10 got— https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey#The_top_10_wishes ). On the Search sub-page of the wish list survey (also linked from Phab— https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Search ), there are 5 search-related items, and interwiki search got the fewest votes (and the only opposing vote on the page).
    
    Interwiki search is computationally expensive, and a big project to do right. Enabling it might be as easy as flipping a switch, but the back-end performance costs are currently prohibitive with so many projects, some of which—the bigger Wikipedias—are huge. Some technical detail is available in the Discovery mailing list archives in this thread: https://lists.wikimedia.org/pipermail/discovery/2015-November/000500.html
ckoerner says:

February 17, 2016 at 15:51

Hey Liam,
I’m sorry we’re meeting under such circumstances. I really hoped that the foundation would have been better aligned with the communities about this grant and the work the Discovery team has been tasked with. I’m new to the foundation (still in my first month) and come from the community. Particularly MediaWiki and a little English Wikipedia. Since I’ve joined as a community liaison I’ve shared my concerns about the lack of transparency and early collaboration with this grant. Moving forward I hope you and others reading this will engage with me and the rest of the team regarding discovery-related projects. I will do my best in my role as a staff member and as an individual supporter of the movement to make sure we’re engaging early and with transparency.

If you would humor me I’d like to try and clarify a few points you bring up.

“This document specifically and overtly states that its purpose is to start work on a search engine in opposition to Google/Yahoo.”

The grant mentions Google once in its entirety. The direct language from the grant is as follows:

“Risks: Two challenges could disrupt the project: 1. Third-party influence or interference. Google, Yahoo or another big commercial search engine could suddenly devote resources to a similar project, which could reduce the success of the project. This is the biggest challenge, and an external one.”

Why is Google listed as a risk? Because they could, as an example, easily build a competing platform to index and present all the free knowledge across Wikimedia projects and more. Google could build a “Wikisearch” or whatever and upset all the work the foundation and volunteers have done, and are doing, around improving search within Wikimedia projects. They’ve done things like this before and have the resources to make it more successful than we could do – with our entire fiscal budget, much less a few hundred thousand dollars a year.

As for the vague language, well the plan started much larger and as the scope changed the language did not. Also worth mentioning that the Knight Foundation is not a technical organization. There’s a little bit of salesmanship in grant making.

You’re right that part of the reason the foundation is investing in discovery is from external factors and the evolving landscape of mobile. Keeping up is hard if you don’t pay attention.

I wasn’t at the foundation at the inception and can’t find any documentation of such, but the general gist as I understand it is that the former CTO had a grand vision to take on Google. A big, hairy audacious goal that would have needed some substantial funding. The Knight Foundation came along. I don’t know if as a result of this goal or just around the same time. The leader left, things got shuffled and the WMF continued to engage with the Knight Foundation. The scope was reduced, the language was not modified as well as it could have been, and the grant was accepted. Here we are today. /throws confetti.

It sucks, we’re learning our lesson and I would have much rather engaged with you and the rest of the Wikimedia community under better terms. I joined the foundation because I wanted to spend more time building a community around the amazing work volunteers such as yourself contribute.

There is a blog post up now that helps clarify a little as well. I’m sure you’ve read it. Brion clarifies a few points that I too have heard. I’ve also shared similar clarification as well. http://blog.wikimedia.org/2016/02/16/wikimedia-search-future/

One additional bit of good faith feedback.

You use the word “seems” quite a few times in your analysis. As a fellow writer I’d encourage you to pick a stronger choice. “Seems” can be considered by many to be a weasel word – https://en.wikipedia.org/wiki/Weasel_word. Should anyone make any counter-claim, weasel words give you a way out. I don’t think ambiguity is your goal here, but clarity. At least, it ‘seems’ that way. 🙂
- wittylama says:
  
  February 17, 2016 at 16:42
  
  Hi Chris,
  as I have attempted to explain, my critique is not of the work that is currently being done (or the scope of that work) by the Discovery team. There is a well recognised need for improved search (broadly construed) in Wikimedia.
  
  My first concern is with HOW this project could have been created in the first place in its all-singing-all-dancing version without a strategic plan in place and deliberately kept off the books of the current annual plan. The Discovery team was hired *prior* to the grant being awarded, meaning that there was a strategic investment in the topic before funding was secured – and clearly there was a lot LESS funding secured than originally hoped for. This has caused significant shuffling across the organisation to account for this budgetary shortfall. Maybe the ‘larger’ version was a good idea, but it’s simply not appropriate to pitch such a concept, at that scale, without the organisation clearly aligned behind it. Furthermore, I don’t buy the argument that this was all Damon’s fault and then things got better when he left. He was not acting alone. Either the Board had approved this initial conceptualisation – despite a lack of strategy, or they weren’t fully aware of what was being said in their name. Neither is acceptable for any non-profit, let alone one that is supposed to pride itself on transparency.
  
  My second concern is WHY this was done secretly and why, even at this stage of the game, long after the person who’s currently being blamed for the mess was gone, it remained so incredibly secret and obfuscated. Neither of those things is acceptable either. This is emblematic of a pattern of behaviour – it’s not just about 1 document. And it’s a shame that your team’s current work is caught up in it.
  
  As for “seems”, I deliberately included this as an indicator that the sentence in question is my supposition/extrapolation – as opposed to other sentences that were citing facts. I think I’m right (otherwise I wouldn’t say it!) but I wanted the reader to be clear where it was my own thoughts interjecting into the narrative.
  
  And finally, since we’re talking weasel words, phrases like “moving forward”, “engage with”, “reach out” are horrible bureaucratese neologisms. I’d strongly recommend not using such phrasal verbs – as we’ve previously discussed: https://www.mediawiki.org/wiki/Talk:Wikimedia_Discovery/RFC#Comprehensibility
  - ckoerner says:
    
    February 19, 2016 at 16:07
    
    Fair enough. Have you seen the FAQ that Lila set up? If you feel like you’re not getting answers to your questions it’s a good place to engage wi… work with us on clearing the air around this. 🙂
    
    https://meta.wikimedia.org/wiki/Knowledge_Engine/FAQ
  - Kevin Smith says:
    
    February 19, 2016 at 17:16
    
    As someone who worked with the Search team before it became the Discovery Department, I still struggle with the notion that the department was staffed up from the start to take on some massive project. Nobody was “hired” at that point–people were just shifted in from other teams.
    As I remember it, in early April (before the re-org that created the Discovery department), 3 developers were already working on some combination of elastic search and Wikidata Query Service (WDQS). One more was added to work on search as part of the re-org. 2 developers who had already been working on maps were brought in under the “Discovery” umbrella. The department also got a much-needed Product Manager and Engineering Manager to coordinate the work (previously the search, WDQS, and maps work had little guidance). And they picked up a data analyst and a UI designer, which made sense given the new “vertical” orientation, where each department was supposed to be able to operate mostly autonomously. As a member of Team Practices, I was assigned to spend part of my time working with them as an Agile Coach.
    So looking back, there were roughly 3 developers working on “search” (not counting WDQS, which at that time was seen as quite distinct, although with an eye toward eventually being able to use it to potentially answer direct questions). To me, that doesn’t seem like an unreasonable number just to maintain and improve the search feature of a widely-used platform (mediawiki) and a top-10 family of web sites (wikimedia).
    We now know that at that point there were some big dreams happening at the executive level. But I don’t see evidence that they had manifested themselves in terms of staffing.
    Also, I am not/was not aware of “significant shuffling across the organization to account for this budgetary shortfall.” All teams throughout the foundation were at one point encouraged to dream big and submit an optimistic budget, and then right before the annual plan was finalized, everyone had to pull back to a realistic budget with minimal year-over-year growth. That was frustrating and painful, and I suppose it’s possible that it was somehow related to “this” shortfall, but it’s not clear to me that it was.