We’re in the business of natural language processing with lots of different languages. So far we’ve worked on (big breath): English, Portuguese (Brazilian and from Portugal), Spanish, Italian, French, Russian, German, Turkish, Arabic, Japanese, Greek, Mandarin Chinese, Persian, Polish, Dutch, Swedish, Serbian, Romanian, Korean, Hungarian, Bulgarian, Hindi, Croatian, Czech, Ukrainian, Finnish, Hebrew, Urdu, Catalan, Slovak, Indonesian, Malay, Vietnamese, Bengali, Thai, and a bit on Latvian, Estonian, Lithuanian, Kurdish, Yoruba, Amharic, Zulu, Hausa, Kazakh, Sindhi, Punjabi, Tagalog, Cebuano, Danish, and Navajo.

    Natural language processing (NLP) is about finding patterns in language: for example, taking heaps of unstructured text and automatically pulling out its structure. The open secret about NLP is that it’s very English-centric. English is far and away the language that linguists have worked on the most, and it’s also the language that has the most available resources for computer science projects (and more data is almost always better in computer science). So one of the best ways to test an NLP system is to try languages other than English. The better a system can deal with diverse data, the more confident you can be in its ability to handle unseen data.

    To this end, we might choose to define “weirdness” in terms of English. But that’s a pretty irritating definition. Let’s try to do something different.

    A global method for linguistic outliers

    The World Atlas of Language Structures evaluates 2,676 different languages in terms of a bunch of different language features. These features include word order, types of sounds, ways of doing negation, and a lot of other things—192 different language features in total.

    So rather than take an English-centric view of the world, WALS allows us to take a worldwide view. That is, we evaluate each language in terms of how unusual it is for each feature. For example, English word order is subject-verb-object (SVO); there are 1,377 languages that are coded for word order in WALS and 35.5% of them have SVO word order. Meanwhile only 8.7% of languages start with a verb (like Welsh, Hawaiian and Majang), so cross-linguistically, starting with a verb is unusual. For what it’s worth, 41.0% of the world’s languages are actually SOV order. (Aside: I’ve done some work with Hawaiian and Majang and that’s how I learned that verbs are a big commitment for me. I’m just not ready for verbs when I open my mouth.)

    The data in WALS is fairly sparse, so we restrict ourselves to the 165 features that have at least 100 languages in them (at this stage we also knock out languages that have fewer than 10 of these—dropping us down to 1,693 languages).
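
The filtering step above can be sketched roughly like this. It is a minimal sketch, not the code used for the post: the `codings` layout (language name mapped to a dict of feature to value) and the function name `filter_sparse` are assumptions made for illustration, and only the two thresholds (100 languages per feature, 10 features per language) come from the text.

```python
from collections import defaultdict

def filter_sparse(codings, min_langs_per_feature=100, min_features_per_lang=10):
    """Drop thinly covered features, then thinly covered languages.

    `codings` is a hypothetical layout: {language: {feature: value}}.
    """
    # Count how many languages are coded for each feature.
    langs_per_feature = defaultdict(int)
    for features in codings.values():
        for feature in features:
            langs_per_feature[feature] += 1
    kept_features = {
        f for f, n in langs_per_feature.items() if n >= min_langs_per_feature
    }

    # Keep only the well-covered features, then drop languages that
    # have too few of those features filled in.
    filtered = {}
    for lang, features in codings.items():
        kept = {f: v for f, v in features.items() if f in kept_features}
        if len(kept) >= min_features_per_lang:
            filtered[lang] = kept
    return filtered
```

Note the order of operations in this sketch: features are filtered first, and a language is only counted against the features that survive, which matches the "fewer than 10 of these" wording.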

    Now, one problem is that if you just stop there you have a huge amount of collinearity. Part of this is just the nature of the features listed in WALS—there’s one for overall subject/object/verb order and then separate ones for object/verb and subject/verb. Ideally, we’d like to judge weirdness based on unrelated features. We can focus in on features that aren’t strongly correlated with each other (between two correlated features, we pick the one that has more languages coded for it). We end up with 21 features in total.
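
One way to picture the pruning of correlated features is the sketch below. The post does not say which association measure or cutoff was used, so Cramér's V and the 0.5 threshold here are assumptions, as are the `codings` layout ({language: {feature: value}}) and the function names. The only part taken from the text is the tie-break: between two correlated features, keep the one coded for more languages.

```python
import math
from collections import Counter

def cramers_v(pairs):
    """Cramér's V for a list of (value_a, value_b) pairs (assumed measure)."""
    n = len(pairs)
    row = Counter(a for a, _ in pairs)
    col = Counter(b for _, b in pairs)
    joint = Counter(pairs)
    k = min(len(row), len(col))
    if n == 0 or k < 2:
        return 0.0  # not enough variation to measure association
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))

def prune_correlated(codings, threshold=0.5):
    """Greedily keep features, best-covered first, skipping any feature
    that is strongly associated with one already kept."""
    coverage = Counter(f for feats in codings.values() for f in feats)
    kept = []
    for feature in sorted(coverage, key=lambda f: (-coverage[f], f)):
        correlated = False
        for other in kept:
            # Compare only languages coded for both features.
            pairs = [
                (feats[feature], feats[other])
                for feats in codings.values()
                if feature in feats and other in feats
            ]
            if cramers_v(pairs) >= threshold:
                correlated = True
                break
        if not correlated:
            kept.append(feature)
    return kept
```

Visiting features in order of coverage means that whenever two features clash, the better-covered one is already in the kept list and the sparser one is the one dropped.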

    For each value that a language has, we calculate the relative frequency of that value for all the other languages that are coded for it. So if we had included the subject/object/verb order feature, then English would’ve gotten a value of 0.355 for it (we actually normalized these values according to the overall entropy for each feature, so it wasn’t exactly 0.355, but you get the idea). The Weirdness Index is then an average across the 21 unique structural features. But because different features have different numbers of values and we want to reduce skewing, we actually take the harmonic mean (and because we want bigger numbers = more weird, we actually subtract the mean from one). In this blog post, I’ll only report languages that have a value filled in for at least two-thirds of features (239 languages).
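
Here is a rough sketch of that calculation. It is deliberately simplified: the entropy normalization is left out because the post doesn't spell out how it was done, and the `codings` layout ({language: {feature: value}}) and the name `weirdness_index` are assumptions for illustration rather than the actual code behind the numbers.

```python
from collections import Counter
from statistics import harmonic_mean

def weirdness_index(codings, language):
    """1 minus the harmonic mean of how common each of the language's
    values is among the OTHER languages coded for the same feature."""
    frequencies = []
    for feature, value in codings[language].items():
        others = [
            feats[feature]
            for lang, feats in codings.items()
            if lang != language and feature in feats
        ]
        if not others:
            continue  # no other language is coded for this feature
        frequencies.append(Counter(others)[value] / len(others))
    # Subtract from one so that bigger numbers mean "more weird".
    return 1 - harmonic_mean(frequencies)

# Toy data, invented purely for illustration (not WALS values).
toy = {
    "A": {"question": "particle", "subject": "affix"},
    "B": {"question": "particle", "subject": "affix"},
    "C": {"question": "particle", "subject": "affix"},
    "D": {"question": "inversion", "subject": "affix"},
    "E": {"question": "inversion", "subject": "pronoun"},
}
for name in ("A", "D"):
    print(name, round(weirdness_index(toy, name), 3))
```

The harmonic mean is pulled down hard by small values, so a single very rare feature value raises a language's index more than it would under an ordinary average.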

    The outlier (weirdest) languages

    The language that is most different from the majority of all other languages in the world is a verb-initial tonal language spoken by 6,000 people in Oaxaca, Mexico, known as Chalcatongo Mixtec (aka San Miguel el Grande Mixtec). Number two is spoken in Siberia by 22,000 people: Nenets (that’s where we get the word parka from). Number three is Choctaw, spoken by about 10,000 people, mostly in Oklahoma.

    But here’s the rub: some of the weirdest languages in the world are ones you’ve heard of: German, Dutch, Norwegian, Czech, Spanish, and Mandarin. And actually English is #33 in the Language Weirdness Index.

    The weirdest languages in the world

    The 25 weirdest languages of the world. In North America: Chalcatongo Mixtec, Choctaw, Mesa Grande Diegueño, Kutenai, and Zoque; in South America: Paumarí and Trumai; in Australia/Oceania: Pitjantjatjara and Lavukaleve; in Africa: Harar Oromo, Iraqw, Kongo, Mumuye, Ju|’hoan, and Khoekhoe; in Asia: Nenets, Eastern Armenian, Abkhaz, Ladakhi, and Mandarin; and in Europe: German, Dutch, Norwegian, Czech, and Spanish.

    By the way, how awesome of a name is “Pitjantjatjara”? (Also: can you guess which one of the internal syllables is silent?)

    Questions and pronouns: two example features

    This is odd. Is this odd? One of the features that distinguishes languages is how they ask yes/no questions. The vast majority of languages have a special question particle that they tack on somewhere (like the ka at the end of a Japanese question). Of 954 languages coded for this in WALS, 584 of them have question particles. The word order switching that we do in English only happens in 1.4% of the languages. That’s 13 languages total and most of them come from Europe: German, Czech, Dutch, Swedish, Norwegian, Frisian, English, Danish, and Spanish.

    But there is an even more unusual way to deal with yes/no questions and that’s what Chalcatongo Mixtec does, which is to do nothing at all. It is the only language surveyed that does not have a particle, a change of word order, a change of intonation… There is absolutely no difference between an interrogative yes/no question and a simple statement. I have spent part of the day imagining a game show in this language.

    Another thing languages have to deal with is what to do with simple subjects like I, they, or it. These are called pronominal subjects (something like The minister prevaricated has a nominal subject). The most common way to do this is to just tack the information about the subject on to the verb—437 out of 711 languages do this, like Spanish, Italian, and Portuguese. But Dutch, German, and Norwegian—like English—prefer having special subject pronouns that are normally/obligatorily present. But this is only done by 82 of the 711 languages coded in WALS. Kutenai (100 speakers in British Columbia, Canada) and Mumuye (400,000 speakers in Nigeria) do something even more unusual: they have something like subject pronouns but these go in different positions in the syntax than where full noun phrases go. And even more unusual than this is Chalcatongo Mixtec again: they combine several strategies so they have both subject markers that they add to verbs and they have pronoun words, too. But these pronoun words appear in a different spot from where a full noun phrase would show up.

    The 5 least weird languages in the world

    Now if I asked you to consider these languages, how weird would you say they were? Lithuanian, Indonesian, Turkish, Basque, and Cantonese. Surprise! They are really low on the Weirdness Index. They don’t seem typical to linguists and language learners but for these 21 features they stick with the crowd. Notice that we get isolates (like Basque) distributed throughout levels of Weirdness. Basque is “typical” but Kutenai, another isolate, is one of the weirdest of all languages. Even more surprising is that Mandarin Chinese is in the top 25 weirdest and Cantonese is in the bottom 10. This has to do with the fact that they have different sounds: Mandarin, unlike Cantonese, has uvular continuants and has some limits on “velar nasals” (like English, Mandarin can have a sound like the one at the end of song but it can’t have that sound at the beginning of words; worldwide it’s rare to have that particular restriction).

    At the very very bottom of the Weirdness Index there are two languages you’ve heard of and three you may not have: Hungarian, normally renowned as a linguistic oddball, comes out as totally typical on these dimensions. (I got to live in Budapest last summer and I swear that Hungarian does have weirdnesses, it just hides them in other places.) Chamorro (a language of Guam spoken by 95,000 people), Ainu (just a handful of speakers left in Japan, it is nearly extinct), and Purépecha (55,000 speakers, mostly in Mexico) are all very normal. But the very most super-typical, non-deviant language of them all, with a Weirdness Index of only 0.087, is Hindi, which has only a single weird feature.

    Part of this is to say that some of the languages you take for granted as being normal (like English, Spanish, or German) consistently do things differently than most of the other languages in the world. It reminds me of one of the basic questions in psychology: to what extent can we generalize from research studies based on university students who are, as Joseph Henrich and his colleagues argue, Western Educated Industrialized Rich and Democratic. In other words: sometimes the input is WEIRD and you need to ask yourself how that changes things.

    You’re weird

    Even though the methods here don’t define things in terms of English, they still smuggle in some cultural-specificity. That is, the linguists who developed and annotated the features were mostly speakers of European languages. What features might a person from Papua New Guinea or Ethiopia or the Amazon have come up with instead? And of course, WALS doesn’t have any data at all on about 4,000 languages. And the languages that it has the most data for are not truly random.

    Despite this, English still ranks as highly unusual (it comes in as #33 with an index value of 0.756). That English-speaking brain you’ve been using to read this? It’s wired weird.

    – Tyler Schnoebelen (@TSchnoebelen)

    Appendix: The tops and bottoms

    Here are the values for the top and bottom 10 languages.

    Rank  Language                Weirdness Index
    1     Mixtec (Chalcatongo)    0.972
    2     Nenets                  0.935
    3     Choctaw                 0.924
    4     Diegueño (Mesa Grande)  0.920
    5     Oromo (Harar)           0.919
    6     Kutenai                 0.908
    7     Iraqw                   0.900
    8     Kongo                   0.883
    9     Armenian (Eastern)      0.861
    10    German                  0.858

    230   Basque                  0.189
    231   Bororo                  0.153
    232   Quechua (Imbabura)      0.151
    233   Usan                    0.151
    234   Cantonese               0.143
    235   Hungarian               0.132
    236   Chamorro                0.128
    237   Ainu                    0.128
    238   Purépecha               0.100
    239   Hindi                   0.087

     

    Update: Here is the full list, with the 21 weirdness features and all of the languages that had values for at least one of them (don’t trust those values, of course).

    Weirdness_index_values_full_list

      Tyler Schnoebelen

      Tyler finds the patterns in data that make it meaningful. He has ten years of experience in UX design/research in Silicon Valley and a PhD from Stanford. His work there included experimental psycholinguistics, fieldwork on endangered languages, and a dissertation on emotion (he got his BA at Yale studying playwriting and poetry). His insights on social media have been featured in The New York Times Magazine, The Boston Globe, The Atlantic, and NPR. He is incorrigible.


      146 thoughts on “The weirdest languages”

      1. I love this article! I had been wondering about this. Being Dutch myself, and teaching Dutch to immigrants, I had been wondering also if the mistakes that are still typically made by immigrants who have learned Dutch thoroughly and have been here for a long time define some of the weirdness of Dutch (mixing up the definite articles ‘de’ and ‘het’ and the use of the elusive word “er”).

      2. Wonderful post. Truly investigative NLP research. In my research, I’ve found way too many knowledge mapping folks bound by English.

      3. You have done wonderful work. So many useful observations.
        I am probably one of the few people who has ever done more digging over a longer period at language usages around the world. Not focused like you, of course.
        My interests are less scientific, more future oriented. My major comment is that you do not take into consideration the success of languages.
        My sin, from the view of most linguists, is to believe that some languages are more effective than others and that an evolutionary process can be involved.
        Take SOV vs. SVO. You get a very different weirdness result if you count the number of speakers (not count all the little endangered languages as equal to Mandarin, English, Spanish, Russian, Portuguese, etc.).
        There have been trends. The descendants of Latin are pretty much all SVO. Archaic Chinese was SOV — and Cantonese remains so. But Classical Chinese and now Mandarin and most dialects are SVO. The wave of the future in China is clearly SVO. Are languages becoming weird? I doubt it.
        If you apply a nose count to your languages, what language patterns then become the most characteristic of success? What structures are then shown as left behind — as ‘weird’ and less usual, if not failed ways of saying things?

        • Tyler

          What counts as success in language? It’s probably not possible to show that one language is less expressive than another. Is one better at romancing, childcare, gathering, warfare? (“I would have had twelve children and they would have survived and we would’ve together conquered new lands…if only I hadn’t been forced to start my sentences with a verb!”) Evolution arguments sometimes go towards the brain and one might ask about “cognitive load” but even if verb-initial languages are hard for me, personally, they don’t seem to be for native speakers. And thinking evolutionarily, there are all sorts of frills that tax living organisms and we often talk about these as adaptive so even if I granted a higher cognitive load, does that necessarily translate to problems?

          To take it back to word order: there are Indo-European languages that are verb-initial (Welsh, Gaelic, Irish, Breton). Dutch, German, Bulgarian, and Greek are said to have no dominant word order. When in history do people gain and lose their language advantages? And of course probably all languages have moments where they put verbs first. In English, consider questions and imperatives. At first blush, those seem like very important speech acts for evolutionary arguments to consider.

          Certainly we have all been in situations where we have felt more-or-less effective in our communication. And while there’s going to be variation for everyone, it may well be that some individuals tend to be more effective in communicating than others. When you talk about communicative effectiveness at a language-level, it becomes this interesting question about how individual actions and overall structures interact (there’s clearly an interplay: structures are built out of individual choices, but individual choices are constrained/shaped by the structure of the system they come from).

          I think this is at the heart of the issue: the reasons why individuals and communities thrive or die are hugely complicated and intersected. Given all the historical contingencies (“I got to be born in a fertile plain, hooray”, “there are no mammals here to domesticate, #sadface”), how would we tease out the role of language structure, which even a supporter of the evolution-effectiveness hypothesis would have to grant is likely to be dwarfed by all the other factors going on.

      5. Tyler

        A few items from Facebook and Twitter.

        @DrDawg asked about Pirahã (down in the Amazon, said to lack recursion, you might have seen press on it)–it gets a high weird value of 0.74. That makes it #38 for well-attested languages, but English is weirder (#33, 0.76). Nez Perce may also lack recursion, but this is not something that is coded in WALS yet since I believe they are interested in features with a bit more variation. But even without this feature our method here does discover it as a highly odd language.

        People guessed Athabaskan and Caucasian languages are weird, here are the numbers:
        – Some Athabaskan languages are pretty weird (Hupa=0.72, Chipewyan=0.61) but some aren’t (Navajo=0.54, Slave=0.28).
        – Northwest Caucasian languages are also weird, like Abkhaz on the map (0.84). Kabardian and Ubykh only have values for 9 features, but they score 0.98 and 0.88, respectively.

        People guessed Hungarian as weird, but it’s not (see above). They also guessed its distant cousin, Finnish, which is also not weird, though it is weirder than Hungarian (0.47).

        Hindi is super-not weird, as reported above. Nor are Turkish or Basque, which people also guessed as weird.

        @sapniic asked about Lithuanian and Latvian: Lithuanian=0.26, while Latvian=0.49. That’s because Latvian does strange things with tense/aspect and with pronominal subjects.

        @StuartRobinson asked about Rotokas, but there’s not a lot of information in WALS about it. One thing that does make it weird is that it has a “th” sound.

        Some folks on Facebook guessed that Romance languages would be “not weird”. By these measures, they are a mixed bag: (Spanish=0.79, French=0.75, Italian=0.35, Romanian=0.30, Portuguese=0.17).

        Report your guesses, ask any questions!

      6. Can you identify the 21 features you ended up using?

        Also, how do you deal with languages that have various options (e.g. French has both a particle-like expression “est-ce que” and inversion) or other complications (e.g. Hebrew is partially pro-drop, only in some tenses and only in some persons)?

        • Tyler

          Yep, added the spreadsheet to the bottom of the post.

          WALS researchers created strict definitions for coding. They may not always be what you or I would choose, but they do work hard to make them consistent and reasonable. Here’s a link to how yes/no questions are done (French is listed as having a particle by WALS): http://wals.info/chapter/116

      7. miko sloper says:

        i am also interested in the future of languages, and how they survive, compete and flourish. this weirdness scale is a lovely insight into languages.
        i wonder how Esperanto would fit on this scale. it was designed to be easy to learn: does that map onto “not weird”?

        • Tyler

          I agree–this would be great to get Esperanto on to. A major reason behind Esperanto was to create a global language where no one had a leg up (e.g., I have an advantage in the world because English is so powerful and I’m a native speaker). But Esperanto is clearly heavily heavily heavily European in its structure. There was likely an implicit bias that “easy to learn for Westerners” would be the same as “easy for anyone to learn”.

      8. You seem to rely pretty heavily on the list you created, but if there are so many big surprises, are you sure you counted in every aspect that makes a language really weird?

        • Tyler

          It’s the surprises that make it fun, though, right? I am limited by the fact that I chose WALS, which is as far as I know the most comprehensive record of language structures around the world. But they don’t tend to choose features that have “99% of languages doing one thing and then one language doing something totally different”. A weirdness index made up of those features would certainly be interesting.

      9. That’s great.

        I’d love to see the full 239 item list; just knowing the top and bottom 10 (and knowing English ranks #33) is not enough!
        I want to be able to tell people how weird or normal their language is.

      10. One thing that does stand out to me is that French is not counted as one of the languages that inverts subject and verb for the question. Perhaps they didn’t count it because the language also has a tag question option for asking questions with “est-ce que.” However, the tag question “est-ce que,” which basically means “Is it that” is an inversion in itself, altered from “C’est que” or “It is that.” At what point does “est-ce que” transition from an inversion tag to simply a tag question? How does one draw the line?
        Also, how did the researchers decide to handle languages that have both options for questions?

        • Tyler

          There are definitely lots of ways to ask questions in French (and other languages). The WALS researchers took pains to come up with coding schemes that, even if contestable, were consistent.

          French is classified as having a question particle. You can read about that here: http://wals.info/chapter/116

      11. Guy Mcilroy says:

        Wonderful article, what about Sign Languages? how weird are these? maybe look at American Sign Language/South African Sign Language? i am interested in what you find out. Btw, SASL, my language, uses SOV word order: ME BOOK NEED.

      12. “They also guessed it’s distant cousin, Finnish” -> It’s != Its
        Anyway, does anyone know where I can see the whole list?

      13. I am feeling very proud that Hindi is the least weird language with a weirdness index of just 0.087 with only one weird feature!!

      14. Paul Vinkenoog says:

        Having different articles for the grammatical genders is definitely not a ‘weird’ feature. ‘Er’ is trickier. Sometimes it must be translated as ‘ci’ (Italian) and ‘there’ (English), sometimes as another word and often it must simply be left out in translation. The point is that many people just can’t let go of the structures and peculiarities of their native language and will never speak a new language correctly. Others start to ‘feel’ the language after a while and speak like a native, regardless of differing grammatical features, complexity or ‘weirdness’.

        • Tyler

          Ah, the feature about gender that is part of the calculation isn’t about grammatical gender of articles but of gender distinctions in independent person pronouns (is there “he” vs. “she”, etc). And most languages *do not* make a division. http://wals.info/feature/44A

          (Fwiw, my main fieldwork language, Shabo, makes a distinction not just for third person, but for first and second person, and not just for singular but for dual and plural, too.)

      15. Very interesting indeed. I see that Latin and classical Greek are not included in WALS, although very much is known about their structure, probably more than about any other pair of languages (research history of over 2000 years). Moreover, these languages form a substantial part of the curriculum of the grammar school type (‘Gymnasium’) in European countries. It is often suggested that learning these languages has an additional value as an intellectual challenge – apart from getting acquainted with ancient culture. It would be interesting for pupils and teachers to know the W-index of these languages, in particular in comparison with the index of their mother tongue.

      16. I find it very weird that Norwegian is there while Danish and Swedish are not, since they are very, very similar (especially Norwegian and Danish). For me as a native Swede, I cannot always distinguish written Danish and written Norwegian while I can usually perfectly make out what they say. For me, at least as writing goes, Norwegian lies between Danish and Swedish. As far as spoken languages go, Danish is far more weird. Maybe the geographical location of Denmark makes it more heavily influenced by German, Dutch and French, and thus not as much of an outlier?

        • Tyler

          Actually, Swedish and Danish ARE very weird, but they didn’t make my cut-off of “14 or more of the 21 features attested”. Swedish has 12 of the 21 features listed in WALS, Danish has 13. Both of them are actually weirder than Norwegian if we look the other way about the data sparsity:

          Swedish: 0.86
          Danish: 0.85
          Norwegian: 0.82

          (And I’m totally with you on a personal sense of the weirdness of spoken Danish–I studied it for a teeny tiny sliver of time but I couldn’t make my mouth say “American” in anything like a convincing way.)

      17. Fascinating read. Did your study include the San languages of the Kalahari (Namibia and Botswana), such as !Kung (or !Xuun)? These languages/dialects appear (to my untrained ear) to have nothing in common with anything I’ve heard before. I’d love to know where they sit on the weirdness scale.

        • Tyler

          There are four Khoisan languages that have 10+ values:

          Ju|’hoan (20 of the 21 features filled in): 0.83
          Sandawe (11 of the 21): 0.83
          Khoekhoe (17 of the 21): 0.83
          Korana (11 of the 21): 0.65

          So all of them turn out to be in the “highly weird” range, and not just because of their clicks (which are unusual, as your ear tells you).

      18. Missing word:

        (second paragraph in the Questions and Pronouns section)

        But there is an even more unusual way to deal with yes/no questions and that’s what Chalcatongo Mixtec does: which is to do nothing at all. It is the only language surveyed that does **NOT** have a particle, a change of word order, a change of intonation…There

      19. @LeoMoser

        I’m a native Cantonese speaker and I’m struggling to think of a valid SOV sentence. It is definitely SVO like Mandarin Chinese. I can’t comment on Archaic Chinese as I’m not educated in Classical Chinese.

        • Tyler

          I am not really that familiar with the “Basque is a Caucasian language” hypothesis, but I am skeptical. It is unlikely that the values here will answer the question: the best way to prove a relationship is to show that two languages have words that are related. That doesn’t mean finding words that look the same, it means finding words that are systematically different from one another, so that you can say “ah, there was a protoword Foo but in Language X all the f’s turned to p’s and the oo’s turned to ee’s, while in Language Y, the f’s turned to b’s and the oo’s turned to ow’s and we can show it for multiple words”.
          But as far as the data show–Abkhaz is very weird (0.84) while Basque is very not-weird (0.19). Basque isn’t weird in any of the features that Abkhaz is weird in.

      20. Matt Haggard says:

        Great article! You have a fun job.

        Two typos:

        1. Change “an” to “a”:
        s/with an Weirdness Index/with a Weirdness Index/

        2. Eliminate one “are” surrounding “as Joseph Henrich and his colleagues argue”
        s/based on university students who are, as Joseph Henrich and his colleagues argue, are Western Educated/based on university students who are, as Joseph Henrich and his colleagues argue, Western Educated/

        • Tyler

          Tamil has 12 of the 21 values listed in WALS. Its score is 0.37, which is a pretty low weirdness (the median weirdness score for languages with 12+ values is 0.53, the bottom quartile is at 0.39–so Tamil is down the list a ways).

      21. I think you’re missing a ‘not’ in here: “Chalcatongo Mixtec […] is the only language surveyed that does [not] have a particle, a change of word order, a change of intonation…There is absolutely no difference between an interrogative yes/no question and a simple statement.”

      22. My biggest concern is the big difference in “weirdness” between Spanish and Portuguese, given that they are both genetically and typologically extremely close. Anyway, great great article!! (as a speaker of Basque I feel somewhat proud too)

      23. Tyler

        On the one hand, related languages can diverge in weirdness because the features chosen here are not meant to be correlated, but looking more closely at Spanish and Portuguese, my guess is that it’s about sparseness in Portuguese. While all 21 of the 21 features are filled in for Spanish, only 12 of them are filled in for Portuguese.

        3 of Spanish’s 4 weirdest features are simply blank for Portuguese:

        44A: Gender Distinctions in Independent Personal Pronouns
        111A: Nonperiphrastic Causative Constructions
        19A: Presence of Uncommon Consonants

        There is one weird feature for Spanish that is filled in for Portuguese. This is the “polar question” thing discussed in the body of the post: Spanish moves things around (weird), while Portuguese does the “normal” thing.

        116A: Polar Questions

      24. Tyler

        @Jonathan Yep, added the spreadsheet to the bottom of the post.

        WALS researchers created strict definitions for coding. They may not always be what you or I would choose, but they do work hard to make them consistent and reasonable. Here’s a link to how yes/no questions are done (French is listed as having a particle by WALS): http://wals.info/chapter/116

      25. Tyler

        @ miko: I agree–it would be great to get Esperanto onto this. A major reason behind Esperanto was to create a global language where no one had a leg up (e.g., I have an advantage in the world because English is so powerful and I’m a native speaker). But Esperanto is clearly heavily heavily heavily European in its structure. There was likely an implicit bias that “easy to learn for Westerners” would be the same as “easy for anyone to learn”. [Note: I don’t actually think the Weirdness Index is likely to correlate strongly with “learnability” since how learnable Language X is depends upon a lot of more important factors like motivation and similarity to languages you do speak.]

      26. Tyler

        @Lewistrick: It’s the surprises that make it fun, though, right? I am limited by the fact that I chose WALS, which is as far as I know the most comprehensive record of language structures around the world. But they don’t tend to choose features that have “99% of languages doing one thing and then one language doing something totally different”. A weirdness index made up of those features would certainly be interesting.

      27. Tyler

        @mare and @Pedro: Added to the bottom of the post!

        @Guy: Sadly, sign languages don’t make it into this list. WALS does have two features specifically for sign languages: http://wals.info/chapter/139 and http://wals.info/chapter/140 (on question particles and irregular negatives, respectively).

        @Victoria: There are definitely lots of ways to ask questions in French (and other languages). The WALS researchers took pains to come up with coding schemes that, even if contestable, were consistent.

        French is classified as having a question particle. You can read about that here: http://wals.info/chapter/116

        @Paul: Ah, the feature about gender that is part of the calculation isn’t about grammatical gender of articles but about gender distinctions in independent personal pronouns (is there “he” vs. “she”, etc). And most languages *do not* make a division. http://wals.info/feature/44A

        (Fwiw, my main fieldwork language, Shabo, makes a distinction not just for third person, but for first and second person, and not just for singular but for dual and plural, too.)

        @Aris: (Come back on Monday–we’ll have a post focused on Greek and Latin prefixes!)

        @Robert: Actually, Swedish and Danish ARE very weird, but they didn’t make my cut-off of “14 or more of the 21 features attested”. Swedish has 12 of the 21 features listed in WALS, Danish has 13. Both of them are actually weirder than Norwegian if we look the other way about the data sparsity:

        Swedish: 0.86
        Danish: 0.85
        Norwegian: 0.82

        (And I’m totally with you on a personal sense of the weirdness of spoken Danish–I studied it for a teeny tiny sliver of time but I couldn’t make my mouth say “American” in anything like a convincing way.)

        @Jacques: There are four Khoisan languages that have 10+ values:

        Ju|’hoan (20 of the 21 features filled in): 0.83
        Sandawe (11 of the 21): 0.83
        Khoekhoe (17 of the 21): 0.83
        Korana (11 of the 21): 0.65

        So all of them turn out to be in the “highly weird” range, and not just because of their clicks (which are unusual, as your ear tells you).

        @Ninju, @Matt and @Ben: Got it, fixed. Thanks for the careful reading!

        @Sylvia: (WALS encodes Cantonese as SVO, too.)

        @Abkhazian: I am not really that familiar with the “Basque is a Caucasian language” hypothesis, but I am skeptical. It is unlikely that the values here will answer the question: the best way to prove a relationship is to show that two languages have words that are related. That doesn’t mean finding words that look the same, it means finding words that are systematically different from one another, so that you can say “ah, there was a protoword Foo but in Language X all the f’s turned to p’s and the oo’s turned to ee’s, while in Language Y, the f’s turned to b’s and the oo’s turned to ow’s and we can show it for multiple words”.

        But as far as the data show–Abkhaz is very weird (0.84) while Basque is very not-weird (0.19). Basque isn’t weird in any of the features that Abkhaz is weird in.

        @kumar: Tamil has 12 of the 21 values listed in WALS. Its score is 0.37, which is a pretty low weirdness (the median weirdness score for languages with 12+ values is 0.53, the bottom quartile is at 0.39–so Tamil is down the list a ways).

      28. Gabriel Svoboda says:

        Excellent list!

        Just a side note – even though I appreciate that it is globally neutral, it would also be nice to be able to “center” the list on a given language and get a list of languages by their similarity to language X.
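One rough way to sketch the "centering" idea is to rank languages by the share of feature values they have in common with a chosen language, counting only features coded for both. The feature codes below are real WALS codes from the spreadsheet, but the language names and the values ("v1", "v2") are made up for illustration, and this overlap measure is an assumption rather than anything used in the post.

```python
def similarity(a, b):
    """Share of matching values among features coded for both languages."""
    shared = [feature for feature in a if feature in b]
    if not shared:
        return None  # no overlapping features, so no basis for comparison
    matches = sum(1 for feature in shared if a[feature] == b[feature])
    return matches / len(shared)


def rank_by_similarity(target, languages):
    """Sort all other languages from most to least similar to `target`."""
    scores = {}
    for name, features in languages.items():
        if name == target:
            continue
        score = similarity(languages[target], features)
        if score is not None:
            scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical languages and values; a missing key means "not coded in WALS".
languages = {
    "X": {"83A": "v1", "87A": "v1", "116A": "v2"},
    "Y": {"83A": "v1", "87A": "v1"},
    "Z": {"83A": "v2", "87A": "v1", "116A": "v1"},
}

ranking = rank_by_similarity("X", languages)
print(ranking)
```

Note that "Y" comes out as a perfect match here on the strength of only two shared features, so a real version would probably want a minimum-overlap cut-off, much like the "14 or more of the 21 features" threshold mentioned elsewhere in the thread.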

      29. In reference to your first listing of the 25 weirdest languages: Spanish is spoken in South America, so I think you should list it there first with the other South American weird languages. I was waiting to see it there and was kind of thrown for a loop when it didn’t show up, as if it’s only spoken in Europe and not in South America.

      30. Oh cool! This study is fun! And thanks for providing the Excel file. Could you show what the list would look like if you removed the phonological features from the equation? Meaning, take out “uvular consonants,” “velar nasal,” “fixed stress locations,” and “presence of uncommon consonants,” and measure “weirdness” based on the other 17 features? I’d like to see the results :)

      31. Ken Litkowski says:

        Great discussion. Please submit to Transactions of the Association for Computational Linguistics. I’d very much like to know the methodology you used. I am studying preposition behavior (sorry, English), with 48,000 unlabeled instances and desperately seeking a way to cluster. I have roughly 1400 features for each instance (mostly WordNet synonyms and hypernyms).

      32. The weirdest language in Europe is Swiss German. German people don’t understand it (as if it were a dialect), and even we ourselves have trouble with some regions. It has rules, but fuzzy ones.
        German words get heavily truncated, some letters get exchanged or added. Some words have different meanings (saying became sawing), and there are at least 4 different ways to write/say yes!

        But weirdest of all, as many Dutch people told us, is that if we learn to speak Dutch, we are the only foreigners whose Dutch sounds right!

      33. BostonLinguist says:

        In my heart, I’m sure the moral of the story is true, but be careful about trusting WALS for data about the world’s languages. It’s really not a trustworthy source. Almost every time I’ve looked into any interesting WALS claim in detail, it’s turned out either that the WALS editors read their sources carelessly or that the source itself is doubtful, or both. WALS is mostly a compendium of library research in 200 years of grammars written by people with wildly varying degrees of linguistic training. There’s no quality control. WALS also ignores almost every source written by a generativist for academic-political reasons, which cuts them off from a huge literature on the world’s languages whose quality and astuteness is probably on average much higher than WALS’ sources. And often when you go back to a WALS source – even putting aside doubts about its quality – it turns out that the WALS people misread it (probably because they were reading quickly) or just quoted the grammar’s conclusions without checking if the actual data supported them. For a recent example, look at footnote 7 of this paper: http://ling.auf.net/lingbuzz/001822. But you can easily generate your own examples by just following up on the sources. And even when you can’t show that a claim is false or wrongly cited, you really want independent verification. Take the claim that “There is absolutely no difference between an interrogative yes/no question and a simple statement” in Chalcatongo Mixtec. It comes from a dissertation and book by Monica Macaulay that looks careful and well-written. But do we really know that there is no intonational difference? She says so, but the dissertation is based on fieldwork from the late 70s and 80s, before it was easy to do pitchtracks and really check that there’s no intonation difference. She’s apparently relying on her ears, but she’s not a native speaker and it’s possible there’s a difference she just didn’t hear.

        Her claim could be true, but I for one would suspend judgment until we’re really sure.

      34. Terry Collmann says:

        “how awesome of a name”

        How bizarre a construction is that? Yes, I know more and more young Americans stick “of” in “how [adjective] a [noun]” phrases, but not in my idiolect, baby, nor any of the other 65+ million speakers of English here in the British Isles. But I’m interested to know if you actually realise you’re using a very new construction, and you’ve noticed that others don’t.

      35. Here’s the full list of features, from the spreadsheet:

        83A: Order of Object and Verb
        87A: Order of Adjective and Noun
        143A: Order of Negative Morpheme and Verb
        143G: Minor morphological means of signaling negation
        69A: Position of Tense-Aspect Affixes
        116A: Polar Questions
        57A: Position of Pronominal Possessive Affixes
        101A: Expression of Pronominal Subjects
        6A: Uvular Consonants
        71A: The Prohibitive
        129A: Hand and Arm
        130A: Finger and Hand
        44A: Gender Distinctions in Independent Personal Pronouns
        14A: Fixed Stress Locations
        9A: The Velar Nasal
        72A: Imperative-Hortative Systems
        111A: Nonperiphrastic Causative Constructions
        64A: Nominal and Verbal Conjunction
        124A: ‘Want’ Complement Subjects
        117A: Predicative Possession
        19A: Presence of Uncommon Consonants

      36. Kate Lindsey says:

        Great work with some interesting results. Bernard Comrie (UCSB and MPI) published something very similar (in fact the same methods and very similar results) in 2011. Did you work together on this? Very surprising not to see him and his team mentioned.

      37. Interesting question. I am a German who has learned Dutch and although I cannot confirm having any trouble with using the “er” particle I still mix up “de” and “het”. As far as the error pattern is concerned I assume that it has to do with a lack of experience. There is an almost indefinite number of words in any language. In Dutch or German you have to know the corresponding article in all of these cases because they don’t always follow logical rules and there are so many exceptions. Yet, I suppose it’s easier for Germans learning Dutch and Dutch people learning German because of the obvious similarities between our languages.

      38. The typological peculiarities of the Lithuanian verbal system rival if not surpass those of its Latvian counterpart. See for example the paper “On the aspectual uses of the prefix be- in Lithuanian” by Peter Arkadiev, available on academia.edu. The title alone suggests that WALS has got it wrong by classifying Lithuanian as using only suffixes to indicate tense/aspect (Feature 69a), thus making it seem simpler than Latvian. It’s unclear why Latvian is classified as “mixed”, but the source cited is the 1960s edition of “Teach Yourself”! WALS is a wonderful thing, but there are many glitches, as others have pointed out.

      39. Charles Hall says:

        Thanks, great article. However, there are some problems with your analysis because it’s based on faulty “data” ….

        English is actually weirder than you think. For example, English does invert verbs and subjects to make yes/no questions. German, Dutch, and sometimes French really do…. but English [with one or two tiny exceptions with “heavy” intros] REQUIRES that the main verb stay behind the subject…

        English used to invert VERBS but doesn’t now; now it only inverts the “tense”, which can be linked to an auxiliary. In fact if there is no auxiliary, one has to be created! But in all cases, the VERB stays right where it belongs behind the Subject!

        He dances the fandango with great vigor.
        S V tns O

        Does he dance the fandango with great vigor?
        aux tns S V O

        Can he dance the fandango?
        Modal S V

        Did he dance?
        aux tns S V

      40. Sérgio Rocha says:

        I am still amazed by the fact that languages as similar as Spanish and Portuguese can be so separated in the scale. As a Portuguese speaker, I can comfortably read any text written in Spanish, and comprehend Spanish conversation with little difficulty.

      41. How about weighting with populations? Or log(population) so that it does not turn to Mandarin vs. the rest.
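A minimal sketch of what that weighting could look like for a single feature, assuming the per-feature score is built from the share of languages holding each value. The speaker counts are made-up round numbers, and `weighted_shares` is a hypothetical helper, not part of the original analysis.

```python
import math


def weighted_shares(languages, weight=lambda speakers: 1.0):
    """Share of total weight held by each feature value.

    `languages` is a list of (feature_value, speaker_count) pairs.
    The default weight of 1.0 per language gives the plain language count.
    """
    totals = {}
    for value, speakers in languages:
        totals[value] = totals.get(value, 0.0) + weight(speakers)
    grand_total = sum(totals.values())
    return {value: total / grand_total for value, total in totals.items()}


# Hypothetical data: one huge language with value "A", three small ones with "B".
sample = [
    ("A", 1_000_000_000),
    ("B", 10_000),
    ("B", 10_000),
    ("B", 10_000),
]

unweighted = weighted_shares(sample)
by_speakers = weighted_shares(sample, weight=lambda n: n)
by_log = weighted_shares(sample, weight=lambda n: math.log10(n))

print(unweighted["A"], by_speakers["A"], by_log["A"])
```

With these invented numbers, value "A" looks rare when every language counts once, almost universal when weighted by raw speaker counts, and somewhere in between with the log weighting, which is the dampening effect the comment is after.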

      42. Ulrich Kampffmeyer says:

        Great, surprising work! I have been looking into language and cultural implications in regard to the use of words for terminology, taxonomy, classification schemes and ordering systems (for records management). Your article gives a brilliant new perspective. Ulrich Kampffmeyer

      43. Tyler

        Hey Charles,

        You are right that Old English did a really full on German-like verb inversion but that now it’s really auxiliary verbs that get put at the beginning of yes/no questions. This is called out in the WALS chapter (http://wals.info/chapter/116). It is a bit different than moving the main verb around. However, the coding scheme does seem fair in this case, lest you subdivide an already tiny category, which I think would obscure the fact that there is an unusual word order thing happening in English.

      44. Tyler

        Hey Sérgio: On the difference between Portuguese and Spanish, see my response above to Eleder.

      45. I can’t believe Hindi is not weird. Didn’t the fact that it is ergative in the non-past and accusative in the past tense (or the other way — I forget), and the totally weird, non-regular pattern in the numbers figure into this somewhere?

      46. This is some fascinating analysis you guys did, thanks for sharing. My thoughts:

        Many of these languages are missing feature data. You restricted the list to those languages which have at least 2/3 of their selected 21 features, but they don’t overlap between languages, so for all we know Hindi could have really weird “57A: Position of Pronominal Possessive Affixes” but since the data is missing it wins as “least weird.”

        I downloaded the sheet and restricted it to languages which have all 21 features filled in, to make it an even playing field, and Abkhaz, Spanish, and Chinese show up as the weirdest and Turkish and Hungarian the least weird. Go figure.

        I haven’t carefully read your description of the analysis, but I think the problem is that a language like Turkish or Hungarian could have a weird feature that is so uncommon that almost no other languages have it, and as such this feature is not considered as part of the 21 features. For example, the Turkish “miş” suffix which is used (among other things) to describe or infer something which the speaker does not have direct personal knowledge of, e.g. “He should have arrived at noon [but I am relaying information and cannot personally verify that he did, indeed, arrive].” I’ve never encountered anything like it in any other language and for me that would put Turkish among the weirder languages of the world (and it would probably get a much higher weirdness ranking if included as a feature in your analysis), but whatever.

        Perhaps more importantly, your rankings of weirdness are going to depend heavily on your chosen language universe; then depending on what’s defined as a language vs. a dialect, certain features could be overweight or underweight. For example, suppose some researchers of Amazon rainforest languages document 100 related languages which all happen to share one feature which English does not have. Then English appears weird relative to these languages, because the 100 languages get a score of 99% for that feature and English gets 1%. Obviously this is an extreme example, but simply by virtue of being consolidated and well-established, a language like English is probably going to do somewhat poorly in this type of ranking vs. the thousands of other language/dialects which have not undergone this consolidation but share features.

        I wonder what the results would look like if you weighted them by number of speakers? Probably, you would end up with a list that looks very similar to the list of languages by speaker population, but due to shared features it wouldn’t be exactly the same.
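The Amazon example above can be put in numbers. This is only a sketch under the assumption that a feature value's "normalness" is the share of languages in the universe that have it; the value labels "A" and "B" are placeholders.

```python
from collections import Counter


def value_shares(values):
    """Map each feature value to the share of languages that have it."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# Universe 1: 100 related languages all coded "A", plus English coded "B".
split_universe = ["A"] * 100 + ["B"]
split_shares = value_shares(split_universe)

# Universe 2: the same 100 varieties counted as a single language.
lumped_universe = ["A", "B"]
lumped_shares = value_shares(lumped_universe)

print(round(split_shares["A"], 2), round(split_shares["B"], 2))
print(lumped_shares["A"], lumped_shares["B"])
```

Counted separately, the related languages give "A" a share of about 99% and leave English at about 1%. Counted as one language, the two values come out even, so the same English data looks either extremely weird or not weird at all depending on where the language/dialect line is drawn.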

      47. Welsh isn’t always verb initial though, is it? For some verbs it is but not for others.

        Hoffwn i… is verb initial
        where
        Rydw i’n hoffi… is subject-verb.

        It’s been a long time since high school, mind.

      48. Michael Hanson says:

        Fascinating work, thanks!

        I’d love to see this data plotted on a phylogenetic tree of language families… the closer you are to the proto-Indo-European root, the more normal, perhaps?

        Similarly the word-order switching of the western Germanic languages points to an older, conserved, feature (but it is present in both Romance and Germanic languages, hm).

        You could also look at founder effects, as a novel (“weird”) language feature pops up in an isolated population!

      50. Doesn’t doing the data-pruning you describe bias you against the truly weird candidates?

        i.e.”we restrict ourselves to the 165 features that have at least 100 languages in them (at this stage we also knock out languages that have fewer than 10 of these)”

        Weird languages will have rare features and possibly have feature dimensions that won’t make it to lists.

      51. Bart Anderson says:

        @Tyler, thanks for the article and your comment about Esperanto. Mikael Parkvall wrote a scholarly article using WALS that you might find relevant: “How European is Esperanto” (Language Problems and Language Planning 34:1 (2010), 63–79).
        Online at http://benjamins.com/series/lplp/34-1/art/04par.pdf

        In my experience, the structure of Esperanto is not especially European – it’s a very regular agglutinative language (prefixes and suffixes do not change in spelling, sound or meaning). This is a characteristic that seems to be more important than many of the others listed in WALS. The regularity and flexibility of word formation makes the language much easier to learn than ethnic languages.

      52. Esperanto has features that make learning easier regardless of your home language. For example:

        * 5 widely separate vowel sounds
        * phonetic spelling
        * very few rules, and no exceptions to rules
        * no irregular declensions or conjugations
        * accented syllable is regular (always on the penult)
        * small vocabulary required because of the system of affixes

        Really, it seems as though you either did not think through your answer, or were eager to find reasons Esperanto would be hard to learn for those who speak Asian languages (for example). But Esperanto was relatively popular in both Japan and China, and it seems not to have presented the challenge that you claim.

      53. Michael Cysouw says:

        As mentioned earlier in the comments, I wrote a paper on this subject (actually, the paper is from 2005, but it only got published in 2011). The official link to the publisher is here: http://goo.gl/y4L2T

        The critical problem is to account for the missing information in WALS, which I did through some (rough) kind of randomisation. Mixtec is still high on the list, but Wari’ actually came on top in my calculations :-).

      54. What about sign languages? Would be interesting to see how they ranked, when possible, against the “weirdness index.”

      55. What looks really weird to me is to study 2678 languages, as stated, and not even mention Esperanto.

        Esperanto is spoken by a few million people, mainly as a second language, and by a small number of native speakers too. Much more than the majority of languages in the list.

        Someone could say it is an “artificial” language, but in fact, every language in the world is also artificial. None of them is spontaneously generated by nature, but “artificially” by people, are they?

        It would be really interesting to know about the “weirdness” of Esperanto.

      56. Spanish possesses an uncommon consonant: the th sound of “think.” But it is pronounced this way only in the north of Spain; that is, by a very small minority of Spanish speakers, by the way, myself included.

      57. Mayan Linguist says:

        Fascinating work. I agree with BostonL that WALS cannot be relied on. So even almost identical languages in grammar are listed as having differences they don’t have. Several have mentioned Spanish & Portuguese. I spent 20+ years working as a linguist in the Mayan area of mesoAmerica (and grew up there, studying some of the languages even earlier). Looking over what WALS has to say about some of them: Tzeltal vs Tzotzil, or Itzaj Maya vs Yucatec Maya or Cakchiquel vs Tzutujil vs Quiche [the spellings used show something about the data itself – and the last wasn’t in your list], I know for sure that there are ‘errors’ in WALS; these are the languages I spent so much time working with. As Boston said, some are from sources that were not the best, or were looking at one area and simplified other areas.
        Sometimes it’s just a matter that WALS could only put in what it was given (or that it interpreted), so a blank on a particular attribute does not necessarily mean the language doesn’t have it. A statement of an attribute and what that language does is not a guarantee that it does, but at least in the languages I checked, they mostly did have the feature mentioned, though I don’t necessarily agree that it is standard. For instance, in English, Charles H took exception with the WALS data, but they actually were right: I’ve listened to Yoda himself … and argue with Yoda I will not.
        There are other aspects that should be taken into account as well.
        I notice that Michael C’s list is different than yours, both based on WALS. The difference is that he tried to fill in the missing data points when he could [I hope not through some weird ‘randomisation’ as he worded it. Hopefully he meant ‘randomly’ filling in missing data when he could find such data].

        As with several others, I wish sign languages (and other systems – like whistle talk) could be included. And thanks for all the work you’ve done on this and on NLP!

      58. WowNoMentionOfGeorgian says:

        Wow, not a single mention of the Georgian language, which is widely considered unique.

      59. Fascinating article, but I’m puzzled by the huge gaps in the WALS data. By that I don’t mean that Dahalo only has information on two of the features; that I can understand. But how can Portuguese only have information on 12 of them, while Spanish has 21? (And French has 20, German 20, English 21, etc. — this is all based on the spreadsheet which Tyler thoughtfully made available.)

        It’s not as if Portuguese is some obscure language, recently discovered, or with speakers who we have to trek through jungles and mountains to find! How hard can it be to find a linguistically trained Portuguese speaker to fill in those darned blank spots?

        It’s like a chemist filling in the periodic table, but skipping iron, boron, and palladium. How can a so-called world atlas be missing so much information on Portuguese? (As well as some other major or at least easy-to-find and live languages: Cantonese only has information on 15 of the features, Romanian 16, Hawaiian 15, Catalan 13, Gaelic (Scots) 11, Swedish 12, etc. etc.)

      60. I am lucky that Hungarian is my mother tongue. I am a writer, so this is what I work with, and by now I understand it almost completely. Half of it. :-)

      61. David Marjanović says:

        Dutch, German, Bulgarian, and Greek are said to have no dominant word order.

        For Bulgarian and Greek, that may be mostly true. But German (and AFAIK Dutch) has three obligatory word orders for three different purposes: verb-second (by default SVO) for normal declarative clauses, verb-first (by default VSO) for questions, and verb-last (by default SOV) for dependent clauses (triggered by most but not all conjunctions). Deviations from the default patterns constitute emphasis, deviations from the placement of the verb are not possible. (…Well, you can fake verb-first declaratives by using OVS to emphasize the verb and then drop the O because it’s just a demonstrative pronoun. But that’s it.)

      62. I was wondering what you do with Punjabi in NLP? What is its weirdness score? Just as a programming language is supposed to be Turing complete, historical Indian studies described the completeness of languages; per them, Sanskrit and Punjabi were complete. I am not an expert on what they based these on.

      63. Andreas Joswig says:

        Great Article, Tyler! I think I may use this data somewhere, especially when it comes to the German ranking! I’m sure you did not have enough data on Majang, did you?

      64. Sanskrit is the oldest language in India, or rather the mother of all Indian languages. I would love to have its “weirdness index.”

      65. Tyler

        MKT’s question is important because it has to do with how the WALS codings happened. Rather than someone being in charge of “filling in Portuguese for all the features”, it was the case that each feature had some linguist trying to really determine what was going on for the feature and then coding as many languages as possible for it. Each researcher working on a feature (or feature cluster) wanted to get diversity, but they were limited by time (can’t fill it out for everything) and materials (not every language has enough evidence for a value to be assigned).

        David M: The fact that different kinds of German sentences require different word orders is what leads WALS to say “no dominant word order”, although in this case it’s rather different than saying “you can put words in any order you want to”.

        Punjabi: There are only 10 of the 21 features filled in for Punjabi, it scores low based on those 10 but 10’s not very many. If it scores like Hindi on the others, it would stick to a low value, though.

        Andreas: Thanks! I used the published WALS data for this, so not my personal additions for Majang, Shabo, Shekkacho, (or actually dozens of other African languages that I annotated after going through Greenberg’s collection of grammars). Checking those is on my list of things-to-do.

        Tyler

        ps-I’m a big fan of Georgian (the script is one of the most beautiful in the world, here’s me randomly mentioning it on a post about place names: http://idibon.com/place-names/). As you can see in the spreadsheet, Georgian has 20 of the 21 features filled in and a high (but not extremely high) value of 0.68 in the Weirdness Index (for the 239 languages with 14+ values, it’s number 52).

      66. As a Portuguese speaker myself living in S. California, i can attest the reverse does not apply. For the most part, the Spanish speaking population here in Los Angeles, has no idea of what i’m saying when i try to speak Portuguese to them, but i understand almost everything they say to me. This has always been a mystery to me. A one-way comprehension.

      67. There are many theories about the origins of Basque, but there’s nothing proved. Nowadays, Basque is considered an isolated language with no relation with any other language.

        By the way, I’m Basque :)

      68. Steve Black says:

        I wonder what would happen to this study if you were to add in a crucial feature of language use–context–into the mix. For instance, it would be fascinating to understand how communicative situations understood as yes/no moments in English actually DO occur in Mixtec. Or perhaps the lack of yes/no indication is part of a larger cultural pattern in which every statement is potentially up for discussion/interpretation. This is a really fun and interesting project!

      69. You’ve missed the whistling language used in the Canary Islands. It’s in decline and I understand it is now being taught in schools there to preserve it.

      70. Interesting article. Apparently there is some pushback on using this WALS data from a variety of respondents, citing specific examples of why it’s deficient. If the linguistic community would take an idea from the open-source community and place the data on “git” for peer review and improvements, all these old data problems would have a mechanism for removal and the data would continuously get better.

      71. English may, in fact, be weird. However it is the language of flight and business all over the ‘civilized’ world.

      72. Very interesting article, but it would be interesting to see how ‘weird’ these languages are if you also take into account the number of people speaking each language. If 70% of the world’s languages have one feature, but only about 20% of the world population speaks these languages, that feature would become much more strange. This would probably cause a shift for languages like Mandarin, English and Spanish to be less weird. I just think it would give a more practical insight into which language is weird.

      73. I don’t want to be too facetious but your assertion that a phoneme in Mandarin cannot begin with an ‘ng’ as in ‘song’ is not strictly true: characters such as 嗯 are indeed pronounced ng. Would also have loved to see where Classical Chinese would feature on this list; it has plenty of strange features. Interesting article though!

      74. And in the meantime, in the latest issue of Ameridian Journal of Contemporary Linguistics, or the Khoisan Subjunctive Daily, a similar article includes statements such as “did you hear about this hilarious language of theirs, English – its verbs do not show whether you are talking about something you saw in person or just heard of, hahaha! And you know what, some of their words have as few as three syllables! And do not get me started on the vowels, man… they have hardly a dozen. How can they even communicate with that thing?”

      75. Very good point, languages are so weird sometimes and so colorful, like the flowers in a garden.

        I would add to the list my language, Albanian. Being Albanian myself, I find it very distant from the other languages here in the Balkans and Europe.

      76. I think that evolution of a language should come into play, such as how many years it takes before a past speaker cannot understand a future speaker. I would guess that an English speaker from the 15th century would not understand a modern speaker. Would that hold true for, say, Hindi? Or maybe Latin? I would also guess that the language that evolves the slowest would be the weird one.

      77. With all due respect to all languages, I am glad for Hindi to be the least weird language. And I am proud to know it.

        I hope that, with the results of this research, I will be able to encourage others to learn Hindi.

        Jai Hind, Jai Hindi

      78. The article is interesting, but looking at the full list I am quite surprised to see that Serbo-Croatian is treated as one language. Serbian and Croatian are 2 languages; they sound rather similar and have a number of words in common, but they have lots of grammatical differences and also use a different script – Serbian is written in Cyrillic and Croatian in the Latin alphabet.

      79. To Pravit: Czech actually does have a particle that has a similar function to the Turkish “-miş” suffix you write about. It’s written “prý”, read as “pree”. You use it when you relay information and you aren’t sure about it.
        “Prý je tam zima.” – “[Someone said/People say] there is cold [out there].”

        But in this case, it proves that Czech is really weird when you didn’t find any other language with such a feature.

      80. @Stuart: Just what I thought. I like the article, it’s interesting and inspiring. However, none of the ‘weirdness features’ relates to subject/object alignment (nominative-accusative vs. ergative systems). I suppose if this had been taken into account, the result might look a little different. Personally, I have always found Hindi and Basque rather ‘weird’ – and also cool and fascinating – because of the ergative phenomena.

      81. Ivan A Derzhanski says:

        But is it correct to say that ‘the Romance languages […] are a mixed bag: (Spanish=0.79, French=0.75, Italian=0.35, Romanian=0.30, Portuguese=0.17)’ when the divergence is mostly due to gaps in the database?

        Some of the values are easy to supply: 57A Position of Pronominal Possessive Affixes—here Portuguese is just like Spanish; 111 Nonperiphrastic Causative Constructions—ditto, as far as I can tell; 44 Gender Distinctions in Independent Personal Pronouns—here they differ, and Spanish is weird indeed, having a m/f distinction in 1st and 2nd person plural pronouns; 19 Presence of Uncommon Consonants—again Spanish is unusually weird, thanks to its /θ/.

        If 98-99 Alignment of Case Marking were counted, Hindi would look weirder: ergativity by itself is not uncommon, but a tripartite system is (4 languages out of 190), and besides the Hindi ergativity is split by tense, which the WALS chapter doesn’t even consider (fie! fie!).

        Turkish might also come out closer to the weird end if 78 Coding of Evidentiality were in the game: the category expressed by its _-mış_ forms is by no means unique as some commenters have implied, but evidentiality encoded as part of the tense system is rare (24 languages out of 418).

      82. Ivan A Derzhanski says:

        If there were a Klingon row in the table, the values would be, in order, (0.99118943, 1, 0.3875969, 1, 1, 0.2804878, 1, 1, 0.10278373, 1, 1, 1, 0.16935484, 1, 0.62702703, 0.65425532, 1, 0.78947368, 0.49635036, 0.9137931, 1). Not having the formula, I don’t know what weirdness this would amount to, but there is a 76% correlation with the Burushaski vector (weirdness 0.53, one value missing) and a 66% correlation with the Abkhaz vector (weirdness 0.84, all 21 values present). Then follow Kolyma Yukaghir and German.

      83. Tyler

        This answers a much-asked question!

        (You put “1” for values where Klingon did what the majority of human languages do, right?)

        The equation is to take the harmonic mean of these values (0.5086) and subtract it from one (so that weirder = higher value). That means Klingon has a weirdness value of 0.4914. That makes it similar in weirdness to Oneida (an Iroquoian language), Latvian, and Hawaiian.

        Of course Klingon was *intended* to be as non-human as possible. So a priori we might have hoped/expected Klingon to score as “really weird”. But of course, the features Marc Okrand chose to exemplify weirdness are not necessarily the same. I believe he especially chose weird combinations of sounds and odd word orders. Most word order phenomena are correlated in human languages (if you have a verb at the end of a sentence, you tend to have postpositions, not prepositions), so the fact that this particular correlation pattern is exactly what Okrand is violating is not captured. That’s actually quite a nice idea for an alternative: which languages consistently flout general correlations. Oooh.

        Thanks!
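
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation described in this reply: take the harmonic mean of the per-feature values and subtract it from one. The list is copied from the Klingon row proposed in the comment above; the names `KLINGON_VALUES` and `weirdness` are made up for this illustration, and this is only a reading of the formula as stated here, not the original analysis code.

```python
from statistics import harmonic_mean

# Per-feature values for Klingon, copied from the comment above
# (21 values; 1 means Klingon does what the majority of languages do).
KLINGON_VALUES = [
    0.99118943, 1, 0.3875969, 1, 1, 0.2804878, 1, 1, 0.10278373, 1, 1,
    1, 0.16935484, 1, 0.62702703, 0.65425532, 1, 0.78947368, 0.49635036,
    0.9137931, 1,
]


def weirdness(values):
    """Return 1 minus the harmonic mean, so that a higher score means weirder."""
    return 1 - harmonic_mean(values)


print(round(harmonic_mean(KLINGON_VALUES), 4))  # 0.5086
print(round(weirdness(KLINGON_VALUES), 4))      # 0.4914
```

The harmonic mean is dominated by the smallest values, so a single very rare feature (such as the 0.10278373 here) pulls the score towards "weird" much more strongly than it would with an ordinary average.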

      84. Ivan A Derzhanski says:

        For each feature, I looked up two unrelated languages which I knew did the same thing as Klingon, made sure their values were equal and then put down the same value for Klingon.

        Klingon was intended to seem as non-human as possible, but (given Okrand’s audience) this largely meant as unlike English as possible, and if English is weird, it is natural that its antithesis should not be. And many of the traits that really set Klingon apart from most or all human languages aren’t captured by your list of features, or even by WALS’s. (For example, lateral affricates aren’t among the uncommon consonants that WALS Ch. 19 looks for.)

      85. I’m not sure I agree about English and German not having a question particle. In German you have “nicht wahr” and in spoken English you have “innit”. While these may not be regarded as formal language, many other languages are probably not as formalised as these two. The French equivalent is “n’est-ce pas”.

      86. Fascinating! I am a novice here but have a question. Is there a correlation between the age of a language (when it became popular) and its weirdness? Intuitively, it seems like a language created today would be less weird, given all we know about structure.

      87. In Sweden, in small communities in the South-Western Woods, there is a language spoken by only a few hundred inhabitants. For Norwegians or ordinary Swedes it is completely unintelligible. I don’t know where, but there should be information about it somewhere. I have heard that scientists have been collecting and studying it. It is spoken by fewer people every year because the young people leave home and stop using it.

      88. Sean Roberts says:

        This is a great post!

        Can I ask what were your exact criteria for choosing the 21 features? That is, did you run correlations or pick the features logically?

      89. Germans must be at #10 thanks to such words as “Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesel” and “Rechtsschutzversicherungsgesellschaften”…

        English’s “Antidisestablishmentarianism” only got them the 33rd place..

      90. I am surprised that the list leaves out Sanskrit. Although it has practically no speakers left, it has one of the most extensive grammar systems of all languages. To illustrate, a verb like ‘Bhav’ (to do) has around 54 well-defined forms covering all time aspects. Agreed, an extensive grammar may not meet this particular study’s criteria for weirdness, but I would like to see where parent languages like Sanskrit and Latin would be placed in the list.

      91. To @Ineke: That’s very interesting indeed! I immigrated to Holland and am trying to learn Dutch and, of course, “het” and “de” are confusing, but the word “er” – now this is crazy. I come from Poland, and I thought Polish was harder anyway (I have heard this opinion many times). Surprisingly, my native language has a weirdness rate of only 0,56, while Dutch has 0,84 ;)

      Comments are closed.