Code Mixing and Switching: A Preliminary Introduction

When you speak two languages but start losing vocabulary in both of them — Byelingual

Main ye article iss liye likh rahi hu kyunki I want to document what I learnt. (Translation: I am writing this article because I want to document what I learnt.)

What are some observations that you can make about the above sentence? The first half is in Hindi (written in Latin script, with the English word ‘article’ embedded in it) while the latter part of the sentence is purely in English. If you are bilingual like me or, for that matter, multilingual, you might have heard your friends speak in a similar way.

Switching between or mixing Hindi (or another native Indian language) and English is a common practice in urban Indian communication. It can even be observed among speakers who are not necessarily ‘bilingually proficient’ in English.

In fact, at university, my friends who speak Marathi, Bengali, Kannada, etc., often mix these native languages with each other or with English. In an effort to learn all of these languages at once, I have borrowed words from some of them into my own vocabulary, and I use them in my usual Hindi or English sentences. But not everyone mixes or switches languages for the same reason, which raises the question:

Why Do We Mix Languages?

I might mix certain languages for one reason and others for another, and different people have a wide range of reasons for mixing or switching between languages.

When it comes to English insertions in Hindi, it’s often because there is no appropriate word in Hindi, or at least in the speaker’s vocabulary. Aung Si [1] points this out specifically for Hindi-English code-switching while summarizing the reasons put forth for code-switching in India:

1. For neutralization because English lexical items are often perceived as being attitudinally and contextually neutral, and may be used to conceal social or regional identity (cited in Kachru, 1978).

2. A range of discourse-related functions, including repeating, emphasizing, heightening contrast, creating surprise, making parenthetical remarks, teasing, challenging or reporting others’ speech (Gupta, 1991).

3. Reasons, suggested by speakers, such as: ‘if I do not get the appropriate word in Hindi’, ‘easy to communicate’, ‘when we are short of words’, ‘to speed up communication’, ‘habit’, ‘unintentional’, ‘makes me feel comfortable’, ‘interesting and funny’, ‘scope of expression’ and ‘’cos it gives me a feeling of Indianism [sic]’ (Eilert, 2006).

What is Code Switching and Code Mixing?

The term linguists coined for this switching between two languages is code-switching. In simple words, it refers to a speaker switching or alternating between two or more languages in a conversation. It is sometimes used interchangeably with the term code-mixing, but the two can mean different things depending on the focus of the research. Code-mixing, specifically, could mean embedding words, phrases or morphemes from one language into another. [2]

The best examples to explain the differences between the two would be from my favourite paper on this topic: “I’m borrowing ya mixing?”[2]

Translation: I was going for a movie yesterday. I met Sudha on the way.

The utterance given above is an example of code switching, where the speaker completely switches from the English grammatical system to the Hindi one.

Translation: I was going for a movie yesterday and I met Sudha on the way.

This would be an example of code mixing, where ‘movie’ and ‘I met’ are English insertions into a Hindi utterance.
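The distinction between the two examples can be sketched in a few lines of Python. This is only a rough illustration, assuming every token has already been tagged with its language. The function name, the tag names ("en", "hi") and the tag sequences below are my own made-up stand-ins that mirror the shape of the two utterances described above, not data from the paper.

```python
def classify_utterance(sentences):
    """Label an utterance given as a list of sentences,
    where each sentence is a list of per-token language tags."""
    # More than one language inside a single sentence: intra-sentential.
    if any(len(set(tags)) > 1 for tags in sentences):
        return "code-mixing (intra-sentential)"
    # Each sentence is monolingual, but the sentences differ in language.
    sentence_languages = {tags[0] for tags in sentences if tags}
    if len(sentence_languages) > 1:
        return "code-switching (inter-sentential)"
    return "monolingual"


# Hypothetical tags: an English sentence followed by a Hindi sentence.
switched = [["en"] * 7, ["hi"] * 6]

# Hypothetical tags: one Hindi sentence with English insertions.
mixed = [["hi", "hi", "en", "hi", "hi", "hi", "hi", "en", "en", "hi", "hi"]]

print(classify_utterance(switched))
print(classify_utterance(mixed))
```

Real data is messier than this: a single conversation can contain both patterns, which is why the sketch checks for mixing inside a sentence first.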

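In simple words: code-switching alternates between languages from one sentence to the next, while code-mixing inserts words or phrases from one language inside a sentence of another.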

Why is this Topic Important?

Code-switching and code-mixing are seen not only in real-life conversations but also in digital ones, including emails, texts and social media interactions, which fall under “Computer Mediated Communication” or CMC. Natural language processing tools for human-machine communication would be far more useful if they could understand the way people naturally speak, including multilingual commands and interactions. Here are some reasons documented by researchers at Microsoft and CMU [3]:

For healthcare, understanding how people feel, if they are being open, will help to give better care, and enable better communication with patients, and better distribution and uptake of preventative care.

For educators, communication in the right register for tutoring, or understanding if concepts are or are not understood.

For entertainment, non-playing characters should communicate in the appropriate register for the game, and/or be able to understand natural code-switched communication with other players. [3]

What are the Challenges?

Simply put, it’s going to be harder to find code switching data than monolingual data for the following reasons:

Monolingual corpora will always be easier to find as monolingual discourse is more common in formal environments and hence more likely to be archived. Code-switching data, by its nature of being used in more informal contexts, is less likely to be archived and hence harder to find as training data. Also as code-switching is more likely to be used in less task specific contexts, with less explicit function it may also be harder to label such data. [3]

In Hindi-English code switching through text, Hindi written in the Latin script often has no defined standard for spellings, and even where a convention exists, anomalies arise from the different ways people contract words on social media. This is one of the many challenges one might come across in language identification and in computing other metrics for code-mixed data. Moreover, the speakers’ levels of fluency, the social context and constraints, and the various reasons why people code-mix all contribute to variation in code switching, so generalized models are harder to implement across languages.
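To make the spelling problem concrete, here is a minimal sketch of normalizing romanized Hindi tokens before any further processing. Everything in it is an assumption on my part: the variant table is a tiny hand-written sample, the "canonical" spellings are arbitrary choices, and the function name is hypothetical. A real system would need a far larger lexicon or a learned model.

```python
import re

# Hypothetical variant table: several romanized spellings of the same
# Hindi word, mapped to one arbitrarily chosen canonical form.
SPELLING_VARIANTS = {
    "hu": "hoon", "hun": "hoon",
    "kyuki": "kyunki", "kyonki": "kyunki",
    "nahi": "nahin", "nhi": "nahin",
}


def normalize_token(token):
    """Lowercase a token, squeeze stretched letters, and map known variants."""
    lowered = token.lower()
    # Squeeze runs of three or more identical characters down to two,
    # e.g. "haaaan" -> "haan".
    squeezed = re.sub(r"(.)\1{2,}", r"\1\1", lowered)
    return SPELLING_VARIANTS.get(squeezed, squeezed)


print([normalize_token(t) for t in "likh rahi hu kyuki".split()])
```

Even this toy version shows why the task is hard: the table has to be built by hand for each language pair, and contractions such as "nhi" cannot be recovered by simple rules.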

CM Data Analysis Metrics: An Introduction

To quantify and interpret CM, or code-mixed, data (covering both intra- and inter-sentential mixing, i.e. code-mixing and code-switching respectively), several metrics have been proposed [4], including the Code Mixing Index (CMI) and the Integration Index (I-Index). These and other metrics are explained in depth in Source [4] and by my co-intern, Uma.
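As a rough illustration of what these two metrics measure, here is a small sketch based on the commonly used utterance-level formulations as I understand them: CMI as 100 × (1 − max(w) / (n − u)), where n is the number of tokens, u the number of language-neutral tokens and max(w) the token count of the most frequent language, and the I-Index as the fraction of adjacent token pairs whose languages differ. The exact definitions in Source [4] may differ in details. The function names, the "univ" tag for neutral tokens, the choice to drop neutral tokens before counting switch points, and the hand-assigned language tags for the opening sentence of this article are all my own assumptions.

```python
from collections import Counter


def code_mixing_index(tags, neutral="univ"):
    """CMI for one utterance: 0 when monolingual, higher when more mixed."""
    n = len(tags)
    language_counts = Counter(t for t in tags if t != neutral)
    u = n - sum(language_counts.values())  # language-neutral tokens
    if n == u:
        return 0.0
    return 100.0 * (1 - max(language_counts.values()) / (n - u))


def integration_index(tags, neutral="univ"):
    """Fraction of adjacent token pairs where the language changes."""
    languages = [t for t in tags if t != neutral]  # simplification: skip neutral tokens
    if len(languages) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(languages, languages[1:]))
    return switches / (len(languages) - 1)


# "Main ye article iss liye likh rahi hu kyunki I want to document what I learnt"
# Tags assigned by hand for illustration only.
TAGS = ["hi", "hi", "en", "hi", "hi", "hi", "hi", "hi", "hi",
        "en", "en", "en", "en", "en", "en", "en"]

print(code_mixing_index(TAGS))   # 8 Hindi and 8 English tokens -> 50.0
print(integration_index(TAGS))   # 3 switch points over 15 adjacent pairs
```

The two numbers capture different things: CMI only looks at how the tokens are split between languages, while the I-Index looks at how often the speaker actually changes language along the way.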

Both CMI and I-Index require identification of a ‘matrix language’. The concept of matrix language vs. embedding language is instrumental to analyzing these for code-mixed utterances or intra-sentential mixing.

What is the Matrix Language?

Matrix languages provide abstract grammatical frames where embedding languages are inserted.

According to Myers-Scotton’s Matrix Language Frame model, the matrix language is the language that “projects the morphosyntactic frame for the utterance in question”; it has also been called the more ‘dominant’ language.

What is Embedding Language?

The embedding language is the other language in the sentence. The embedding language words within the sentence might be single lexical items.

However, mixing can also happen below the word level, as sub-lexical code mixing. An example of this in Hindi-English code mixing would be:

Translation: “I will keep the bottles and come.”

Here, the English word “bottle” takes the Hindi plural suffix “-en” (written “ें” in Devanagari), giving “बोतलें” (bottles).

For simplification, the matrix language is often generalized as the more frequently occurring language in the sentence. However, this isn’t always true: the matrix language might provide the basic grammatical structure while embedding language words are more frequent. It should also be understood that the matrix language could change at any given point in an utterance.

Keeping all of these constraints in mind, various methods have been proposed to identify matrix and embedding languages in bilingual code-mixed datasets. Source [6] highlights one such priority approach.
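The gap between "most frequent language" and "language that provides the grammatical frame" can be shown with a small sketch. This is not the approach from any of the cited sources. The example sentence, its tags, the tiny cue-word lists and both function names are hypothetical, chosen only to show how the frequency shortcut can disagree with a guess based on grammatical words.

```python
from collections import Counter

# Hypothetical cue lists: a few words that tend to carry the grammatical
# frame (verbs, auxiliaries, postpositions, articles) in each language.
FRAME_CUES = {
    "hi": {"kar", "do", "hai", "hu", "rahi", "tha", "ka", "ki", "ko", "pe", "se"},
    "en": {"is", "was", "the", "to", "will", "have"},
}


def most_frequent_language(tags):
    """The simplification: pick the language with the most tokens."""
    return Counter(tags).most_common(1)[0][0]


def frame_cue_language(tokens, tags):
    """Guess the matrix language from grammatical cue words only,
    falling back to frequency when no cue word is present."""
    cue_hits = Counter(
        tag for token, tag in zip(tokens, tags)
        if token.lower() in FRAME_CUES.get(tag, set())
    )
    if not cue_hits:
        return most_frequent_language(tags)
    return cue_hits.most_common(1)[0][0]


# Made-up sentence, roughly "Email the presentation slides (to me)".
tokens = ["Presentation", "slides", "email", "kar", "do"]
tags = ["en", "en", "en", "hi", "hi"]

print(most_frequent_language(tags))      # English words are more frequent
print(frame_cue_language(tokens, tags))  # but the verb frame is Hindi
```

Neither function handles a matrix language that changes partway through an utterance, which is one more reason the methods in the literature are more involved than this.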

Acknowledgements and References

With my limited knowledge as a first-time intern in Natural Language Processing, I am excited to have explored these concepts! I hope anyone who reads this article has something to take away from it. For feedback and further discussions, please email me.

My understanding of these concepts primarily comes from the papers that I have listed here, most of which were provided by my internship mentor, Laiba Mehnaz. My focus in this article was to put these concepts as simply as possible, but if you are looking for an in-depth understanding, these references are a good place to start: [1] [2] [3] [4] [5] [6]

Other Resources:

A diachronic investigation of Hindi–English code-switching using Bollywood film scripts; A Hindi-English Code-Switching Corpus; Changes in Code-Switching Patterns among Hindi-English Bilinguals in Northern India;

electrical engineering junior, exploring everything. she/her
