Research

The following are abstracts of some research papers published by Diwakar Mishra. In some of these papers he is the main author; in others he is a co-author and the major contribution is from other authors. The papers that are allowed to be redistributed are available on my ResearchGate and Academia profiles.

A Speech Synthesis System for Sanskrit Prose (PhD Thesis)

To download the PDF of the thesis, click here


Grapheme to Phoneme converter for Sanskrit Speech Synthesis

Abstract: The paper presents a Grapheme-to-Phoneme (G2P) converter as a module for Sanskrit speech synthesis. While spoken Sanskrit is used in limited, specific contexts, and for general purposes only by a small number of people (according to Census of India data), the socio-cultural value of the language retains its significance in the modern Indian milieu. Access to Sanskrit resources is of utmost importance, in India and worldwide, for the knowledge discourse of Sanskrit. This paper presents the development of a standalone G2P converter for Sanskrit based on the model developed by HP Labs India (and released through the Local Language Speech Technology Initiative). The converter takes Sanskrit Unicode text in UTF-8 format as input and writes the sequence of phones, with word and sentence boundaries, to the output file. The input text is expected to be in normal word form, i.e., any numbers or abbreviations should already be expanded into words. The system maps the characters and applies the specific rules necessary to convert the orthographic representation of Sanskrit into a phonetic representation. The conversion is done word by word, so cross-word modifications are not dealt with.
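As a rough illustration of the word-by-word character mapping described above (this is not the published system), a minimal sketch in Java might look like the following; the character-to-phone table is a tiny, hypothetical fragment, and the handling of vowel signs and virama is greatly simplified.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal, hypothetical sketch of word-by-word grapheme-to-phone conversion. */
public class G2PSketch {
    // Tiny fragment of a character-to-phone table (not the full Sanskrit inventory).
    private static final Map<Character, String> CONSONANTS = new HashMap<>();
    private static final Map<Character, String> VOWEL_SIGNS = new HashMap<>();
    static {
        CONSONANTS.put('क', "k");
        CONSONANTS.put('म', "m");
        CONSONANTS.put('ल', "l");
        VOWEL_SIGNS.put('ा', "aa");
        VOWEL_SIGNS.put('ि', "i");
    }

    static String toPhones(String word) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (!CONSONANTS.containsKey(c)) continue;   // unknown characters are skipped in this sketch
            out.append(CONSONANTS.get(c)).append(' ');
            char next = (i + 1 < word.length()) ? word.charAt(i + 1) : '\0';
            if (VOWEL_SIGNS.containsKey(next)) {
                out.append(VOWEL_SIGNS.get(next)).append(' ');
                i++;                                    // consume the vowel sign
            } else if (next == '्') {
                i++;                                    // virama: bare consonant, no vowel
            } else {
                out.append("a ");                       // inherent vowel
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Each word is converted independently, so cross-word sandhi is not handled.
        for (String w : "कमल कमला".split("\\s+")) {
            System.out.println(w + " -> " + toPhones(w));
        }
    }
}
```

Because each word is converted in isolation, sandhi effects across word boundaries never enter the mapping, which mirrors the limitation stated in the abstract.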

Text Normalizer for Sanskrit

Abstract: Though the Sanskrit writing system is very close to phonetic, running text contains, as in other languages, many tokens which are not pronounced as they are written, or which are not in word form but are nevertheless pronounced. Converting such words and other utterable tokens into standard word form is known as text normalization. Normalization, among its many other uses, plays an important role in speech synthesis or Text-to-Speech (TTS) systems. The normalization process can be divided into two main stages: recognition of non-standard words, and conversion of them into standard words. Ambiguity resolution can be an intermediate stage or a part of the recognition stage. The normalizer also performs other secondary, but no less important, tasks such as cleaning the text and appropriately placing punctuation marks (in the present case, isolating attached symbols from words). The present paper describes a system which recognizes a few types of non-standard words and converts them into standard Sanskrit word sequences. The program is developed in Java; it recognizes and converts into words a few formats of numbers and short forms of words, primarily abbreviations of text names, in Unicode Sanskrit text. The system applies a lexical approach to short forms of words and a rule-based approach to the different formats of numbers.
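The division of labour described above, a lexicon for short forms and rules for digits, can be pictured with a small hypothetical Java sketch; the lexicon entry and the digit readings below are illustrative placeholders, not the actual data or rules of the paper.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the two normalization strategies: a lexicon for
 *  short forms and simple rules for digits. All entries are illustrative only. */
public class NormalizerSketch {
    // Lexical approach: expand known short forms (hypothetical text-name entry).
    private static final Map<String, String> ABBREVIATIONS = new HashMap<>();
    // Rule-based approach: read each Devanagari digit out as a word
    // (a real system would expand whole numbers, not single digits).
    private static final Map<Character, String> DIGITS = new HashMap<>();
    static {
        ABBREVIATIONS.put("रा.", "रामायणम्");
        DIGITS.put('१', "एकम्");
        DIGITS.put('२', "द्वे");
        DIGITS.put('३', "त्रीणि");
    }

    static String normalizeToken(String token) {
        if (ABBREVIATIONS.containsKey(token)) {
            return ABBREVIATIONS.get(token);
        }
        StringBuilder out = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (DIGITS.containsKey(c)) {
                if (out.length() > 0) out.append(' ');
                out.append(DIGITS.get(c));
            } else {
                return token;          // already a standard word: leave unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String t : new String[] {"१२", "रा.", "रामः"}) {
            System.out.println(t + " -> " + normalizeToken(t));
        }
    }
}
```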

Syllabification and Stress Assignment in Phonetic Sanskrit Text

Abstract: The authors of this paper have developed a speech synthesis system for Sanskrit. This paper presents the Grapheme-to-Phoneme (G2P) converter module used in this system, which converts Sanskrit text in Devanagari UTF-8 into its phonetic representation with syllable boundaries and stress values assigned to each syllable. The stress rules applied here are very basic and are different from the Vedic supra-segmental svaras. Though a Sanskrit G2P converter that turns Unicode Sanskrit text into a plain phone-sequence representation already existed, it lacked syllabification and stress marking. The syllable is a very important unit in speech and in speech technology. The Festival framework, which is used for the Sanskrit speech synthesis system mentioned above and for many other speech synthesis systems, by default requires a phonetic representation with syllable boundaries and stress values. Many features used for F0 and duration modelling also depend on the syllable. A phonetic representation with syllabification therefore differs significantly from one without it.
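To make the role of syllabification concrete, here is an illustrative Java sketch that groups a phone sequence into syllables and assigns a toy stress value; the vowel inventory, the cluster-splitting heuristic, and the stress rule are assumptions for illustration, not the rules used in the paper.

```java
import java.util.*;

/** Illustrative syllabifier over a phone sequence. The vowel set and the
 *  stress heuristic are assumptions, not the rules used in the paper. */
public class SyllabifySketch {
    private static final Set<String> VOWELS =
        new HashSet<>(Arrays.asList("a", "aa", "i", "ii", "u", "uu", "e", "o"));

    /** Groups phones into syllables: each syllable holds one vowel; of an
     *  intervocalic consonant cluster, only the last consonant starts the
     *  next syllable (a common heuristic). */
    static List<List<String>> syllabify(String[] phones) {
        List<List<String>> syllables = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < phones.length; i++) {
            current.add(phones[i]);
            if (VOWELS.contains(phones[i])) {
                int j = i + 1;                                   // find the next vowel
                while (j < phones.length && !VOWELS.contains(phones[j])) j++;
                while (i + 1 < j - 1) current.add(phones[++i]);  // coda: all but the last consonant
                if (j == phones.length)                          // word-final consonants close the syllable
                    while (i + 1 < j) current.add(phones[++i]);
                syllables.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) syllables.add(current);
        return syllables;
    }

    /** Toy stress heuristic: a syllable containing a long vowel gets stress 1
     *  (long vowels are written with two letters in this toy phone set). */
    static int stress(List<String> syllable) {
        for (String p : syllable)
            if (p.length() == 2 && VOWELS.contains(p)) return 1;
        return 0;
    }

    public static void main(String[] args) {
        String[] phones = "k a m a l aa n i".split(" ");
        for (List<String> syl : syllabify(phones))
            System.out.println(syl + " stress=" + stress(syl));
    }
}
```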

Challenges in Developing a TTS for Sanskrit

Abstract: In this paper the authors present ongoing research on a Sanskrit Text-to-Speech (TTS) system called ‘Samvachak’ at the Special Centre for Sanskrit Studies, JNU. No TTS for Sanskrit has been developed so far. After reviewing the related research work, the paper focuses on the development of the different modules of the TTS system and the possible challenges. The research for the TTS can be divided into two categories: TTS-independent linguistic study, and TTS-related Research and Development (R&D). The TTS development is based on the Festival Speech Synthesis Engine.

Keywords: TTS, Speech Synthesis, Festival, normalization, word recognition, sentence recognition, phonotactics, POS, annotation, speech database

A Comparative Phonological Study of the Dialects of Hindi

Abstract: Dialectal variations provide vital cues to both synchronic and diachronic changes in the sounds of a language. There has been no comparative phonological study of the dialects of Hindi in the last several decades. In this paper, we present a phonological description of seven of the major dialects of Hindi, namely Awadhi, Bagheli, Bhojpuri, Bundeli, Haryanvi, Kanauji and Khari Boli, based on the observation and analysis of telephonic conversational data. We believe that these preliminary results will serve as a starting point for a more comprehensive and detailed comparison of the dialects and provide insights into language evolution as well as synchronic variations of Hindi.

Keywords: Hindi, Comparative study, Dialects

Hindi Dialects Phonological Transfer Rules for Verb Root Cǝlǝ

Abstract: Most Natural Language Processing (NLP) applications need to account for synchronic variations in a language as represented by its major dialects. However, most corpora available for the training and development of such systems tend to be dialect neutral. A framework that models synchronic variation can make NLP and speech technology systems more robust to dialect variations. In this paper we present basic phonological transfer rules from standard Hindi to a number of its prominent dialects. We believe that this can be the first step towards a more general model of dialect variation in Hindi. The rules here describe morphophonemic changes in simple verb forms across dialects, taking the verb root cǝlǝ as an example.
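One natural way to operationalize such transfer rules is as ordered string-rewrite rules per dialect, sketched below in Java; the dialect names and the rules themselves are placeholders and are not the rules reported in the paper.

```java
import java.util.List;
import java.util.Map;

/** Sketch of how dialect transfer rules could be represented and applied.
 *  The dialect names and rules below are placeholders, not the paper's rules. */
public class TransferRuleSketch {
    record Rule(String pattern, String replacement) {}

    // Hypothetical per-dialect rewrite rules over a romanized verb form.
    private static final Map<String, List<Rule>> RULES = Map.of(
        "DialectA", List.of(new Rule("tā$", "t")),     // placeholder rule
        "DialectB", List.of(new Rule("ǝl", "al"))      // placeholder rule
    );

    /** Applies each rule of the chosen dialect, in order, to the standard form. */
    static String transfer(String standardForm, String dialect) {
        String out = standardForm;
        for (Rule r : RULES.getOrDefault(dialect, List.of()))
            out = out.replaceAll(r.pattern, r.replacement);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(transfer("cǝlǝtā", "DialectA"));
        System.out.println(transfer("cǝlǝtā", "DialectB"));
    }
}
```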

Evaluating Tagsets for Sanskrit

Abstract: In this paper we present an evaluation of the available Part-of-Speech (POS) tagsets developed in India for tagging Sanskrit and other Indian languages. The tagsets evaluated are the JNU-Sanskrit tagset (JPOS), the Sanskrit Consortium tagset (CPOS), the MSRI-Sanskrit tagset (IL-POST), the IIIT Hyderabad tagset (ILMT POS) and the CIIL Mysore tagset for the Linguistic Data Consortium for Indian Languages (LDCIL) project (LDCPOS). The main goal behind this exercise is to check the suitability of the existing tagsets for Sanskrit from various Natural Language Processing (NLP) points of view.

Keywords: Astadhyayi, POS tagging, POS tagger, tagset, morphology, WSD, machine learning.

Annotating Sanskrit Adapting IL-POST

Abstract: In this paper we present an experiment on the use of the hierarchical Indic Languages POS Tagset (IL-POSTS) (Baskaran et al 2008 a&b), developed by Microsoft Research India (MSRI) for tagging Indian languages, for annotating a Sanskrit corpus. Sanskrit is a language with rich morphology and relatively free word order. The authors have included and excluded certain tags according to the requirements of the Sanskrit data. A revision of the IL-POSTS annotation guidelines is also presented. The authors also present an experiment in training the tagger at MSRI and document the results.

Keywords: POS Tagset, IL-POSTS, MSRI, EAGLES, hierarchical tagset

Discourse Anaphora and Resolution Techniques in Sanskrit

Abstract: Sanskrit has a fairly rich tradition of complicated prose writing. While the Vedic texts were mostly poetic, prose evolved as a major form of literary expression in pre-classical and classical Sanskrit. Since brevity and erudition were the hallmarks of ‘good’ Sanskrit prose, there evolved a literary style of Sanskrit prose rich in discourse anaphors and other kinds of long-distance references. Though a typical Sanskrit processing system (Manji et al, 2008) needs to handle such issues, there has been very little progress in theoretical or applied approaches to Sanskrit anaphora. Jha et al (2008) have presented a working model for sentential anaphors in Sanskrit, which needs to be extended to discourse as well.

Sanskrit syntax has not been a favorite topic with linguists. Some generic works on Indic languages, like those of Hock (1991) and Davison (2006), have looked at diverse syntactic issues, often not excluding anaphora. Shapiro (2003) has focused on lexical anaphors and pronouns in the languages of the subcontinent. Sobha et al (1998, 1999-I&II, 2007) have looked into the anaphora cases of some Indian languages in great detail, and in particular of Sanskrit in their most recent paper (2007). Jha et al (2008) have presented strategies from vyākaraṇa (grammar), mīmāṃsā (interpretation), and nyāya (logic) to arrive at a working model of Sanskrit anaphors. However, there is no treatment of discourse anaphora in Sanskrit.

The authors of this paper look at the problem from a broader perspective. They have collected and classified cases of discourse anaphors in Sanskrit from a wide-ranging sample, from the earliest times to the 18th-century text of Ambikā Dutt Vyāsa (Śivarājavijaya). The paper also considers popular didactic prose texts like the Pañcatantra and the Hitopadeśa, as well as some poetic texts like the Bhagavadgītā and the Rāmāyaṇa, to arrive at a sound description of discourse anaphora in Sanskrit. The paper then presents a computational model to handle such cases in Sanskrit and describes some of the components developed for Sanskrit POS tagging and morphological processing.

Anaphors in Sanskrit

Abstract: Research in building robust NLP systems with ambiguity resolution techniques has gained momentum in recent years. In particular, anaphora resolution initiatives have reached unprecedented heights in the last 10 years or so. Mitkov et al. (2001) have reported both rule-based, knowledge-based approaches and machine-learning-based ‘knowledge-poor’ approaches in an ACL issue devoted to this subject. Mitkov (2001a) has also presented outstanding issues and challenges in this area. Johansson (ed., 2007) reports the latest developments in this area of research and development.

Indian languages in general, and Sanskrit in particular, have not been extensively worked on from these perspectives. Barring a notable exception (Sobha 2007), Sanskrit anaphors have rarely been examined from a computational perspective. The case of Sanskrit has been more severe for two reasons: the virtual absence of annotated corpora has made corpus-based machine learning approaches impossible, and a poor understanding of Panini’s grammar from a computational perspective has made it difficult to apply rule-based approaches. While some works on Indic languages, like those of Hock (1991) and Davison (2006), have looked at diverse syntactic issues, often not excluding anaphora, Shapiro (2003) has focused on lexical anaphors and pronouns in the languages of the subcontinent. Sobha et al (1998, 1999-a&b, 2007) and Murthy et al (2005) have looked into the anaphora cases of some Indian languages in great detail, and in particular of Sanskrit in their most recent paper (2007), as mentioned above.

The authors of this paper look at the problem from a broader perspective. Since no effort has been made at a comprehensive documentation and classification of Sanskrit anaphora, this is the primary focus of the present study. Similar to Soon, Ng and Lim (2001), the anaphora resolution presented here is proposed to be a part of a larger NLP system called the Sanskrit Analysis System, parts of which have been developed by the principal author and his research students at the Sanskrit Centre of Jawaharlal Nehru University, New Delhi (Jha et al 2004, 2005, 2006, 2007, 2008).

An Algorithm for Morphophonemic Processing of Sanskrit

Abstract: Splitting continuous strings into meaningful constituents by applying rules of grammar has been a challenging problem in Natural Language Processing. While humans produce complex strings in a continuum, they understand them as separate words. This cognitive capacity can be understood through the rules of grammar of a language, which allow words to combine into arbitrarily long meaningful strings. Sanskrit is one such language, for which grammar rules have been provided for both generative and analytical processes. Developing a complete sandhi splitter for Sanskrit has not been possible so far. This paper presents an algorithm for a sandhi analyzer system for Sanskrit based on analytical Pāṇinian formulations. It also discusses the intermediate results of the R&D done so far and the limitations of the system. The process flow of the system is as follows:

  • input Sanskrit text
  • viccheda eligibility tests (pre-processing)
  • subanta processing
  • search for sandhi markers and sandhi patterns (sandhi rule base)
  • generate possible solutions (result generator)
  • search the lexicon
  • subanta processing (to parse the vibhakti of the first segment, if any)
  • output (segmented text)
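Purely as an illustration of this kind of pipeline (not the published system), the stages of candidate generation and lexicon validation could be strung together as below; the lexicon, the single reverse-sandhi rule, and all names are hypothetical and vastly simplified.

```java
import java.util.*;

/** Toy illustration of a sandhi-splitting pipeline. The lexicon and the single
 *  reverse-sandhi rule below are hypothetical and vastly simplified. */
public class SandhiSplitSketch {
    // Tiny hypothetical lexicon (the real system consults a full lexicon
    // and a subanta analyser).
    private static final Set<String> LEXICON =
        new HashSet<>(Arrays.asList("rāma", "iti", "gaja", "indra"));

    record Split(String left, String right) {}

    /** One reverse-sandhi rule: a surface "e" may come from "a" + "i"
     *  (guna sandhi); a real rule base contains many such patterns. */
    static List<Split> candidates(String text) {
        List<Split> out = new ArrayList<>();
        for (int i = 1; i < text.length(); i++) {
            if (text.charAt(i - 1) == 'e') {
                out.add(new Split(text.substring(0, i - 1) + "a",
                                  "i" + text.substring(i)));
            }
        }
        return out;
    }

    /** Keep only candidates whose segments are attested in the lexicon. */
    static List<Split> split(String text) {
        List<Split> accepted = new ArrayList<>();
        for (Split s : candidates(text)) {
            if (LEXICON.contains(s.left) && LEXICON.contains(s.right)) accepted.add(s);
        }
        return accepted;
    }

    public static void main(String[] args) {
        // "rāmeti" ~ "rāma" + "iti"
        for (Split s : split("rāmeti")) {
            System.out.println(s.left + " + " + s.right);
        }
    }
}
```

The real system additionally runs subanta processing on the segments and draws on a full Pāṇinian rule base; the point here is only the overall flow from candidate generation to lexicon validation.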

Strategies for Metrical Analysis of Sanskrit Text

Abstract: In this paper the author presents a model for metrical analysis of Sanskrit poetry. Sanskrit authors have composed both in prose and in verse. Their poetry is written in several chandas (metres). Though these chandas are very complicated, with a fixed order of syllable lengths, the poets have followed them strictly. This feature makes the poems well suited to computational analysis. The chanda can be recognized in the following three steps:

  • To mark all the vowels.
  • To mark each vowel as laghu (light) or guru (heavy).
  • To match the laghu/guru sequence of the line against the definitions of the chandas and identify the chanda.

Using these steps, the system can be built for text written in diacritics, ITRANS, or Unicode. The authors’ present system handles Devanagari text written in Unicode.
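The three steps can be pictured with a toy Java scanner over a romanized line; the vowel sets, the guru condition, and the single metre entry below are simplified placeholders rather than the authors' actual implementation.

```java
import java.util.*;

/** Toy laghu/guru scanner over a romanized line. The vowel classification
 *  and the single metre definition below are simplified illustrations. */
public class ChandaSketch {
    private static final Set<Character> SHORT_VOWELS = new HashSet<>(Arrays.asList('a', 'i', 'u'));
    private static final Set<Character> LONG_VOWELS  = new HashSet<>(Arrays.asList('ā', 'ī', 'ū', 'e', 'o'));

    /** Steps 1 and 2: find each vowel and mark it L (laghu) or G (guru).
     *  Simplification: guru if the vowel is long or followed by two or more
     *  consonants within the same word. */
    static String scan(String line) {
        StringBuilder pattern = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (LONG_VOWELS.contains(c)) {
                pattern.append('G');
            } else if (SHORT_VOWELS.contains(c)) {
                int consonants = 0;
                for (int j = i + 1; j < line.length(); j++) {
                    char d = line.charAt(j);
                    if (SHORT_VOWELS.contains(d) || LONG_VOWELS.contains(d) || d == ' ') break;
                    consonants++;
                }
                pattern.append(consonants >= 2 ? 'G' : 'L');
            }
        }
        return pattern.toString();
    }

    public static void main(String[] args) {
        // Step 3: compare the scanned pattern with stored chanda definitions
        // (only one placeholder entry is shown here).
        Map<String, String> chandas = Map.of("GGLL", "placeholder-chanda");
        String pattern = scan("rāmo gajam");  // toy line, not a real verse
        System.out.println(pattern + " -> " + chandas.getOrDefault(pattern, "unknown"));
    }
}
```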