Speech Synthesis Markup Language: An Introduction
by Peter Mikhalenko
October 20, 2004
The Speech Synthesis Markup Language Specification (SSML 1.0),
published as a W3C Recommendation in September 2004, is one of the
standards that enable access to the Web through spoken
interaction. It is designed to provide a rich, XML-based markup
language for assisting the generation of synthetic speech in web and
other applications. The essential role of SSML is to give authors
of synthesizable content a standard way to control aspects of speech
such as pronunciation, volume, pitch, and rate across different
synthesis-capable platforms.
Background
The SSML specification is based upon the JSML and JSGF specifications, which are
owned by Sun. JSML (JSpeech Markup Language) was originally developed
as a very simple XML format used by applications to annotate text
input to speech synthesizers. JSML had characteristics very similar to
SSML: it defined elements that described the structure of a document,
provided pronunciations of words and phrases, indicated phrasing,
emphasis, pitch, and speaking rate, and controlled other important
speech characteristics. The letter "J" in the name comes from the
Java(TM) Speech API, introduced by Sun in collaboration with leading speech
technology companies for incorporating speech technology into the user
interfaces of applets and applications based on Java technology. The
design of the JSML elements and their semantics is quite simple. Here is
a typical, self-explanatory example:
<jsml>
  <voice gender="female" age="20">
    <div type="paragraph">
      You have an incoming message from
      <emphasis>Peter Mikhalenko</emphasis> in your mailbox.
      Mail arrived at <sayas class="time">7am</sayas> today.
    </div>
  </voice>
  <voice gender="male" age="30">
    <div type="paragraph">
      Hi, Steve!
      <break/>
      Hope you're OK.
    </div>
    <div>
      Sincerely yours, Peter.
    </div>
  </voice>
</jsml>
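For comparison, here is a rough sketch of how the same message might look in SSML 1.0. It is an illustration rather than an official mapping: it assumes that JSML's <div type="paragraph"> corresponds to <p> and that <sayas class="time"> corresponds to <say-as interpret-as="time">. The value "time" is itself an assumption, since SSML 1.0 does not define the set of interpret-as values.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <voice gender="female" age="20">
    <p>
      You have an incoming message from
      <emphasis>Peter Mikhalenko</emphasis> in your mailbox.
      <!-- "time" is an assumed interpret-as value, not defined by SSML 1.0 -->
      Mail arrived at <say-as interpret-as="time">7am</say-as> today.
    </p>
  </voice>
  <voice gender="male" age="30">
    <p>
      Hi, Steve!
      <break/>
      Hope you're OK.
    </p>
    <p>
      Sincerely yours, Peter.
    </p>
  </voice>
</speak>
```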
The JSpeech Grammar Format (JSGF) is a representation of grammars for
use in speech recognition. It defines a platform- and vendor-independent
way to describe one type of grammar, a rule grammar (also
known as a command and control grammar or regular grammar). Speech
recognizers use grammars to determine what to listen for, so a grammar
describes the utterances a user may say. JSGF is not
an XML format and is outside the scope of this article.
SSML's Place in the Global Scope
Voice browsers are an important part of the W3C's Multimodal
Interaction and Device Independence work, which aims to make web
applications accessible through multiple modes of interaction. A voice
browser is a device that interprets a markup language and is capable
of generating voice output
or interpreting voice input, and possibly other input/output
modalities. The W3C has developed a whole set of markup specifications
for voice browsers, and SSML is one of them. Speech
synthesis is the process of automatically generating speech output
from data input, which may include plain text, marked-up text, or binary
objects. It must be practical to generate speech synthesis output from
a wide range of existing document representations. A common
requirement for speech synthesis markup is that speech output from
HTML, HTML with CSS, XHTML, XML with XSL, and DOM must be
possible. The intended use of SSML is to improve the quality of
synthesized content.
Language Use
The key concepts of SSML are
- interoperability, or interacting with other markup languages (VoiceXML, SMIL, etc.);
- consistency, or providing predictable control of voice output across platforms and across speech synthesis implementations; and
- internationalization, or enabling speech output in a large number of languages within or across documents.
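As a small illustration of the internationalization point, the sketch below switches language inside a single document by using the xml:lang attribute. The sentences and language tags are made up for the example, and whether a given processor can actually speak both languages depends on the voices it has available.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <!-- Inherits en-US from the enclosing speak element -->
    <s>The French greeting is pronounced as follows.</s>
    <!-- Overrides the language for this sentence only -->
    <s xml:lang="fr-FR">Bonjour tout le monde.</s>
  </p>
</speak>
```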
A system that automatically generates speech output from text or
annotated text and supports SSML must render a document as
spoken output, using the information in the markup to speak
the document as the author intended. The speech synthesis process
consists of several steps.
- XML parse. The incoming text document is parsed, and the
document tree and content are extracted.
- Structure analysis. The structure of a document influences
the way in which a document should be read. For example, there are
common speaking patterns associated with paragraphs and
sentences.
- Text normalization. All written languages have special
constructs that require a conversion of the written form (orthographic
form) into the spoken form. Text normalization is an automated process
of the synthesis processor that performs this conversion. For example,
for English, when "$1000" appears in a document it may be spoken as
"one thousand dollars." The orthographic form "1/2" may be
spoken as "one half," "January second," "February first,"
"one of two," and so on. By the end of this step the text to be spoken
has been converted completely into tokens. The exact details of what
constitutes a token are language-specific. A special <say-as/> element
can be used in the input document to explicitly indicate the presence
and type of these constructs and to resolve ambiguities.
- Text-to-phoneme conversion. After the processor has
determined the set of words to be spoken, it must derive
pronunciations for each word. Word pronunciations may be conveniently
described as sequences of phonemes, which are units of sound in a
language that serve to distinguish one word from another. Each
language has a specific phoneme set. This step is difficult for
several reasons. First of all, there are
differences between the written and spoken forms of a language, and these
differences can lead to indeterminacy or ambiguity in the
pronunciation of written words. For example, in English, "read" may be
spoken as "reed" (I will read the book) or "red" (I have read the
book). Both human speakers and synthesis processors can pronounce
these words correctly in context but may have difficulty without
context. The <phoneme/> element of SSML allows a
phonemic sequence to be provided for any word or word sequence.
- Prosody analysis. Prosody is the set of features of
speech output that includes the pitch (also called intonation or
melody), the timing (or rhythm), the pausing, the speaking rate, the
emphasis on words and many other features. Producing humanlike
prosody is important for making speech sound natural and for correctly
conveying the meaning of spoken language. SSML has special
elements for prosody purposes, <break/>, <emphasis/>,
and <prosody/>, which I will
describe below.
- Waveform production. This is the final step, producing
audio waveform output from the phonemes and prosodic
information. There are many approaches to this processing step, so
there may be considerable processor-specific
variation. The <voice/> element in SSML allows the
document creator to request a particular voice or specific voice
qualities (e.g., a young male voice).
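The steps above can be tied back to markup with a short sketch. The document below is an illustration only: the sentences are invented, the interpret-as and format values on <say-as/> are assumptions (SSML 1.0 does not define them), and the phoneme string is one plausible IPA transcription of the past-tense "read".

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Structure analysis: explicit paragraph and sentence boundaries -->
  <p>
    <s>
      <!-- Text normalization: state which reading of "1/2" is meant.
           The "date" and "md" values are assumed, not defined by SSML 1.0. -->
      Your appointment is on
      <say-as interpret-as="date" format="md">1/2</say-as>.
    </s>
    <s>
      <!-- Text-to-phoneme conversion: force the "red" pronunciation.
           &#x25B; is the IPA open e. -->
      I have <phoneme alphabet="ipa" ph="r&#x25B;d">read</phoneme> the book.
    </s>
  </p>
  <!-- Prosody analysis: a pause, emphasis, and a change of rate and volume -->
  <p>
    <break time="500ms"/>
    Please reply <emphasis level="strong">today</emphasis>.
    <prosody rate="slow" volume="loud">This sentence is slow and loud.</prosody>
  </p>
  <!-- Waveform production: request a young male voice -->
  <voice gender="male" age="20">
    <p>This paragraph is read by a different voice, if one is available.</p>
  </voice>
</speak>
```

A processor that lacks a matching voice or does not recognize the say-as values may fall back to its own defaults, so the rendering can differ from platform to platform.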
SSML provides a standard way to specify gross properties of
synthetic speech production such as pronunciation, volume, pitch,
rate, etc. Exact specification of synthetic speech output behavior
across disparate processors, however, is beyond the scope of the SSML
specification. Note that markup values are merely
indications rather than absolutes. For example, it is possible for an
author to explicitly indicate the duration of a text segment and also
indicate an explicit duration for a subset of that text segment. If
the two durations result in a text segment that the synthesis
processor cannot reasonably render, the processor is permitted to
modify the durations as needed to render the text segment.
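A minimal sketch of such a conflict, using the duration attribute of <prosody/>, might look like the following. The text and the time values are invented for the example; the inner segment is asked to take longer than the outer segment that contains it, so a processor cannot honor both requests literally.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- The outer element asks for the whole sentence in 2 seconds -->
  <prosody duration="2s">
    Your flight departs at seven
    <!-- The inner element asks for part of it to take 5 seconds -->
    <prosody duration="5s">from gate twelve</prosody>.
  </prosody>
</speak>
```

How the two values are reconciled is left to the synthesis processor.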