Microsoft Speech API 5.1 support for CMU Flite
----------------------------------------------

Copyright Cepstral, LLC, 2001 all rights reserved

David Huggins-Daines <dhd@cepstral.com>
December 6th, 2001

About the Flite SAPI port
-------------------------

Funding for this work was provided by the Instituto de Engenharia de
Sistemas e Computadores (Lisbon, Portugal). The port itself was done
by David Huggins-Daines at Cepstral, LLC (Pittsburgh, USA).

This work remains Copyright Cepstral, LLC but is distributed under
the same free software licence as CMU Flite.

What's here
-----------

This directory contains a port of CMU Flite 1.1 and the included 8kHz
diphone voice to the Win32 platform under Visual C++, as well as an
interface library that allows Flite voices to be compiled into COM
objects which implement the TTS engine interfaces for Microsoft's
Speech API 5.1 (SAPI).

What isn't here
---------------

There isn't a pointy-clicky tool for automatically converting voices
to SAPI objects. You are more than welcome to write one; it should be
fairly straightforward to do as a Visual Studio wizard or add-in or
whatever they call those things.

Instead, the procedure for making SAPI objects is documented below.
Feel free to ask me questions at the address above if you don't
understand some parts of it.

Some parts of the SAPI interface code are language specific, namely
the viseme and phoneme translation code, and to some extent the text
processing code. These have been implemented for US English only,
although the relevant functions are abstracted with function pointers
so each voice is free to choose its own. See "Language-specific
functions" below for more information.

Building and testing the example voice
--------------------------------------

In order to build the Flite SAPI code and example diphone voice, you
will need a copy of Visual C++ 6.0 or later, as well as the full SDK
for SAPI 5.1. If you have the Microsoft Platform SDK, you will need
to make sure to install the Internet Explorer SDK as well, as it
contains some IDL files that are required in order to build COM
objects with the Platform SDK.

Before you build anything, you will need to set up Visual C++ to find
the SAPI header and IDL files. To do this, select "Options" from the
"Tools" menu. In the resulting dialog box, switch to the
"Directories" tab. Now, add the "include" and "IDL" directories from
your SAPI SDK installation to the list. If you performed a default
installation of the SAPI SDK on the C: drive, they will be:

    C:\Program Files\Microsoft Speech SDK 5.1\Include
    C:\Program Files\Microsoft Speech SDK 5.1\IDL

Make sure the build type is set to "Win32 - Debug" in Visual C++. Set
the active project to "FliteCMUKalDiphone". Select "Build
FliteCMUKalDiphone.dll" from the "Build" menu. This will build all of
the other libraries before building the SAPI object.

The first time you use the voice, you will need to register it with
Windows. To do this, build the "register_vox" project and execute
register_vox.exe (you can find it in the
FliteCMUKalDiphone\register_vox\Debug subdirectory, or you can simply
run it from within Visual C++).

Visual C++ should normally register the FliteCMUKalDiphone object as
a COM server automatically, but in some cases you may have to do it
manually. This can be done by running 'regsvr32
FliteCMUKalDiphone.dll' on the command-line from the build directory
(sapi\FliteCMUKalDiphone\Debug).

Now you can test the voice by running the SAPI "TTSApp" example
program, or any of the other examples included with the SAPI SDK.

How to SAPI-enable your own Flite voices
----------------------------------------

First, you'll obviously need to build your voice under Visual C++.
This is relatively straightforward. It's probably best to build it as
a static library. You'll need to make sure that it can find the Flite
header files as well as the ones for your language and lexicon, which
probably means adding their paths in the "Preprocessor" category of
the "C/C++" tab of the Project Settings dialog box.

One thing to be aware of is that the Microsoft linker will probably
break if you have very large voice data files as C code. To work
around this problem, you can change "Debug Info" from "Program
Database for Edit and Continue" to "Program Database" in the "General"
category of the "C/C++" tab in the Project Settings dialog box.

Since there is no support for dynamically discovering and loading
voices in Flite, each SAPI voice links to its own instance of the
engine. This also simplifies the distribution and installation of
voices considerably.

The common code used to implement the SAPI interface is in the
FliteTTSEngineObj library. Your voice should create a subclass of
CFliteTTSEngineObj. In the minimal case, you only need to provide a
constructor that sets the 'regfunc' and 'unregfunc' members of the
base class to the registration and unregistration functions defined in
your Flite voice.

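The function-pointer arrangement described above can be sketched
without any SAPI or ATL machinery. This is only an illustration of the
pattern, not the real class: DemoEngineBase, DemoVoiceObj, DemoVoice
and the two demo functions are hypothetical stand-ins, and the real
CFliteTTSEngineObj also implements the SAPI engine interfaces, which
are left out here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for Flite's cst_voice structure.
struct DemoVoice { const char *name; };

// Hypothetical, much simplified stand-in for CFliteTTSEngineObj.
class DemoEngineBase {
public:
    DemoEngineBase() : regfunc(NULL), unregfunc(NULL), voice_(NULL) {}
    virtual ~DemoEngineBase() { unload(); }

    // Create the voice through the hook supplied by the subclass.
    bool load() {
        if (regfunc == NULL) return false;
        if (voice_ == NULL) voice_ = regfunc(NULL);
        return voice_ != NULL;
    }

    // Release the voice through the matching hook, if both are set.
    void unload() {
        if (voice_ != NULL && unregfunc != NULL) unregfunc(voice_);
        voice_ = NULL;
    }

    const DemoVoice *voice() const { return voice_; }

protected:
    // Set by the subclass constructor, as with 'regfunc' and 'unregfunc'.
    DemoVoice *(*regfunc)(const char *voxdir);
    void (*unregfunc)(DemoVoice *vox);

private:
    DemoVoice *voice_;
};

// Stand-ins for the REGISTER_VOX / UNREGISTER_VOX functions of a voice.
static int live_voices = 0;
static DemoVoice demo_voice = { "demo" };

DemoVoice *register_demo_voice(const char * /* voxdir */) {
    ++live_voices;
    return &demo_voice;
}

void unregister_demo_voice(DemoVoice * /* vox */) {
    --live_voices;
}

// The "minimal case": a subclass whose constructor only sets the hooks.
class DemoVoiceObj : public DemoEngineBase {
public:
    DemoVoiceObj() {
        regfunc = register_demo_voice;
        unregfunc = unregister_demo_voice;
    }
};
```

The point of the design is that the base class never needs to know
which voice it is driving; everything voice-specific arrives through
the two pointers.
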
To set up the voice object, create a new Visual C++ project, using the
project type "ATL COM AppWizard". For "Server Type", choose "Dynamic
Link Library (DLL)".

Now you must create a definition for the voice object. To do this,
switch to "ClassView" in the sidebar, right-click on the project name,
and choose "New ATL Object...". Then, from the selection box, choose
"Simple Object".

In the "Names" tab of the next dialog box ("ATL Object Wizard
Properties"), choose whatever names you like for your class. In the
"Attributes" tab, select "Both" in the "Threading Model" section, and
"Custom" in the "Interface" section.

Next, you need to edit the IDL file to import and use the relevant
SAPI interfaces. You should be able to find this file in the "Source
Files" section of your new project in the "FileView" tab in the
sidebar. If your project name was "FooVoice", the IDL file will be
called "FooVoice.idl". In order to import the SAPI interfaces, you
should add the following line to the list of 'import' statements at
the top of the file:

    import "sapiddk.idl";

In order for Visual C++ to find this import, you will have to add the
SAPI IDL directory to the Tools->Options dialog box as detailed
above (do not add it to the Project Settings, because MIDL.EXE is
broken and will not accept it).

Just underneath the list of import statements, Visual C++ will have
created a bogus interface definition for your object, which will look
like this (the UUID and interface name will be different, of course):

    [
        object,
        uuid(51284204-38B4-48C4-B65E-4FDAF6476D13),

        helpstring("IFooVoiceObj Interface"),
        pointer_default(unique)
    ]
    interface IFooVoiceObj : IUnknown
    {
    };

You should delete this section. Then, you should change the list of
interfaces in the 'coclass' section to use the SAPI TTS Engine
interfaces. It should look like this (replace "FooVoiceObj" with the
name of your voice object):

    coclass FooVoiceObj
    {
        [default] interface ISpTTSEngine;
        interface ISpObjectWithToken;
    };

Now you will need to edit the source code for your voice object. As
noted above, in the minimal case, you will simply need to include
"voxdefs.h" from your voice library, inherit from CFliteTTSEngineObj,
provide declarations for REGISTER_VOX and UNREGISTER_VOX, and
initialize the pointers to them in a constructor.

To do this, open the header file for your voice object. It can be
found in the "Header Files" section for your project in the "FileView"
tab in the sidebar. If your voice object name (as entered in the
"Names" tab in the "ATL Object Wizard Properties" dialog box above)
was "FooVoiceObj", then this file will be called
"FooVoiceObj.h".

First, add these declarations underneath the #include statements at
the top of the file:

    #include "FliteTTSEngineObj.h"
    #include "flite_sapi_usenglish.h"
    extern "C" {
    #include "voxdefs.h"
    cst_voice *REGISTER_VOX(const char *voxdir);
    void UNREGISTER_VOX(cst_voice *vox);
    }

You will need to either put the full path to voxdefs.h in the #include
statement, or add the directory containing your voice's source code to
the list of extra include directories in the "Preprocessor" category
of the "C/C++" tab of the Project Settings dialog box. You may also
need to do the same for "FliteTTSEngineObj.h" (if you create your
voice within the "flite_sapi" workspace included here, you can enter
"..\FliteTTSEngineObj" here).

Next, you must adjust the inheritance list for your voice's class.
Remove the following lines marked with '-' and add the line marked
with '+':

    -   public CComObjectRootEx<CComMultiThreadModel>,
        public CComCoClass<CFooVoiceObj, &CLSID_FooVoiceObj>,
    -   public IFooVoiceObj
    +   public CFliteTTSEngineObj

You must also change the COM interface map to contain the SAPI
interfaces, by making the following changes (as above, remove the
lines marked with '-' and add those marked with '+'):

        BEGIN_COM_MAP(CFooVoiceObj)
    -       COM_INTERFACE_ENTRY(IFooVoiceObj)
    +       COM_INTERFACE_ENTRY(ISpTTSEngine)
    +       COM_INTERFACE_ENTRY(ISpObjectWithToken)
        END_COM_MAP()

Finally, add code to the constructor to set the 'regfunc' and
'unregfunc' members, and the language-specific functions, if you have
them, by adding the lines marked with '+':

        public:
            CFooVoiceObj() {
    +           regfunc = REGISTER_VOX;
    +           unregfunc = UNREGISTER_VOX;
    +           phonemefunc = flite_sapi_usenglish_phoneme;
    +           visemefunc = flite_sapi_usenglish_viseme;
    +           featurefunc = flite_sapi_usenglish_feature;
    +           pronouncefunc = flite_sapi_usenglish_pronounce;
            }

Before you build the SAPI object, you will need to add the voice
library and the Flite libraries (flite.lib, plus the libraries for the
lexicon and language model, which are cmulex.lib and usenglish.lib for
US English) to the list of extra libraries (in the "Input" section of
the "Linker" tab of the Project Settings dialog box in Visual C++).
You must also add "winmm.lib" here, as it is required by the Flite
library. Finally, make sure that Visual C++ can find the Flite
libraries - you can either set their projects up as dependencies of
your SAPI object's project, or you can add a list of relative paths to
their build directories in the Project Settings.

Registering and testing your new SAPI voice object
--------------------------------------------------

Now that you've built your voice as a SAPI component, you must
register it with the system so that it can be found and used by
programs using the SAPI interface. Predictably, this involves
twiddling bits in the Windows Registry.

Source code and a Visual C++ project are provided (in register-vox.cpp)
for a small command-line program which performs the necessary
operations for the CMU diphone voice. To use it for another voice,
you will need to make the modifications noted by /* CHANGEME */
comments in the source code.

To test your SAPI voice, use the TTSApp program included with the SAPI
SDK. You may find it helpful to build the debugging version of TTSApp
from source code and specify it as the executable to run when
debugging for your SAPI object's project. This will allow you to set
breakpoints in your code and get proper backtraces and so forth.

Language-specific functions
---------------------------

The language-specific functions for US English are contained in the
files flite_sapi_usenglish.c and flite_sapi_usenglish.h. The SAPI
object is set up to use them by initializing four function pointers
which are members of class CFliteTTSEngineObj:

    int (*phonemefunc)(cst_item *s);

This function takes a cst_item representing a single phone (usually a
member of the "Segment" relation) and returns the appropriate SAPI
phone ID for it.

    int (*visemefunc)(cst_item *s);

This function takes a cst_item representing a single phone and returns
the appropriate SAPI viseme ID for it. The SAPI visemes are
potentially language independent, though they are expressed in the
documentation in terms of US English phonemes. A more general
description of them is included below.

    SP_VISEME_0     silence
    SP_VISEME_1     low mid/front unrounded vowels (ae, ah, ax)
    SP_VISEME_2     low back unrounded vowels (aa)
    SP_VISEME_3     low/mid-low back rounded vowels (ao)
    SP_VISEME_4     mid front unrounded vowels (eh, ey)
    SP_VISEME_5     English mid rhotic vowel (er)
    SP_VISEME_6     high front unrounded vowels and glides (ih, iy, y)
    SP_VISEME_7     high back rounded vowels and glides (uw, w)
    SP_VISEME_8     rounded-to-rounded rising diphthongs (ow)
    SP_VISEME_9     unrounded-to-rounded rising diphthongs (aw)
    SP_VISEME_10    rounded-to-unrounded rising diphthongs (oy)
    SP_VISEME_11    unrounded-to-unrounded rising diphthongs (ay)
    SP_VISEME_12    English glottal fricative (hh)
    SP_VISEME_13    English retroflex approximant (r)
    SP_VISEME_14    English lateral approximant (l)
    SP_VISEME_15    grooved alveolar/dental fricatives (s, z)
    SP_VISEME_16    palatal fricatives and affricates (sh, zh, ch, jh)
    SP_VISEME_17    interdental fricatives (th, dh)
    SP_VISEME_18    labiodental fricatives (f, v)
    SP_VISEME_19    alveolar/dental occlusives (d, t, n)
    SP_VISEME_20    velar occlusives (k, g, ng)
    SP_VISEME_21    bilabial occlusives (p, b, m)

When adapting them to your phoneset, remember that the position of the
lips and teeth as viewed from the front is more important than the
place or manner of articulation /per se/.

Also, while the US English code implements this using a table lookup,
it may be more appropriate to determine them algorithmically from the
feature values used in your phoneset.

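As a rough illustration of the table-lookup approach, the viseme list
above can be written as a small array keyed by phone name. This is a
sketch and not the code in flite_sapi_usenglish.c: the names
VisemeEntry and viseme_for_phone are made up here, the function takes
a plain string instead of a cst_item, and the returned integer n
stands for SP_VISEME_n. Mapping "pau" (the Flite silence phone) and
any unknown phone to 0 is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// One row of the lookup table: a phone name and the n of SP_VISEME_n.
struct VisemeEntry {
    const char *phone;
    int viseme;
};

// Entries follow the viseme list in this document, phone by phone.
static const VisemeEntry viseme_table[] = {
    {"pau", 0},
    {"ae", 1}, {"ah", 1}, {"ax", 1},
    {"aa", 2},
    {"ao", 3},
    {"eh", 4}, {"ey", 4},
    {"er", 5},
    {"ih", 6}, {"iy", 6}, {"y", 6},
    {"uw", 7}, {"w", 7},
    {"ow", 8},
    {"aw", 9},
    {"oy", 10},
    {"ay", 11},
    {"hh", 12},
    {"r", 13},
    {"l", 14},
    {"s", 15}, {"z", 15},
    {"sh", 16}, {"zh", 16}, {"ch", 16}, {"jh", 16},
    {"th", 17}, {"dh", 17},
    {"f", 18}, {"v", 18},
    {"d", 19}, {"t", 19}, {"n", 19},
    {"k", 20}, {"g", 20}, {"ng", 20},
    {"p", 21}, {"b", 21}, {"m", 21}
};

// Linear search is enough for a table this small.
// Phones not found in the table fall back to 0 (silence).
int viseme_for_phone(const char *phone) {
    assert(phone != NULL);
    const std::size_t count = sizeof(viseme_table) / sizeof(viseme_table[0]);
    for (std::size_t i = 0; i < count; ++i) {
        if (std::strcmp(viseme_table[i].phone, phone) == 0)
            return viseme_table[i].viseme;
    }
    return 0;
}
```

A voice for another language would replace the rows with its own
phone names, grouping them by how the mouth looks from the front.
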
    int (*featurefunc)(cst_item *s);

This function returns a bitmask used by SAPI to indicate placement of
stress or emphasis (see the pages on the SPEVENTENUM and SPVFEATURE
enumerations in the SAPI documentation for more information on this).
If you use "stressed" and "accented" as the feature names for these
things in your language code, you can just copy
flite_sapi_usenglish_feature().

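The shape of such a bitmask function might look like the following
sketch. It does not reproduce flite_sapi_usenglish_feature(): the
DemoSegment structure stands in for looking up the "stressed" and
"accented" features on a cst_item, the DEMO_ constants are assumed to
mirror the SPVFEATURE values in the SAPI headers (check sapi.h), and
treating "accented" as emphasis is likewise an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Assumed to mirror SPVFEATURE_STRESSED and SPVFEATURE_EMPHASIS.
enum DemoFeature {
    DEMO_STRESSED = 1 << 0,
    DEMO_EMPHASIS = 1 << 1
};

// Hypothetical stand-in for the two feature values of one segment.
struct DemoSegment {
    bool stressed;
    bool accented;
};

// Combine the two flags into the bitmask for this segment.
int demo_feature(const DemoSegment *seg) {
    assert(seg != NULL);
    int mask = 0;
    if (seg->stressed) mask |= DEMO_STRESSED;
    if (seg->accented) mask |= DEMO_EMPHASIS;
    return mask;
}
```
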
    cst_val *(*pronouncefunc)(SPPHONEID *spids);

This function takes a zero-terminated array of SAPI phone identifiers
and converts it to a list of cst_val containing the phone names as
used by Flite as strings. You will probably want to simply copy
flite_sapi_usenglish_pronounce(), changing the tables it uses to look
up phone names.

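The conversion loop can be sketched as follows. This is not the code
in flite_sapi_usenglish_pronounce(): DemoPhoneId stands in for
SPPHONEID, a std::vector of strings stands in for the cst_val list,
and the IDs in demo_phone_names are made up for illustration and are
not the real SAPI phone IDs. Skipping unknown IDs is an assumption of
the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for SAPI's SPPHONEID (a wide-character type).
typedef wchar_t DemoPhoneId;

// Placeholder table indexed by phone ID. Index 0 is the terminator
// and is never looked up. These IDs are NOT the real SAPI values.
static const char *const demo_phone_names[] = {
    NULL, "pau", "aa", "k", "t"
};

// Walk the zero-terminated ID array and collect the Flite phone names.
std::vector<std::string> demo_pronounce(const DemoPhoneId *ids) {
    assert(ids != NULL);
    const std::size_t count =
        sizeof(demo_phone_names) / sizeof(demo_phone_names[0]);
    std::vector<std::string> phones;
    for (; *ids != 0; ++ids) {
        const std::size_t id = static_cast<std::size_t>(*ids);
        if (id < count)
            phones.push_back(demo_phone_names[id]);
        // IDs outside the table are skipped in this sketch.
    }
    return phones;
}
```
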