Technology

Infrastructure

Three infrastructure components are necessary to build an interactive voice response system:

Hardware

The hardware is the same as that of a normal server system, with the following exceptions:

  1. Interactive voice response systems use more RAM: typically 2 gigabytes (GB) or more.
  2. Typically, no databases or internet applications are installed on speech dialogue servers.
  3. Special ISDN boards from Dialogic or Aculab are used, which differ from normal ISDN boards in the following respects:
    • One or more S2M ports are supported. Every S2M port allows 30 concurrent calls.
    • The boards integrate dedicated hardware that supports special telephony functions such as:
      • Echo suppression
      • Hiss suppression
      • Conferencing
      • Call reconnection to external phone numbers
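The port capacity stated above lends itself to a quick sizing calculation. The following is a minimal sketch, assuming only the figure of 30 concurrent calls per S2M port given in the list; the function name and the example load figures are hypothetical and chosen purely for illustration.

```python
import math

# From the text: every S2M port allows 30 concurrent calls.
CALLS_PER_S2M_PORT = 30


def s2m_ports_needed(peak_concurrent_calls: int) -> int:
    """Return the minimum number of S2M ports for a given peak call load."""
    if peak_concurrent_calls < 0:
        raise ValueError("peak_concurrent_calls must not be negative")
    # Round up: a partially used port still has to be installed.
    return math.ceil(peak_concurrent_calls / CALLS_PER_S2M_PORT)


if __name__ == "__main__":
    for load in (10, 30, 31, 100):
        print(load, "concurrent calls ->", s2m_ports_needed(load), "port(s)")
```

For example, a system expected to handle 100 simultaneous callers would need four S2M ports under this assumption.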

ENAiKOON decided to use Dialogic boards. Dialogic was bought by Intel in 1999 and is today the market leader with a market share of approx. 70%.

Standard Software

Operating System (OS)

Interactive voice response systems typically use operating systems such as Windows NT or Windows 2000. The most important reason for this is that the drivers and software libraries were usually implemented for Windows systems. Linux implementations have become available only recently.
Up to now, ENAiKOON has not migrated its IVR systems to Linux for stability reasons.
ENAiKOON plans to migrate as soon as possible in order to standardize its web server farm and its voice server farm on one platform.

Speech recognition system

There are various providers of good speech recognition software, such as IBM, Philips and Nuance. ENAiKOON chose speech recognition software from Nuance because we found that this system suits our needs. Nuance has adapted its software to the English and German languages very well.
Nuance Communications is the world leader in this type of software. Companies such as British Airways, Deutsche Telekom, Lloyds TSB Bank, SAS and Telia Mobile use Nuance software.

Text-to-Speech Engine

Speech synthesis is the computer-based conversion of written text into spoken language. This method is normally used when the text output is extremely dynamic. In all other cases ENAiKOON uses trained speakers for its voice systems.
ScanSoft is listed on the stock exchange (Nasdaq: SSFT), has over 500 employees and has subsidiaries all over the world. Its integrated TTS modules are used by leading manufacturers in telecommunications, mobile communications and vehicle communications, and provide voice-enabled components for UMTS, IVR, telematics products, wireless and Pocket PC solutions.
ScanSoft and Nuance have recently merged; the combined company operates under the Nuance name.

Database

Depending on the size of the database, we use MySQL or Oracle. In both cases the database is installed on a separate server. This ensures that the voice application and the database application do not affect each other.

Application

The application represents the logic of the deployment. The application defines how the user is greeted, which information must be given at what time, and how the system must react. Furthermore, the application contains typical speech application elements such as the grammar: a list of words and phrases that the computer can understand.

Application Development

The application development is split into the following parts:

Specification

The specification describes in detail which tasks the voice application has to cover and how it should do so. Furthermore, it describes the impression the caller should have of the system (e.g. more serious or more youthful).

Development of a Program Flow Chart

The flow chart defines exactly the steps in which the program has to "flow": which questions are to be asked in which situations, which answers are to be expected, and how the system reacts to each answer. Furthermore, it records which data is available to the application.
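A program flow of this kind can be written down as a table of steps, each with a question, the expected answers and the step that follows each answer. The following is only an illustrative sketch: the step names, prompts and the rule of repeating a question after an unexpected answer are assumptions, not a description of ENAiKOON's actual design.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    prompt: str  # question or announcement played to the caller
    # Expected answer -> name of the next step. Empty means "end of call".
    expected: Dict[str, str] = field(default_factory=dict)


# Hypothetical flow with three steps.
FLOW = {
    "greeting": Step(
        "Do you want departures or arrivals?",
        {"departures": "ask_time", "arrivals": "ask_time"},
    ),
    "ask_time": Step(
        "When do you want to leave?",
        {"now": "done", "tomorrow": "done"},
    ),
    "done": Step("Thank you. Goodbye."),
}


def run_flow(flow: Dict[str, Step], start: str, answers: List[str]) -> List[str]:
    """Walk the flow with the caller's answers and return the prompts played."""
    prompts = []
    pending = list(answers)
    current = start
    while True:
        step = flow[current]
        prompts.append(step.prompt)
        if not step.expected or not pending:
            # Final step reached, or the caller gave no further answer.
            return prompts
        answer = pending.pop(0).strip().lower()
        # Assumption: an unexpected answer repeats the same question.
        current = step.expected.get(answer, current)


if __name__ == "__main__":
    for line in run_flow(FLOW, "greeting", ["departures", "maybe", "now"]):
        print(line)
```

Keeping the flow as data rather than as nested conditionals makes it easy to compare the implementation against the flow chart step by step.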

Grammar and Dictionary

The grammar is a list of words and phrases that the application must be able to handle, grouped by the question the system asks.
Example: When asked "When do you want to leave?", the user can answer in different ways, such as "now", "tomorrow" or "on the 14th of January at 6 o'clock".
No matter which answer is given, the system must calculate a concrete date and time so that a database query can succeed.
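How such answers might be turned into a concrete date and time can be sketched as follows. This is a simplified illustration that covers only the three example answers; the function name, the regular expression and the rule that a date already in the past refers to the following year are assumptions. A real grammar would cover far more phrasings.

```python
import re
from datetime import datetime, timedelta

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Matches e.g. "on the 14. january at 6 o'clock"
# and "on the 14th of january at 6 o'clock".
ABSOLUTE = re.compile(
    r"on the (\d{1,2})(?:st|nd|rd|th)?\.? (?:of )?([a-z]+) at (\d{1,2}) o'clock"
)


def resolve_departure(answer: str, now: datetime) -> datetime:
    """Map a recognised answer to a concrete date and time."""
    text = answer.strip().lower()
    if text == "now":
        return now
    if text == "tomorrow":
        return now + timedelta(days=1)
    match = ABSOLUTE.fullmatch(text)
    if match and match.group(2) in MONTHS:
        day = int(match.group(1))
        month = MONTHS[match.group(2)]
        hour = int(match.group(3))
        candidate = datetime(now.year, month, day, hour)
        # Assumption: a date that has already passed means next year.
        if candidate < now:
            candidate = candidate.replace(year=now.year + 1)
        return candidate
    raise ValueError(f"answer not covered by the grammar: {answer!r}")


if __name__ == "__main__":
    reference = datetime(2024, 3, 1, 12, 0)
    for spoken in ("now", "tomorrow", "on the 14th of January at 6 o'clock"):
        print(spoken, "->", resolve_departure(spoken, reference))
```

Passing the current time in as a parameter, rather than reading the clock inside the function, keeps the behaviour reproducible during testing.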
The dictionary includes a transcription of all words and names. It determines how a word that is pronounced in a certain way is written. For example, Google is pronounced "Gu:gel". This means that if a caller says "Gugel", a machine without a dictionary would understand "Gugel", not "Google".
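The role of the dictionary can be pictured as a lookup from a pronunciation to the written form. The sketch below is a deliberately simplified illustration: the table and function name are hypothetical, and real recognition engines use their own phonetic notation and dictionary formats.

```python
# Hypothetical dictionary: pronunciation (as transcribed) -> written form.
PRONUNCIATIONS = {
    "gu:gel": "Google",
}


def written_form(heard: str, dictionary=PRONUNCIATIONS) -> str:
    """Return the spelling for a heard pronunciation.

    Without a matching entry the heard form is returned unchanged,
    which mirrors the "Gugel" instead of "Google" problem in the text.
    """
    return dictionary.get(heard.strip().lower(), heard)


if __name__ == "__main__":
    print(written_form("Gu:gel"))       # with dictionary
    print(written_form("Gu:gel", {}))   # without dictionary
```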

Coding of the Application

Coding the application means that the previously defined routines are programmed into the system. ENAiKOON does this by using predefined libraries, which are then extended with the elements specific to the application.
Example: The query for a certain time is already implemented, so only this existing function has to be called. During coding and testing, the speech output is produced by the TTS engine (text-to-speech). This makes it easier to change the announcements during testing.

Program generators are often used when programming voice applications, which helps to speed up the coding process.

Testing Phase

During the testing phase, the voice application is tested by a small number of people who were not involved in the development. The test shows whether these people get along with the application or whether there are problems that need to be solved. The answers that the computer did not understand are noted and added to the grammar later.
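Noting the answers that were not understood can be as simple as counting them per question, so that the most frequent ones are added to the grammar first. The following sketch shows one possible way to do this; the class and method names are hypothetical and not part of any particular IVR toolkit.

```python
from collections import Counter, defaultdict
from typing import List


class RecognitionLog:
    """Collects answers that the system did not understand, per question."""

    def __init__(self) -> None:
        self._misses = defaultdict(Counter)

    def record_miss(self, question: str, utterance: str) -> None:
        """Note one answer that was not covered by the grammar."""
        self._misses[question][utterance.strip().lower()] += 1

    def candidates(self, question: str, min_count: int = 1) -> List[str]:
        """Return unrecognised answers for a question, most frequent first."""
        return [
            utterance
            for utterance, count in self._misses[question].most_common()
            if count >= min_count
        ]


if __name__ == "__main__":
    log = RecognitionLog()
    question = "When do you want to leave?"
    log.record_miss(question, "the day after tomorrow")
    log.record_miss(question, "next week")
    log.record_miss(question, "The day after tomorrow")
    print(log.candidates(question))
```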

Operation Phase

The system goes live as soon as the customer accepts the state of the interactive voice response system.