Talking Point – A New Era of Speech Recognition

June 30, 1999

Since its beginnings, speech recognition technology has suffered from a less-than-respectable reputation. There are many reported examples of inaccuracies in work produced by speech recognition systems – such as Bill Gates’ famous ‘wreck a nice beach’ (recognise speech) reference – which have led to a widespread cynicism about the viability of a technology that can transform human speech into text.

Furthermore, as a method of producing work, speech recognition has become renowned as something of a non-starter, failing to deliver the many time- and cost-saving benefits it promised. Today, however, a new generation of speech recognition systems is entering the marketplace, this time with the real potential to change the way people work.

There are many differences between today’s systems and those that have gone before. While the development of the earliest systems was fuelled by a corporate desire to reduce workload and increase efficiency, it is the consumer marketplace which is driving today’s demand. This has led to the creation of very different systems catering for the specific requirements of each sector.

Within the consumer sector, the rapid increase in the number of computers in the home, along with the growing use of the Internet and e-mail, has caused manufacturers to enter the marketplace eagerly with voice recognition packages tailored towards consumers’ needs. As a result, many companies have produced off-the-shelf packages offering speech recognition in the same way as any other PC add-on. This high-profile activity is attracting a large proportion of ‘me-too’ consumers, who are keen to keep up to date with the latest technologies.

However, many of these systems do not provide the level of accuracy which is required for high-volume or mission-critical work. While this is not a concern for home users, it is an essential factor for businesses specifying technology which will ultimately impact on their bottom line.

For example, many of these systems, which require users to dictate directly into Word, only keep the audio in a temporary file. Once this has been deleted, the secretary does not have access to the audio, meaning that corrections to the final document have to be made by the original author. When fee-earning professionals have to spell-check, paginate and format their own work, they effectively become highly expensive typists. This negates any cost- and time-saving benefits originally offered by the system.

Furthermore, if dictation is less than 95 per cent accurate when first recognised by a software system, it remains easier to transcribe from traditional audio sources than for the author or secretary to clean the untidy document. Five words in every hundred is a large margin of error which firms simply cannot afford, and very few of them would employ a secretary who made that many mistakes.

There are other factors which also explain why many businesses remain hesitant to adopt this sort of technology. The history that lies behind business speech recognition systems is an important one. The first installations within the business sector took place in the US at the end of the 1980s, when businesses often paid upwards of US$20,000 to install a single-user application. At that time, it seemed that the Holy Grail of dictation had been reached, but after failing to perform the tasks they were sold on, these systems were discarded a couple of years later. The capital outlay and subsequent lack of returns involved with these systems meant that many businesses became cynical about the reality of speech recognition technology.

In today’s environment, however, where the success of any business in any sector is measured on its ability to reduce costs while improving productivity, speech recognition is once again becoming a consideration. A system that really could deliver what it promises would reap massive savings and would free executives to concentrate on fee-earning work. And this is exactly what today’s systems offer.

In contrast to previous speech recognition technologies, manufacturers have developed today’s systems around the client-server architecture. This means that systems can cater for multiple users, all working off a central network. The most advanced speech recognition systems can also be fully integrated with a comprehensive voice and data management system, which can sort and manage documents generated by a variety of voice input methods. This has the potential to offer savings, in both time and cost, for companies which work with large volumes of dictated material.

For individual users, there are also many benefits to be gained from server-based systems. Firstly, dictation can take place in the same way as it always has done, either via a portable machine, a PC-based dictation system, or a telephone (although many people still have doubts about the quality of recognition across a telephone line). This means that, by and large, users do not have to fundamentally change their working practices.

Additionally, compared to older voice recognition systems, server-based systems do not require hours of ‘training’, during which the computer has to ‘learn’ to interpret the user’s voice. This used to take anything up to 12 hours, which executives would often have to spread across two or three weeks in order to continue with their main work duties. Such a time-consuming process often became a key factor in the rejection of voice recognition systems. Today’s server-based systems, however, are virtually independent of the speaker. They can be trained in around 12 minutes of what is called ‘registration time’, which many manufacturers believe will be reduced even further as systems develop.

However, the most practical benefit of server-based systems is that they have the capacity to incorporate specific, pre-programmed vocabularies, or language ‘contexts’. This improves the speed of transcription, converting speech to text in near real-time. Furthermore, it allows for greatly improved transcription accuracy by using statistical probability to identify words which typically follow others. Each percentage increase in accuracy means that systems are more cost-effective and will become a viable option for a wider range of dictation users. It also means that they can more successfully cater for the specific groups of users who spend most of their time carrying out dictation. Lawyers, for example, can opt for contexts dependent on their specialisation – personal injury, employment law and litigation are just some examples.
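The statistical idea behind these language ‘contexts’ can be illustrated with a toy sketch. The following is a simplified, modern illustration (not the actual method used by any 1999 product): it counts how often each word follows another in a small, hypothetical legal-domain corpus, then predicts the most probable next word from those counts.

```python
from collections import Counter, defaultdict

# A toy domain 'context': a few legal phrases standing in for a
# specialised vocabulary. Real systems train on far larger corpora.
corpus = (
    "the claimant filed a personal injury claim "
    "the defendant denied the personal injury claim "
    "the claimant filed for damages"
).split()

# Count how often each word follows each other word (bigram counts).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def most_likely_next(word):
    """Return the statistically most probable word to follow `word`."""
    following = bigrams[word]
    return following.most_common(1)[0][0] if following else None

print(most_likely_next("personal"))   # -> "injury"
print(most_likely_next("claimant"))   # -> "filed"
```

A legal context would make ‘injury’ a far more likely successor to ‘personal’ than an acoustically similar alternative, which is how domain-specific statistics raise recognition accuracy.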

However, in markets where the adoption of new technologies is traditionally slow, convincing users about the real business benefits of a new system still poses a serious challenge for manufacturers. In the legal sector, dictation has long been relied on as a working method, and traditional tape-based dictation still accounts for a large proportion of the market. Even the transition from analogue-based systems to those using digital technology has represented a significant cultural shift for many practices.

Server-based speech recognition, then, relying on a practice having a PC network to which all users are connected, represents an even bigger step – and one which many practices, until now, have been unwilling to take. The cost of installing such a network and training users to a sufficient standard requires substantial investment, which, unless balanced by real savings later on, may prevent many firms from adopting such technologies. Those practices that made this move with older systems are often in the process of running trials on a fourth or fifth system, but are no nearer to finding a workable solution to deploy on an enterprise-wide scale. This may dissuade many firms from even attempting to integrate new systems. As a case in point, there is still no law firm in the UK that uses speech recognition as its single mode of work production.

However, as we enter the millennium, this situation is set to change dramatically. The widespread availability of digital methods of dictation, which far exceed the performance of analogue systems, will draw many into a new era of working practices. Many law firms are realising that PC-based technologies are the only viable method of dealing with increasing workloads and reducing the amount of paperwork that has to be dealt with every day. More importantly, they are recognising that the cost savings to be gained from these systems more than compensate for any initial capital outlay. Furthermore, this change, while gradual, will have a marked effect on the number of firms adopting other technologies which complement their PC-based and network systems.

From a manufacturer’s point of view, today’s speech recognition systems are exactly suited to the marketplace, with systems development being driven by the applications they serve. And while widespread use of speech recognition remains just out of sight, many sectors will begin to benefit from speech recognition technologies specifically tailored to their needs. This means that a wide range of professionals will have the option of technologies which can greatly increase their productivity, streamline their operations and free executives’ time, and – most importantly – which can offer the massive cost savings that previous systems could only talk about. In a future where every organisation and profession faces increasing competition, it will be the power of voice that provides the ultimate competitive advantage.