Theme 2: Speech and Audio Context Recognition

Theme Leader: Gilles Boulianne, MSc.
Research Contributors: CRIM, McGill University (Prof. Richard Rose), École de technologie supérieure (Prof. Pierre Dumouchel)

As more cultural content is created and manipulated in electronic form, speech recognition technologies look increasingly attractive: they have the potential to greatly facilitate some of the most labour-intensive aspects of media production and post-production, such as post-synchronization, captioning, script correction, and subtitling. Access to, and understandability of, mainstream media such as television broadcasts and educational material remain the most immediate and urgent needs of the Deaf, deafened, and Hard of Hearing (D&HH) community. Addressing these needs also provides spin-off benefits for accessing media such as cinema, Internet Webcasts and Podcasts, and theatre, and for experiencing the audio in the world around us. During the next two years we will therefore pursue the research and development started in the first two projects, with the goal of improving the quality of closed captioning and extending its applicability to a wider range of acoustic environments and conditions. The four new projects have a high potential to meet the needs of the D&HH user community and, in some cases, of the general public as well.

  • Project 2.1: Core speech technology improvements

    Core speech recognition technologies had to be improved and adapted to cultural content in order to reach a useful level of accuracy. We introduced new algorithms and improvements in front-end processing, acoustic model and feature adaptation, vocabulary selection, and statistical language and grammar modelling. Progress was measured on large databases of speech recorded by voice writers during actual production of real-time TV closed captions, along with an additional large database of recorded TV shows from various broadcasters. Improvements were incorporated into closed-caption production and submitted to deaf and hard of hearing audiences for evaluation and guidance. We described this work in detail in 6 technical reports and 5 papers presented at international peer-reviewed conferences.

  • Project 2.2: Speech & Audio Context Recognition

    Especially in cultural settings, it is not only speech but the whole audio context that conveys relational and emotional content. While audio context is mostly ignored in current speech recognition research, this project investigated techniques to identify, classify and exploit this information to improve accuracy and give end-users more complete access to cultural content. Researchers from CRIM, McGill University, and École de technologie supérieure worked together to create prototypes and demonstrations showing how much the relational and emotional content of speech, and other information present in the sound signal, contribute to understanding a speaker’s meaning. Prototypes and demonstrations of systems for automatic indexing of spoken audio material were also created. Software for automatic segmentation into speech, music or noise, speaker identification and tracking, speaker verification, and synchronisation of a theatre script to an actor's speech was developed in the course of the project, and prototypes were demonstrated at the E-inclusion technology showcase. A Web site was created that allows keyword search of spoken content inside McGill’s COOL on-line lecture courses. This work led to the publication of 5 scientific papers, 1 magazine article, and 3 technical reports.

  • Project 2.3: Anything, anywhere: live captioning technology

    The project’s goal is to enhance the capabilities of our speech recognition based, live captioning technology, so that it becomes useful for live plenary sessions of conferences, trade shows or large meetings, which are one-time events taking place in remote locations and for which very little data can be obtained in advance.


    As an example of the need for live captioning of conferences, consider an event such as the 14th World Congress of the World Federation of the Deaf, held in Montreal in 2003. Providing live captioning in both French and English for such an event was considered a major achievement and earned an International Communications Industry award. Today such a feat has become impossible, due to the shortage of stenographers, and might remain impossible unless we succeed in this project. 

  • Project 2.4: Course lecture e-accessibility: e-Learning for the sensory disabled

    This project will build upon the E-Inclusion Phase 1 Lecture Transcription project by applying previous research to a large and realistic task. It will explore new paradigms in user interface (UI) design to allow people with vision loss to search and navigate course material and hear lectures, and it will give people with hearing loss access to lecture audio by synchronizing transcriptions of the lecture’s audio to the course slides. The project will automate the transcription and indexation of McGill’s COurses OnLine (COOL) multimedia lecture Web site, so that it can be navigated and used for e-Learning by all Canadians, including the sensory disabled.

    The general idea is that a student specifies a search term through an appropriately designed UI. The search term is converted to an acoustic representation and matched against the hundreds of lectures that are online. The student is then presented with a list of matching material, which can be navigated and selected for viewing or listening. Navigation of the Web site follows W3C Web Accessibility practices for people with vision and hearing loss.
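
    The search flow described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which acoustic representation is used, so the sketch assumes phoneme sequences, a tiny hypothetical pronunciation lexicon (LEXICON), lectures already decoded into phoneme strings, and approximate matching by edit distance over sliding windows. None of these names come from the project itself.

```python
from dataclasses import dataclass

# Hypothetical toy pronunciation lexicon (ARPAbet-like symbols).
# A real system would use a full lexicon or a grapheme-to-phoneme model.
LEXICON = {
    "matrix": ["M", "EY", "T", "R", "IH", "K", "S"],
    "hello": ["HH", "AH", "L", "OW"],
}


@dataclass
class Hit:
    lecture_id: str
    position: int   # index of the best-matching window in the phoneme stream
    distance: int   # edit distance between the query and that window


def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def search(term, lectures, max_distance=1):
    """Return lectures whose phoneme stream approximately contains the term.

    `lectures` maps a lecture id to its (assumed, pre-computed) phoneme list.
    Results are ordered from best to worst match.
    """
    query = LEXICON[term.lower()]
    hits = []
    for lecture_id, phones in lectures.items():
        best = None
        # Slide a query-sized window over the lecture's phoneme stream.
        for start in range(max(1, len(phones) - len(query) + 1)):
            window = phones[start:start + len(query)]
            d = edit_distance(query, window)
            if best is None or d < best.distance:
                best = Hit(lecture_id, start, d)
        if best is not None and best.distance <= max_distance:
            hits.append(best)
    return sorted(hits, key=lambda h: h.distance)
```

    Allowing a small edit distance is one simple way to tolerate recognition errors in the lecture transcripts; the threshold max_distance is an illustrative parameter, not a value taken from the project.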

  • Project 2.5:  Collaborative captioning

    The E-Inclusion collaborative captioning Web portal will be the focal point for a central repository of captions. This database will be populated in a collaborative fashion - Wiki-style - by Internet users through a simple interface.  Upon selection of an audiovisual or audio-only document anywhere on the Web, a browser applet will look up the file in the repository, and indicate if captions are available.  If so, they will be displayed in a window applet as the playback proceeds, in synchronization with the audio.
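
    The lookup-and-display flow described above can be sketched as follows. This is a minimal illustration, not the portal's design: the text does not say how documents are identified or how captions are stored, so the sketch assumes a document is keyed by a SHA-256 hash of its URL and that each caption carries start and end times in seconds. The names CaptionRepository, document_key and caption_at are hypothetical.

```python
import bisect
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Caption:
    start: float  # seconds from the beginning of the document
    end: float
    text: str


def document_key(url: str) -> str:
    """Assumed identification scheme: hash of the document's URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


class CaptionRepository:
    """In-memory stand-in for the central, user-populated caption store."""

    def __init__(self):
        self._tracks = {}

    def contribute(self, url, captions):
        # Store captions sorted by start time so playback lookup can bisect.
        self._tracks[document_key(url)] = sorted(captions, key=lambda c: c.start)

    def lookup(self, url):
        """Return the caption track for a document, or None if none exists."""
        return self._tracks.get(document_key(url))


def caption_at(track, t):
    """Return the caption text to display at playback time t, or None."""
    starts = [c.start for c in track]
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and t < track[i].end:
        return track[i].text
    return None
```

    A player would call lookup once when a document is selected, then call caption_at with the current playback time as the audio proceeds.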
     
    With the D&HH community taking ownership of and control over captions, and with its architecture of participation and democracy, this project is a typical Web 2.0 site. Such user-maintained online databases have already had huge success: the Internet Movie Database (in operation since 1990, it provides information about over 889,000 movies), CDDB, freedb or MusicBrainz (the latter now contains over 439,000 CD titles and 5.2 million track titles), not to mention the well-known Wikipedia.

  • Project 2.6:  Assisted indexation

    Speech technologies developed during previous projects will be combined into a prototype that amplifies the human ability to identify audio content, in the same sense that a bulldozer multiplies the work capacity of its operator. The aim is to dramatically reduce the time-intensive manual task of indexing large audiovisual archives. The e-accessible system includes a thin web client user interface for assisted indexation, together with modules for audio segmentation, speech and speaker segmentation, speech, speaker and emotion recognition, indexation, and search.

    There are vast amounts of archived Canadian cultural content on the Web that are not accessible to the D&HH. Our history is our heritage, and these archives should be available to all Canadians. A large multimedia archive, such as those collected by the Canadian Broadcasting Corporation (CBC) - Radio-Canada, Canada National Archives, CEDROM-SNI, or the National Film Board, is targeted for the experiments.


Attachment                                   Size
theme2_mar2006_small.pdf                     1.11 MB
einc2_presentation_mars_2008_v30.pdf         833.46 KB