David House

Timing of intonation and gestures in spoken communication

The melody of speech, or intonation, plays a crucial role in spoken interaction. By altering the speech melody, speakers can highlight important words and phrases, making them prominent and more meaningful. Speakers also use changing melodies and rhythms to signal when it is time for other speakers to talk (turn-taking) as well as to give others feedback (such as mm or uhuh). The exact timing of melodies in speech is controlled with considerable precision by the speaker, and these melodic movements occur at particular places in relation to syllables. Body and facial gestures regularly accompany the speech melody and often have the same function as intonation, but until now we have not been able to measure the timing of these gestures with the same precision as intonation. The aim of this research project is to measure with precision the timing relationship between speech melodies and gestures using a large database of recorded conversations in Swedish. The participants have been recorded with high-quality audio, video and motion-capture equipment in a specially designed studio. The results will have implications for our understanding of how speech and gestures are planned and coordinated in the brain, and will also enable better modeling of speech and gestures in speech applications such as robots and avatars.
Final report

Timing of intonation and gestures in spoken communication


1. Main goals and research direction of the project

The main goal of the project was to examine the temporal coordination between speech and gesture. More specifically, the point of departure was to investigate the relationship between speech melody, represented by intonational movement, and co-speech gestures using high-quality audio, video and motion-capture data that allow automatic extraction and analysis of gestural and prosodic aspects of the speech signal. The project concentrated on this goal and initially investigated the temporal coordination between head nods having a prominence-signaling function and stressed syllables. During the course of the project, the work was extended to the temporal coordination between hand gestures and syllables that signal potential locations for turn-taking in dialogue. Finally, new methods were developed and tested within the project for automatic annotation of larger units of speech and gesture from motion data. These methods enable a more extensive investigation of temporal coordination and can be used for building avatars and robots with natural gesturing and gesture-recognition capabilities. The project, moreover, had a broader and more general aim of testing the hypothesis that gestures and intonation occurring in synchrony principally share the same communicative function.


2. The three most important results of the project and reasoning around these

The first principal result of the project concerns the temporal synchronization between head nods and tonally marked stressed syllables. Head nods functioning as prominence-lending “beat gestures” on average occurred slightly ahead of the stressed syllable, which is consistent with the literature on temporal synchronization of co-speech gestures (mainly hand and arm gestures). These results are especially interesting when compared to results reported in the literature, as the results of this project were obtained from spontaneous dialogue while the literature mainly presents results from scripted speech. The alignment of the head nods with the stressed syllable suggests that the prominence function is common to the intonational movement and the head gesture. However, the greater temporal variation of the head movement compared to the prominence-lending fundamental-frequency movement does not support the hypothesis that there is a common motor-generation component for both head and intonational gestures.
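As an illustration of how such a timing relationship can be quantified, the following minimal Python sketch computes the lag between head-nod apexes and the nearest stressed-syllable onsets. It is not the project's actual analysis pipeline; the function name, event times and example data are hypothetical.

# A minimal sketch (not the project's analysis pipeline) of computing
# nod-to-syllable lags from two lists of event times, e.g. nod apex times
# extracted from motion capture and stressed-syllable onsets from a
# forced alignment. All names and data are illustrative.
import statistics

def nearest_lag(nod_times, syllable_onsets):
    """For each nod apex, return its lag (in seconds) to the nearest
    stressed-syllable onset; negative lags mean the nod precedes the syllable."""
    lags = []
    for nod in nod_times:
        nearest = min(syllable_onsets, key=lambda s: abs(s - nod))
        lags.append(nod - nearest)
    return lags

# Hypothetical example data (seconds from the start of a recording)
nods = [1.02, 2.48, 4.11]
syllables = [1.10, 2.50, 4.05, 5.30]

lags = nearest_lag(nods, syllables)
print("mean lag:", statistics.mean(lags))        # negative values: nod precedes syllable
print("lag spread (stdev):", statistics.stdev(lags))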

The second main result concerns the temporal relationship between hand gestures and syllables that constitute potential places for turn transitions in dialogue. An important relationship between gesture offset timing and turn transition was found. When speakers gesture near a potential turn boundary, the gesture stops before the end of speech when there is a change of turn. When the speaker wants to keep the turn, the gestures tend to extend beyond the end of the acoustic speech signal. These results suggest that gesture functions as part of a prosodic system of turn-taking (along with duration and pitch), but also that gesture can function as an independent cue to turn-taking.

The third primary result also concerns hand gestures and the synchronization between gestural units and longer stretches of speech, but it also involves the development of methodology. The technical area of automatic speech and gesture detection is advancing rapidly, and during the course of the project we have seen a movement away from rule-based detection towards machine-learning methodologies. Our results in this area concern the development of methods to automatically detect and annotate gestural units in spontaneous speech. These methods allowed us to investigate the relationship between speech phrases and gestural units. Our results indicated a general tendency for the onset of speech to slightly precede longer gesture units, a timing trend opposite to that observed between head nods and stressed syllables.
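For illustration only, the general idea of detecting gesture units from motion data can be sketched with a simple rule-based segmentation: hand speed is computed from motion-capture positions, and contiguous stretches above a speed threshold are treated as gesture units. The project's own annotation methods, including the machine-learning approach mentioned above, are considerably more elaborate; the threshold, frame rate and function below are assumptions.

# A simplified, rule-based sketch of gesture-unit detection from motion-capture
# data. Segments where hand speed stays above a threshold are treated as
# gesture units. Parameters are hypothetical.
import numpy as np

def detect_gesture_units(wrist_xyz, fps=100, speed_threshold=0.15):
    """wrist_xyz: (n_frames, 3) array of wrist positions in metres.
    Returns a list of (start_time, end_time) tuples in seconds."""
    velocity = np.diff(wrist_xyz, axis=0) * fps           # metres per second
    speed = np.linalg.norm(velocity, axis=1)
    moving = speed > speed_threshold

    units, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i                                      # movement begins
        elif not m and start is not None:
            units.append((start / fps, i / fps))           # movement ends
            start = None
    if start is not None:
        units.append((start / fps, len(moving) / fps))
    return units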

3. New research questions generated by the project

The initial work of the project investigated the temporal coordination between prosody and gesture restricted to the time domain of the syllable. One of the most important and exciting new questions generated by the project relates to coordination between prosody and gesture over a longer time domain and with different functions, such as turn-taking. We have found a loose temporal relationship in which both stressed syllables and turn-final syllables serve as anchor points between speech and gestures sharing the same function, but we have also found a relatively large degree of variation and a certain optionality of gestures. How to integrate gesture into a full account of the prosodic system with all of its functions remains a challenging issue.
 
A second area of new research generated by the project is also related to the longer time domain of the gesture unit and involves the development and testing of modeling gestural flow using a Hierarchical Hidden Markov Model (HHMM) instead of a rule-based method. This type of modeling has been tested and validated within the project but needs to be extended and refined to include segmentation of gesture units into gesture phrases and gesture phases.
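To indicate the general train-and-decode workflow behind this kind of modeling, the sketch below fits a flat Gaussian HMM (using the hmmlearn library) to per-frame motion features and decodes a state sequence. This is a simplified stand-in for the project's Hierarchical HMM, which is trained on labels of complete gesture units; the features, number of states and data here are illustrative assumptions, not the project's configuration.

# Simplified stand-in for the project's HHMM: a flat Gaussian HMM fitted to
# hypothetical per-frame motion features, whose decoded states can be read as
# rest vs. preparation/retraction vs. stroke phases.
import numpy as np
from hmmlearn import hmm

# Hypothetical motion features: per-frame hand speed and wrist height
rng = np.random.default_rng(0)
features = np.column_stack([
    np.abs(rng.normal(0.1, 0.05, 2000)),   # speed (m/s)
    rng.normal(1.0, 0.2, 2000),            # wrist height (m)
])

model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
model.fit(features)                  # unsupervised fit here; the project trains on labelled gesture units
states = model.predict(features)     # most likely state per frame (Viterbi decoding)

print(states[:20])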

4. The international impact of the project

The project has been presented at international conferences and received wide attention at an invited plenary presentation at Cambridge, UK (June 2015). Project results have also been presented as invited talks at research seminars in Tilburg, The Netherlands (Oct. 2015); Utrecht, The Netherlands (July 2016) and Aix-en-Provence, France (Oct. 2016).

The project has been represented at three conferences in Sweden and seven international conferences. The Swedish national conferences are Fonetik 2013 (12-13 June, 2013, Linköping), The Fifth Swedish Language Technology Conference (13-14 November 2014, Uppsala) and Fonetik 2015 (8-10 June, 2015, Lund). The international conferences are Tilburg Gesture Research Meeting (19-21 June, 2013, Tilburg University, The Netherlands); The 12th International Conference on Auditory-Visual Speech Processing (AVSP2013) (29 August – 1 September, 2013, Annecy, France); Phonetics and Phonology in Europe 2015 (29-30 June 2015, University of Cambridge, UK); The 14th International Pragmatics Conference (26-31 July 2015, Antwerp, Belgium); Speech Prosody 2016 (31 May – 3 June 2016, Boston, USA); Seventh Conference of the International Society for Gesture Studies (18-22 July, 2016, Paris, France); and International Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (16 November, 2016, Tokyo, Japan). Conference contributions have been accepted and will be presented at the following three upcoming international conferences: International Conference on Multimodal Communication: Developing New Theories and Methods  (9-11 June, 2017, Osnabrück, Germany); Phonetics and Phonology in Europe 2017 (12-14 June, 2017, Cologne, Germany); and 15th International Pragmatics Conference  (16-21 July, 2017, Belfast, Northern Ireland).

In addition to generally strengthening research activities involving gesture and speech at the home department, the project has generated increased collaboration with gesture researchers at Lund University and the University of Copenhagen, particularly within the project “Multimodal levels of prominence”, supported by Stiftelsen Marcus och Amalia Wallenbergs Minnesfond, in which David House is a participant.

5. Dissemination of information outside the scientific community

Project results concerning speech and gesture have been presented at popular science events in Gothenburg organized by the SweClarin initiative. The project has also been in contact with Disney Research, USA, regarding animation technology.

6. The two most important publications of the project and some reflections

Zellers, House & Alexanderson (2016) is the most important publication concerning the temporal relationship between hand gestures and turn transitions. An important finding is that when speakers gesture in the vicinity of a potential turn boundary, the gesture ends before the offset of speech when there is a change of turn. When the speaker wants to keep the turn, the gesture tends to continue after the end of speech, but with a fairly constant end time of about half a second into the pause. The paper also reports results showing a higher and more variable pitch at the end of speech in connection with gestures. These results suggest that gesture functions as part of a prosodic system of turn-taking.

Alexanderson, House and Beskow (2016) is the most important publication involving the development and testing of modeling gesture dynamics using a Hierarchical Hidden Markov Model (HHMM) instead of a rule-based method. The model is trained on labels of complete gesture units and is tested and validated on two datasets differing in genre and in the method used to capture motion. The method outperforms a state-of-the-art classifier on a publicly available dataset. The results have implications for automatic classification of gesture units and for building avatars and robots with natural gesturing and gesture-recognition capabilities.

7.  Publication strategy of the project

The publication strategy has been to publish full-paper refereed conference papers (4), short refereed conference papers (5) and full-paper non-refereed conference contributions (2). All of these papers are open access and freely available on the project webpage and the personal webpages of the authors. Two journal publications have been submitted. If accepted, these publications will also be open access and made available on the webpages of the project and the authors.

Grant administrator: KTH Royal Institute of Technology
Reference number: P12-0634:1
Amount: SEK 2,771,000
Funding: RJ Projects
Subject: General Language Studies and Linguistics
Year: 2012