Prosody in conversation
The visionary goal for the researchers behind Prosody in conversation is to create an artificial conversational partner. While this is far beyond the scope of the project, this goal has already revealed black holes in our knowledge concerning human conversations. For example, state-of-the-art speech technology neither sounds like a conversational partner, nor understands fundamental aspects of human conversational behavior.
Prosody in conversation aims to fill some of these black holes by studying features that are essential for, as well as exclusive to, conversation. More precisely, the project will explore how people talking to each other jointly decide ‘who should speak when’ and the role of prosody (the rhythm and melody of speech) in making such decisions.
The project will study acoustic prosodic features in genuine human-human conversations, as well as the effects of using or introducing such features in conversations. The latter type of study will include manipulation of genuine conversations, as well as the introduction of prosodic features (e.g. in talking computers), together with recordings of conversational behavior in response to such features and subjective opinions of their effects.
The cross-disciplinary research team has a solid background in the analysis of prosody in conversation.
Mattias Heldner, Linguistics, Stockholm University
2009-2013
The purpose of the project was to deepen our knowledge about prosodic characteristics that are specific to conversation and that have an interactional function. The main track of the project, therefore, has been to investigate and model rhythmic patterns and intonation patterns in connection with turntaking and verbal feedback in different Swedish and English spoken language databases. The project has also made efforts to verify the effects of observed prosodic characteristics through perception and production tests, as well as through the generation of interactional behaviours in human-robot interaction in collaboration with the Furhat project. No noteworthy changes to the purpose of the project have been made during the project duration.
The three most important results of the project, and reasoning around these
The three most important results of the project are summarized in these bullets:
(i) Quantitative descriptions of the prosody of conversation (e.g. timing of feedback vocalizations/backchannels in relation to the speech of the interlocutor, prosodic patterns inviting feedback, how frequent feedback may occur, prosodic realisation of feedback and how these vocalizations are adapted to the prosody of the interlocutor);
(ii) Stochastic models of turntaking in interaction given prosodic characteristics (e.g. speech, silence, overlap, speaking rate change, intonation patterns, intensity patterns) providing a framework for an artificial speaker to understand and produce a more humanlike conversational behaviour; and
(iii) Implementation of prosodic behaviour in a physical realisation of an embodied conversational agent (ECA) for verification of the effects of conversation specific prosodic characteristics.
Taken together, these three items represent a large step towards our long-term goal of creating an artificial conversational partner that actually sounds and behaves as if it is participating in a conversation.
New research questions generated by the project
The project plan focuses on prosodic characteristics in conversation, and how these are used in the interaction. Throughout the project, it has become increasingly clear that we lack knowledge of the relation between prosody and the non-prosodic, sometimes non-verbal, characteristics that are also highly relevant for conversational face-to-face interaction, for example the interlocutors' gaze and head movements, facial expressions, and breathing patterns. We have begun investigating this type of information and are in the process of applying for project funds for the investigation of breath, gaze and head pose in conversation this year (2014).
International consolidation
The project results have had quite a strong international impact. As a particularly striking example, our collaborative work with Professor Julia Hirschberg at Columbia University in New York was a key part of the keynote speech that she presented at the Interspeech conference in Florence in 2011, a conference of paramount importance in our field.
Another example of the project's impact is that Mattias Heldner and Jens Edlund have been invited in 2014 to participate in an EU-COST network proposal entitled Dialogue Interaction Across diverse Languages as a result of the project work. Jens Edlund has also (co)organized a number of relevant conferences (the 15th ACM International Conference on Multimodal Interaction (ICMI), 2013; and Speech Prosody, the 7th biennial meeting of the Speech Prosody Special Interest Group, 2014) and workshops (e.g. ICT Workshop on Overlap in Human-Computer Dialogue in Los Angeles, CA, US, 2011; Workshop on Multimodal Corpora, 2012, 2013, 2014; The Interdisciplinary Workshop on Feedback Behaviors in Dialog, 2012; Real-time Conversations with Virtual Agents, 2012; the sixth International Workshop on Disfluency in Spontaneous Speech (DiSS), 2012; and Breathing in Speech and Spoken Interaction, 2014).
The dissemination of project results has been helped greatly by many invitations to present project research in various contexts. These include international summer schools (CLARA Summer School on Semantic and Multimodal Annotation, Copenhagen, 2011; and Summer School in Social Signal Processing, Mullsjö, 2013); invited talks (Carnegie Mellon Silicon Valley branch, US, 2011; Honda Research, Mountain View, US, 2011; Trinity College, Dublin, 2011; The Beckman Institute, University of Illinois at Urbana-Champaign, US, 2012; Columbia University, NYC, 2013; University of Debrecen, Hungary, 2013; QMUL, London, 2014; and GIPSA-lab, Grenoble, 2014); and keynote speeches at The 11th International Conference on Intelligent Virtual Agents (IVA), Reykjavik, Iceland, 2011 and at The 3rd International Workshop on Laughter and Other Non-verbal Vocalisations, Dublin, Ireland, 2013.
The project results also play an important role in several ongoing and recently completed dissertation projects around the world (e.g. Rivka Levitan, Columbia University; Iwan de Kok, University of Twente; Marcin Włodarczak and Hendrik Buschmeier, Universität Bielefeld; Zofia Malisz, AMU Poznan; Catharine Oertel, KTH). Mattias Heldner and Jens Edlund have on several occasions acted as opponent, examiner or grading committee member at dissertation defences as a direct result of project results being a central part of the dissertation. Catharine Oertel came to KTH as a PhD student and Marcin Włodarczak recently began a postdoc at Stockholm University, largely due to contacts with the project.
Furthermore, all listed publications except the two that were published at the Swedish conferences Fonetik and SLTC went through peer review by internationally acknowledged researchers.
Finally, we have throughout the project had the great privilege of working with Kornel Laskowski (formerly at Carnegie Mellon University, now at Voci Technologies, Inc.). This collaboration is still active after the project's completion, something we consider to be very valuable.
Dissemination outside the scientific community
Project activities have been confined, largely, to the scientific community, but Mattias Heldner was interviewed about voice and interaction for a popular science show, Kärlekskoden, which is to be broadcast by Sveriges Television in 2014.
The robotic head Furhat, which we have used in a couple of investigations, has earned rather a lot of attention outside the scientific community. Furhat was for example seen on SVT Rapport in April 2013; introduced a panel debate at Tällberg Forum in June 2012; was displayed at RobotVille at London Science Museum in December 2011; and made it into a number of news broadcasts, including the BBC, as a result.
Two key publications and some reflections on these
We hold Heldner, Hjalmarsson, & Edlund (2013) as the most important publication in the area of descriptions of prosody in conversation. The article has received a lot of attention (e.g. citations and invitations to international symposia and workshops). We used innovative methods and presented entirely original results regarding possible (but not necessarily actual, or exploited) places for backchannels.
Further, we hold Laskowski, Edlund, & Heldner (2011b) as the most important publication in the area of stochastic models of turntaking in interaction based on prosodic characteristics. The article presents a framework for modelling turntaking in multiparty dialogue which is suitable for use in artificial conversational partners.
Publication strategy and comments
We have mainly published at international scientific conferences with peer review (particularly at the main conference of our field, Interspeech) and in some cases at Nordic or national conferences. Open access is secured: all project publications are included in the digital scientific archive DiVA, in some cases as so-called 'author's versions'. The publications are also accessible through the project home page, and through the project participants' personal home pages.
List of publications, and links to web pages
The project web pages are found at: www.speech.kth.se/sampros/
Publication list
Al Moubayed, Samer, Edlund, Jens, & Gustafson, Joakim. (2013). Analysis of gaze and speech patterns in three-party quiz game interaction. In Proceedings Interspeech 2013 (pp. 1126-1130), Lyon, France: ISCA.
Beskow, Jonas, Edlund, Jens, Gustafson, Joakim, Heldner, Mattias, Hjalmarsson, Anna, & House, David. (2010a). Modelling humanlike conversational behaviour. In The third Swedish language technology conference (SLTC-2010), Linköping, Sweden: SLTC.
Beskow, Jonas, Edlund, Jens, Gustafson, Joakim, Heldner, Mattias, Hjalmarsson, Anna, & House, David. (2010b). Research focus: Interactional aspects of spoken face-to-face communication. In Proceedings from Fonetik 2010 (pp. 7-10), Lund.
Edlund, Jens. (2011). In search of the conversational homunculus - serving to understand spoken human face-to-face interaction. Doctoral dissertation, KTH, Stockholm, Sweden.
Edlund, Jens, Heldner, Mattias, & Gustafson, Joakim. (2012a). On the effect of the acoustic environment on the accuracy of perception of speaker orientation from auditory cues alone. In Proceedings Interspeech 2012 (pages not numbered), Portland, OR, USA: ISCA.
Edlund, Jens, Heldner, Mattias, & Gustafson, Joakim. (2012b). Who am I speaking at? Perceiving the head orientation of speakers from acoustic cues alone. In LREC 2012 Workshop: Multimodal Corpora: How Should Multimodal Corpora Deal with the Situation? (pp. 38-41), Istanbul, Turkey: LREC.
Heldner, Mattias, Edlund, Jens, & Hirschberg, Julia. (2010). Pitch similarity in the vicinity of backchannels. In Proceedings Interspeech 2010 (pp. 3054-3057), Makuhari, Japan: ISCA.
Heldner, Mattias, Edlund, Jens, Hjalmarsson, Anna, & Laskowski, Kornel. (2011). Very short utterances and timing in turn-taking. In Proceedings Interspeech 2011 (pp. 2837-2840), Florence, Italy: ISCA.
Heldner, Mattias, Hjalmarsson, Anna, & Edlund, Jens. (2013). Backchannel relevance spaces. In E. L. Asu & P. Lippus (Eds.), Nordic Prosody: Proceedings of the XIth Conference, Tartu 2012 (pp. 137-146), Frankfurt am Main, Germany: Peter Lang.
Hjalmarsson, Anna. (2010). The vocal intensity of turn-initial cue phrases and filled pauses in dialogue. In Proceedings of SIGdial, Tokyo, Japan: SIGdial.
Hjalmarsson, Anna, & Laskowski, Kornel. (2011). Measuring final lengthening for speaker-change prediction. In Proceedings Interspeech 2011 (pp. 2065-2068), Florence, Italy: ISCA.
Laskowski, Kornel. (2012). Exploiting loudness dynamics in stochastic models of turn-taking. In Proceedings of the 4th IEEE Workshop on Spoken Language Technology (SLT2012) (pp. 79-84), Miami, FL, USA: IEEE.
Laskowski, Kornel, Edlund, Jens, & Heldner, Mattias. (2011a). Incremental learning and forgetting in stochastic turn-taking models. In Proceedings Interspeech 2011 (pp. 2069-2072), Florence, Italy: ISCA.
Laskowski, Kornel, Edlund, Jens, & Heldner, Mattias. (2011b). A single-port non-parametric model of turn-taking in multi-party conversation. In Proceedings ICASSP 2011 (pp. 5600-5603), Prague, Czech Republic.
Laskowski, Kornel, Heldner, Mattias, & Edlund, Jens. (2010). Preliminaries to an account of multi-party conversational turn-taking as an antiferromagnetic spin glass. In Proceedings of the NIPS Workshop on Modeling Human Communication Dynamics, Whistler, British Columbia, Canada: NIPS.
Laskowski, Kornel, Heldner, Mattias, & Edlund, Jens. (2012). On the dynamics of overlap in multi-party conversation. In Proceedings Interspeech 2012 (pages not numbered), Portland, OR, USA: ISCA.
Laskowski, Kornel, & Jin, Qin. (2011). Harmonic structure transform for speaker recognition. In Proceedings Interspeech 2011 (pp. 365-368), Florence, Italy: ISCA.
Oertel, Catharine, Włodarczak, Marcin, Tarasov, Alexey, Campbell, Nick, & Wagner, Petra. (2012). Context cues for classification of competitive and collaborative overlaps. In Speech Prosody 2012 (pp. 721-724), Shanghai, China.
Skantze, Gabriel, Oertel, Catharine, & Hjalmarsson, Anna. (2013). User feedback in human-robot interaction: Prosody, gaze and timing. In Proceedings Interspeech 2013 (pp. 1901-1905), Lyon, France: ISCA.