Research

1. Produce

How do speakers talk to technology? For example, our experiments have found that speakers adapt the acoustic properties of their speech (speech rate, pitch, intensity) when talking to a voice assistant, compared to when talking to another person. But many of the ways speakers adapt to local communicative contexts (e.g., a misunderstanding) appear to be parallel in human-human and human-computer interaction.

Adapting speech for voice assistants / errors

Cohn, M., Mengesha, Z., Lahav, M., & Heldreth, C. (2024). African American English speakers’ pitch variation and rate adjustments for imagined technological and human addressees. Journal of Acoustical Society of America (JASA) Express Letters, 4(4).
Cohn, M., Ferenc Segedin, B., & Zellou, G. (2022). The acoustic-phonetic properties of Siri- and human-DS: Differences by error type and rate. Journal of Phonetics. [OA Article]
Cohn, M., & Zellou, G. (2021). Prosodic differences in human- and Alexa-directed speech, but similar error correction strategies. Frontiers in Communication. [OA Article]
Cohn, M., Liang, K., Sarian, M., Zellou, G., & Yu, Z. (2021). Speech rate adjustments in conversations with an Amazon Alexa socialbot. Frontiers in Communication [OA Article]
Cohn, M., Barreda, S., Graf Estes, K., Yu, Z., & Zellou, G. (in prep). Talking to technology: Children and adults produce distinct acoustic adjustments.
Cohn, M., Pycha, A., & Zellou, G. (in prep). Real versus imagined addressees: Prosodic differences across human- and device-directed speech.
Beier, E., Cohn, M. (co-first authors), Trammel, T., Ferreira, F., & Zellou, G. (accepted). Marking Prosodic Prominence for Voice-AI and Human Addressees. [PsyArXiv]
Perkins Booker, N., Cohn, M., & Zellou, G. (2024). Linguistic Patterning of Laughter in Human-Socialbot Interactions. Frontiers in Communication, 9, 1346738

2. Perceive

How do people perceive speech produced by text-to-speech (TTS) voices? For example, people often understand human voices better than TTS voices, but this depends on whether listeners think they are hearing a human or a device. At the same time, we find that people respond to emotion in human and TTS voices in parallel ways.

Perceiving text-to-speech (TTS) voices

Cohn, M., Pushkarna, M., Olanubi, G., Moran, J., Padgett, D., Mengesha, Z., & Heldreth, C. (2024). Believing Anthropomorphism: Examining the Role of Anthropomorphic Cues on Trust in Large Language Models. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. Honolulu, United States. [pdf][video]
Cohn, M., & Zellou, G. (2020). Perception of concatenative vs. neural text-to-speech (TTS): Differences in intelligibility in noise and language attitudes. Interspeech [pdf] [Virtual Talk]
Aoki, N., Cohn, M., & Zellou, G. (2022). The clear speech intelligibility benefit for text-to-speech voices: Effects of speaking style and visual guise. Journal of Acoustical Society of America (JASA) Express Letters. [OA Article]
Cohn, M., Sarian, M., Predeck, K., & Zellou, G. (2020). Individual variation in language attitudes toward voice-AI: The role of listeners’ autistic-like traits. Interspeech [pdf] [Virtual talk]
Zellou, G., Cohn, M., & Block, A. (2021). Partial compensation for coarticulatory vowel nasalization across concatenative and neural text-to-speech. Journal of the Acoustical Society of America [Article]
Block, A., Cohn, M., & Zellou, G. (2021). Variation in Perceptual Sensitivity and Compensation for Coarticulation Across Adult and Child Naturally-produced and TTS Voices. Interspeech. [pdf]

Responses to emotion from voice technology

Cohn, M., Predeck, K., Sarian, M., & Zellou, G. (2021). Prosodic alignment toward emotionally expressive speech: Comparing human and Alexa model talkers. Speech Communication. [OA Article]
Cohn, M., Bandodkar, G., Sangani, R., Predeck, K., & Zellou, G. (2024). Do People Mirror Emotion Differently with a Human or TTS Voice? Comparing Listener Ratings and Word Embeddings. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. Honolulu, United States. [pdf][video]
Cohn, M., Raveh, E., Predeck, K., Gessinger, I., Möbius, B., & Zellou, G. (2020). Differences in Gradient Emotion Perception: Human vs. Alexa Voices. Interspeech [pdf] [Virtual talk]
Cohn, M., & Zellou, G. (2019). Expressiveness influences human vocal alignment toward voice-AI. Interspeech [pdf]
Cohn, M., Chen, C., & Yu, Z. (2019). A Large-Scale User Study of an Alexa Prize Chatbot: Effect of TTS Dynamism on Perceived Quality of Social Dialog. SIGDial [pdf]
Gessinger, I., Cohn, M., Möbius, B., & Zellou, G. (2022). Cross-Cultural Comparison of Gradient Emotion Perception: Human vs. Alexa TTS Voices. Interspeech [pdf]
Zhu, Q., Chau, A., Cohn, M., Liang, K-H, Wang, H-C, Zellou, G., & Yu, Z. (2022). Effects of Emotional Expressiveness on Voice Chatbot Interactions. 4th Conference on Conversational User Interfaces (CUI). [pdf]

3. Learn

How do people learn speech patterns from technology? For example, we found that the type of talker (an apparent human or a voice assistant) shapes how listeners learn a novel shift and how they mirror another speaker’s pronunciation patterns.

Learn a novel pattern

Zellou, G., Cohn, M., & Pycha, A. (to appear). The effect of listener beliefs on perceptual learning. Language.
Cohn, M., Graf Estes, K., & Zellou, G. (in prep). Learning lexical tone from concatenative and neural text-to-speech (TTS) voices.
Ferenc Segedin, B., Cohn, M., & Zellou, G. (2019). Perceptual adaptation to device and human voices: learning and generalization of a phonetic shift across real and voice-AI talkers. Interspeech [pdf]

Mirror/align/imitate another speaker’s patterns

Cohn, M., Keaton, K., Beskow, J., & Zellou, G. (2023). Vocal accommodation to technology: The role of physical form. Language Sciences 99, 101567. [OA Article]
Cohn, M., Ferenc Segedin, B., & Zellou, G. (2019). Imitating Siri: Socially-mediated vocal alignment to device and human voices. ICPhS [pdf]
Cohn, M., Jonell, P., Kim, T., Beskow, J., & Zellou, G. (2020). Embodiment and gender interact in alignment to TTS voices. Cognitive Science Society [OA Article] [Virtual talk]
Dodd, N., Cohn, M., & Zellou, G. (2023). Comparing alignment toward American, British, and Indian English text-to-speech (TTS) voices: Influence of social attitudes and talker guise. Frontiers in Computer Science, 5. [Article]
Zellou, G., Cohn, M., & Ferenc Segedin, B. (2021). Age- and gender-related differences in speech alignment toward humans and voice-AI. Frontiers in Communication [OA Article]
Zellou, G., Cohn, M., & Kline, T. (2021). The Influence of Conversational Role on Phonetic Alignment toward Voice-AI and Human Interlocutors. Language, Cognition and Neuroscience [Article][pdf]
Zellou, G., & Cohn, M. (2020). Top-down effects of apparent humanness on vocal alignment toward human and device interlocutors. Cognitive Science Society [pdf]
Snyder, C., Cohn, M., & Zellou, G. (2019). Individual variation in cognitive processing style predicts differences in phonetic imitation of device and human voices. Interspeech [pdf]

Other work (human-human interaction)

Face-masked speech: when a person is wearing a fabric face mask, how do they adapt their speech when asked to speak ‘clearly’ vs. ‘casually’? We’ve found that speakers produce even clearer speech when wearing a mask (compared to no mask). But when listeners know the speaker was wearing a mask, this can have a detrimental effect on their ability to understand the speaker.

Cohn, M., Pycha, A., & Zellou, G. (2021). Intelligibility of face-masked speech depends on speaking style: Comparing casual, smiled, and clear speech. Cognition [Article] [pdf]
Cohn, M., Pycha, A., & Zellou, G. (in prep). Children’s adaptations across face-masked and unmasked speech.
Pycha, A., Cohn, M., & Zellou, G. (2022). Face-masked speech intelligibility: the influence of speaking style, visual information, and background noise. Frontiers in Communication. [OA Article]
Zellou, G., Pycha, A., & Cohn, M. (2023). The perception of nasal coarticulatory variation in face-masked speech. The Journal of the Acoustical Society of America, 153(2), 1084-1093. [Article]

Perception and production of phonetic detail in human-human interaction

Speakers and listeners show fine-grained control over the cues they produce and perceive. We’ve found that cue weighting varies by age and dialect, as well as by an individual’s nonlinguistic experience (e.g., musical training).

Cohn, M., & Zellou, G. (2023). Selective tuning of nasal coarticulation and hyperarticulation across clear, casual, and fast speech styles. Journal of the Acoustical Society of America (JASA) Express Letters, 3(12).
Zellou, G., & Cohn, M. (2024). Apparent-time variation in the use of multiple cues for perception of anticipatory nasal coarticulation in California English. Glossa: a journal of general linguistics, 9(1).
Cohn, M., Barreda, S., & Zellou, G. (2023). Differences in a musician’s advantage for speech-in-speech perception based on age and task. Journal of Speech, Language, and Hearing Research. [Article] [pdf]
Cohn, M., Zellou, G., & Barreda, S. (2019). The role of musical experience in the perceptual weighting of acoustic cues for the obstruent coda voicing contrast in American English. Interspeech [pdf]