From sensorimotor dynamics, to joint attention and sustained attention, and to word learning: A multimodal pathway

Interacting embodied agents, whether groups of adult humans engaged in a coordinated task, autonomous robots acting in an environment, or a mother teaching a child, must seamlessly coordinate their actions to achieve a collaborative goal. Pursuing a shared goal requires mutual recognition of the goal, appropriate sequencing and coordination of each agent's behavior with that of the others, and predictions from and about the likely behavior of others. The goal of this project is to invent quantitative tools and methods to describe, monitor, and shape the coordinated sensorimotor behavior of interacting agents. Most theories of learning and communication in both cognitive science and computer science have focused on macro-level descriptions, conceptualizing the phenomenon in terms of inferences about the agent's intentions and so-called mind reading. These descriptions may capture higher-level regularities, but they fall far short of a mechanistic account of how learning and interaction happen in real time. We know that young children, for instance, learn words through millisecond-by-millisecond, second-by-second, and minute-by-minute sensorimotor events that are generated by actively engaging with the world, with objects, and with social partners who offer object names, gestures, and actions. Our research suggests that the smoothness of these dynamic couplings, both within the sensorimotor system of the individual and across the coupled dyad, is a critical component of toddlers' prowess in word learning in the cluttered and noisy contexts of everyday life. We seek to describe these cross-agent sensorimotor dependencies using state-of-the-art computational and sensing techniques.

Statistical Cross-Situational Learning

There are an infinite number of possible word-to-world pairings in naturalistic learning environments. Previous proposals to solve this mapping problem have focused on linguistic, social, representational, and attentional constraints at a single moment. This article discusses a cross-situational learning strategy based on computing distributional statistics across words, across referents, and, most importantly, across the co-occurrences of words and referents at multiple moments. In our general experimental paradigm, we briefly exposed adult or young learners to a set of trials, each of which contained multiple spoken words and multiple pictures of individual objects; no information about word-picture correspondences was given within a trial. Nonetheless, over trials, subjects learned the word-picture mappings through cross-trial statistical relations. Across the learning conditions we designed, we found that both young children and adults calculate cross-trial statistics with sufficient fidelity to rapidly learn word-referent pairs, even in highly ambiguous learning contexts. Our present studies focus on understanding the underlying statistical computations as well as extending this paradigm to related language and cognitive learning tasks.
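
To make the cross-trial idea concrete, the following is a minimal sketch, assuming the simplest possible statistic: raw word-object co-occurrence counts accumulated over trials. The trial data, the word forms, and the function names are hypothetical, and the sketch is not a claim about the computation that learners actually perform, which is the open question the studies address.

```python
from collections import defaultdict


def cross_situational_counts(trials):
    """Count how often each word co-occurs with each object across trials.

    Each trial is a (words, objects) pair; nothing inside a trial says
    which word goes with which object.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for words, objects in trials:
        for word in words:
            for obj in objects:
                counts[word][obj] += 1
    return counts


def best_referents(counts):
    """For each word, pick the object it co-occurred with most often."""
    return {word: max(objs, key=objs.get) for word, objs in counts.items()}


# Hypothetical trials: two words and two pictures each, individually ambiguous.
trials = [
    (["bosa", "gasser"], ["dog", "ball"]),
    (["bosa", "manu"], ["dog", "cup"]),
    (["gasser", "manu"], ["ball", "cup"]),
]

counts = cross_situational_counts(trials)
mapping = best_referents(counts)
print(mapping)
```

Every single trial here is ambiguous, yet each word co-occurs with its intended object twice and with every other object only once, so the counts alone separate the correct pairings.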

Egocentric Vision

A key component of the human visual system is attentional control: the selection of which visual stimuli to attend to at any moment in time. Understanding visual attention in children could yield new insight into how the visual system develops during the formative years and how visual attention and selection contribute to development and learning. We use head-mounted cameras to record first-person video from interacting children and parents, giving a good approximation of the contents of their visual fields, and we collect gaze direction data to record where they look within the visual field. We mine these data to study the distributions of gaze patterns within the first-person visual frame for both children and adults. We also study how well visual saliency predicts visual attention as a function of the tasks, actions, and interactions that the participants perform. We find significant differences between children and parents, indicating substantial differences in how bodily actions are coupled with visual attention in developing (child) and developed (adult) visual systems.
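
One simple way to summarize a distribution of gaze within the first-person frame is a coarse spatial histogram. The sketch below is illustrative only: it assumes gaze samples have already been normalized to [0, 1] in both frame dimensions, and the grid size, the sample values, and the function name are hypothetical rather than taken from the project's pipeline.

```python
def gaze_histogram(gaze_points, bins=3):
    """Bin normalized gaze points into a bins x bins grid of proportions.

    gaze_points: list of (x, y) pairs with x and y in [0, 1], where (0, 0)
    is the top-left corner of the first-person video frame.
    """
    if not gaze_points:
        raise ValueError("gaze_points must not be empty")
    grid = [[0] * bins for _ in range(bins)]
    for x, y in gaze_points:
        # Clamp so that a coordinate of exactly 1.0 falls in the last bin.
        col = min(int(x * bins), bins - 1)
        row = min(int(y * bins), bins - 1)
        grid[row][col] += 1
    total = len(gaze_points)
    return [[cell / total for cell in row] for row in grid]


# Hypothetical samples: three fixations near the frame center, one near a corner.
samples = [(0.5, 0.5), (0.45, 0.55), (0.5, 0.4), (0.9, 0.1)]
hist = gaze_histogram(samples)
center_share = hist[1][1]
print(center_share)
```

Computing such a histogram separately for child and parent recordings gives two grids that can be compared cell by cell, for example to ask how much of each participant's gaze falls in the center of their own view.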

Visual Data Mining of Multimedia Multi-Streaming Data

With advances in computing techniques, large amounts of high-resolution, high-quality multimedia data (video, audio, etc.) have been collected in research laboratories across scientific disciplines, particularly in cognitive and behavioral studies. Automatically and effectively discovering new knowledge from such rich multimedia data poses a compelling challenge, since most state-of-the-art data mining techniques can only search for and extract pre-defined patterns or knowledge from complex heterogeneous data. In light of this challenge, we propose a hybrid approach in which scientists use data mining as a first pass and then enter a closed loop: visual analysis of the current results, followed by further data mining inspired by the visualization, the results of which are in turn visualized and lead to the next round of visual exploration and analysis. In this way, new insights and hypotheses gleaned from the raw data and the current level of analysis contribute to further analysis. We have developed various ways to visualize temporal correlations and statistics of multiple derived variables, as well as conditional and higher-order statistics. Our visualization tool allows users to explore, compare, and analyze multi-stream derived variables and, at the same time, switch to the raw multimedia data.
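
As an illustration of the kind of temporal statistic that could feed such a visualization, here is a minimal sketch of a lagged correlation between two derived variable streams. It assumes two equal-length, evenly sampled numeric streams; the stream names, the binary coding, and the sample values are hypothetical, and the tool itself may compute its statistics differently.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def lagged_correlation(a, b, max_lag):
    """Correlate a[t] with b[t + lag] for every lag in -max_lag..max_lag."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[:len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[:len(b) + lag]
        result[lag] = pearson(x, y)
    return result


# Hypothetical binary streams (1 = looking at the target object in that frame).
# The second stream repeats the first one frame later.
child_looks = [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]
parent_looks = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]

correlations = lagged_correlation(child_looks, parent_looks, max_lag=2)
best_lag = max(correlations, key=correlations.get)
print(best_lag, correlations[best_lag])
```

Plotting the correlation against lag shows at a glance whether one stream tends to lead the other, which can then prompt a return to the raw video at the moments that drive the effect.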

Multimodal Human-Robot Interaction

Converging evidence from research on human-human communication and interaction suggests that the exact time course of human behaviors is critical for establishing and sustaining interactions, from natural language exchanges, to joint projects, to acquiring new knowledge through such social interactions. We developed a multimodal platform (instantiating the framework) for studying natural multimodal human-robot interactions. The platform has three critical components that support human-robot interaction studies: 1) the participant and the robot can freely interact with each other through multiple modalities, including speech, vision, gaze, and body movements; 2) a real-time control system allows the robot to respond dynamically to the participant's behaviors based on real-time observations of the participant's actions; 3) we collect multimodal data from both participants and the robot during these interactions, which allows us to analyze dynamic and fine-grained interaction patterns between the two.

Embodied Language Acquisition

Language is about symbols, and those symbols must be grounded in the physical environment during human development.

Developmental and Statistical Models of Early Word Learning

Previous work on early language acquisition has shown that word meanings can be acquired by an associative procedure that maps perceptual experience onto linguistic labels based on cross-situational observation.

Action Recognition

Humans perceive an action stream as a sequence of clearly segmented "action units". This gives rise to the idea that action recognition amounts to interpreting continuous human behavior as a sequence of action primitives such as "picking up a coffee pot".

Multimodal Perceptual Interface

The next generation of computers is expected to interact and communicate with users in a cooperative and natural manner while users engage in everyday activities.


Last modified on Dec, 2014