Multimodal human–agent collaboration

Improving human–agent interactions through gaze input and other collaborative modalities.

Project overview

[Image: A human hand reaches toward a mirror-image robot hand.]

Artificial intelligence (AI) is now mainstream. Most of the world’s largest tech companies, including Microsoft, Google and Facebook, have core AI capabilities, and AI is increasingly used in contexts as broad as assistive technologies, health, logistics, manufacturing, and games. AI techniques are typically based on models, but, like any model, AI models have limited accuracy and completeness. In particular, they often lack detailed information on what to do in unusual, unpredicted situations: situations for which they have seen no training data, or which an expert has not explicitly modelled.

On the other hand, humans recognise unusual situations and adapt to them far more robustly than machines. Enabling people to quickly provide an intelligent agent with additional context and information that is not part of its model should increase the agent's effectiveness and trustworthiness from the perspective of its human users and observers. Knowledge provided by the human user also helps build common knowledge between the human and the agent, enabling the agent to provide assistance that better matches the user's expectations and further improves its effectiveness and trustworthiness. The capacity to model such knowledge is already built into the language of some automated planning agents and algorithms. Currently, however, such knowledge is assumed to be known at design time, and therefore encodable by an AI expert. Our vision is to enable non-AI experts to give an intelligent agent new, contextual knowledge at runtime, so that the agent can re-solve the problem with the updated knowledge and improve its solution.
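The idea of injecting human knowledge into a planning model at runtime can be illustrated with a minimal sketch. The domain, action names, and planner below are hypothetical toy examples (a tiny STRIPS-style breadth-first planner), not the project's actual planning system:

```python
from collections import deque

# Toy STRIPS-style action model: name -> (preconditions, add effects, delete effects).
ACTIONS = {
    "take_corridor": ({"at_office"}, {"at_lobby"}, {"at_office"}),
    "take_stairs":   ({"at_office"}, {"at_lobby"}, {"at_office"}),
    "exit_building": ({"at_lobby"},  {"outside"},  {"at_lobby"}),
}

def plan(state, goal, actions):
    """Breadth-first search over action sequences from state to goal."""
    frontier = deque([(frozenset(state), [])])
    seen = {frozenset(state)}
    while frontier:
        current, steps = frontier.popleft()
        if goal <= current:
            return steps
        for name, (pre, add, delete) in actions.items():
            if pre <= current:
                nxt = frozenset((current - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None

# Design-time model: both routes to the lobby look viable.
print(plan({"at_office"}, {"outside"}, ACTIONS))

# Runtime: a human reports the corridor is blocked -- contextual knowledge the
# model never contained. The invalidated action is removed and the agent replans.
ACTIONS.pop("take_corridor")
print(plan({"at_office"}, {"outside"}, ACTIONS))  # now routes via the stairs
```

The point of the sketch is only the workflow: the human supplies a fact the designer could not anticipate, the model is updated, and the same solver produces a better-informed plan.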

Recent developments in natural user interface (NUI) technologies provide exciting opportunities for an artificial agent to capture and integrate new knowledge. Technologies that enable the capture of gaze, touch, voice, and gesture can provide a rich, multimodal interaction between people and artificial agents. Such technologies provide the potential to leverage the flexibility and adaptability of a human to enhance the powerful problem-solving capabilities of artificial agents. This project is an opportunity to collaboratively integrate research and practice from interaction design with computationally powerful artificial intelligence.

The objective of this project is to improve the interactions between an intelligent agent and a human user in the context of an intelligent assistant powered by AI planning. The project explores the potential of human gaze and other modalities, such as touch, as a means of transferring knowledge from a human to an intelligent assistant. We will investigate how to design human–agent interactions that capture this knowledge, and how an artificial agent can then use it effectively to improve its performance.
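One way gaze can inform such an assistant is as evidence for intention recognition: the agent maintains a belief over the user's candidate goals and updates it as fixations land on goal-related objects. The sketch below is a hedged illustration using a simple Bayesian update; the goal names, relevance sets, and likelihood model are illustrative assumptions, not the project's actual formulation:

```python
# Illustrative gaze-based intention recognition: a posterior over candidate
# goals, updated after each gaze fixation.

def update_posterior(posterior, fixation, relevance, p_hit=0.7):
    """One Bayesian update. relevance[goal] is the set of objects a user
    pursuing that goal is expected to look at; p_hit is the assumed chance
    of fixating a relevant object under that goal."""
    likelihood = {
        goal: p_hit if fixation in objects else (1 - p_hit)
        for goal, objects in relevance.items()
    }
    unnormalised = {g: posterior[g] * likelihood[g] for g in posterior}
    z = sum(unnormalised.values())
    return {g: p / z for g, p in unnormalised.items()}

relevance = {"make_tea": {"kettle", "mug"}, "make_toast": {"toaster", "bread"}}
posterior = {"make_tea": 0.5, "make_toast": 0.5}

# A stream of fixations gradually shifts belief toward the tea-making goal.
for fixation in ["kettle", "mug", "kettle"]:
    posterior = update_posterior(posterior, fixation, relevance)

print(max(posterior, key=posterior.get))  # the agent's current best guess
```

In the project's setting, an update of this kind would be combined with planning-based cost estimates rather than fixed relevance sets, so the belief reflects both where the user looks and how cheaply each goal could be achieved.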


This project is a collaboration between the University of Melbourne’s Microsoft Research Centre for Social Natural User Interfaces (SocialNUI) and researchers in the A.I. and Autonomy Lab in the School of Computing and Information Systems at the University of Melbourne.

Project team

  • Ronal Singh, Research Fellow, School of Computing and Information Systems, The University of Melbourne
  • Tim Miller, Professor, School of Computing and Information Systems, The University of Melbourne
  • Liz Sonenberg, Professor, School of Computing and Information Systems, The University of Melbourne
  • Frank Vetere, Professor, The University of Melbourne
  • Eduardo Velloso, Lecturer, School of Computing and Information Systems, The University of Melbourne
  • Joshua Newn, Research Fellow, School of Computing and Information Systems, The University of Melbourne

Contact details


Publication

Singh, R., Miller, T., Newn, J., Sonenberg, L., Velloso, E. & Vetere, F. (2018) Combining Planning with Gaze for Online Human Intention Recognition. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 488–496.