CoSy: Cognitive Systems for Cognitive Assistants

The PlayMate: An Object Manipulation Scenario


Learning to categorise objects

A robot able to handle novel objects must be able to identify not only specific objects, for example my mobile phone, but also instances of a category, such as any mobile phone. We have developed new methods that are able to recognise objects at the category level. In addition, we want the robot to be able to learn to recognise everyday objects. We have devised new methods for solving this problem, and shown how we can tutor the robot via natural language. So to teach the robot we might show it a phone and say "This is a phone." The robot learns a visual representation of the object that is suitable for category-level recognition. However, we do not want to teach the robot about every member of a category of objects. Conveniently, the robot can also learn without a tutor, categorising objects by grouping them according to their inherent appearance. This unsupervised learning is quite good, but it does make mistakes. To get around this problem we allow the human to tell the robot about a few examples of each object type, thus making the learning both fast and precise. Our methods are appearance based, which means that to learn to recognise an object type from a given viewpoint the robot has to be trained with broadly similar viewpoints; the methods are nevertheless robust enough to allow some variation in the orientation of the object.
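The combination of unsupervised grouping with a few tutor-provided labels can be illustrated with a small sketch. This is not the project's method: plain k-means is swapped in as a stand-in for the grouping step, and the 2D feature vectors, the initial centres and the function names are all hypothetical.

```python
from collections import Counter

def nearest(point, centres):
    """Index of the centre closest to point (squared Euclidean distance)."""
    return min(range(len(centres)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centres[i])))

def kmeans(points, centres, iterations=10):
    """Plain k-means from given initial centres; returns (centres, assignments)."""
    centres = [tuple(c) for c in centres]
    for _ in range(iterations):
        assign = [nearest(p, centres) for p in points]
        for k in range(len(centres)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # keep the old centre if a cluster is empty
                centres[k] = tuple(sum(d) / len(members) for d in zip(*members))
    return centres, [nearest(p, centres) for p in points]

def name_clusters(assign, tutor_labels):
    """Give each cluster the majority label among its tutor-labelled members."""
    votes = {}
    for idx, label in tutor_labels.items():
        votes.setdefault(assign[idx], Counter())[label] += 1
    return {k: c.most_common(1)[0][0] for k, c in votes.items()}

# Hypothetical 2D appearance features for six objects (assumption, not real data).
features = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
            (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]

# Step 1: group objects by appearance without a tutor.
centres, assign = kmeans(features, centres=[features[0], features[3]])

# Step 2: the tutor names just one example per group ("This is a phone.").
names = name_clusters(assign, tutor_labels={0: "phone", 4: "mug"})
labels = [names.get(a, "unknown") for a in assign]
```

Here two spoken labels are enough to name all six objects, because the grouping step has already done most of the work; a cluster that received no label at all would be reported as "unknown".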

Appearance based learning of object parts for recognition

Much work on object recognition in the 1980s and early 1990s focussed on the idea of part-based recognition. The core idea is that we can describe most objects using a small number of commonly used parts. If we can recognise these parts and their configuration, we can recognise a wide range of objects. At that time much work on object recognition attempted to reconstruct a 3D model of the object. This is a very difficult problem, and is still essentially unsolved. Because of the difficulty of 3D reconstruction, many researchers turned to recognition methods that encode the appearance of an object from a particular viewpoint, rather than its 3D structure. In this project we have developed new methods for learning the parts of which an object is composed, but in an appearance-based framework.

Situated dialogue and spatial reference

When humans make references to objects they often do so using spatial references that employ other objects as landmarks. For example, I might say "Pass me the mug next to the phone". In addition, humans are quite efficient communicators, in that we prefer references to an object that are easy for the listener to process visually. So I would refer to "the red mug" in preference to "the mug to the left of the phone". Typically we only use spatial referencing when other forms of reference are ambiguous, for example if there is more than one red mug in the scene. For a robot to understand references to objects in natural dialogue it must be able to process and make the right kinds of reference. In addition, interpreting spatial references requires that the robot has a model of what it means for an object to be to the left of another object. We have built a system that is able to match references to objects in dialogue with the objects it can see in the scene. This means that the robot can have a relatively natural dialogue with a human.
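The preference ordering described above (object type first, then colour, and a spatial reference only when the simpler forms are ambiguous) can be sketched as follows. This is an illustrative sketch rather than the project's system: the scene representation, the `relation` test with its margin, and the assumption that the landmark itself is unambiguous are all simplifications introduced here.

```python
def relation(a, b, margin=0.05):
    """Hypothetical qualitative model of left/right based on x coordinates."""
    if a["x"] < b["x"] - margin:
        return "to the left of"
    if a["x"] > b["x"] + margin:
        return "to the right of"
    return None

def describe(target, scene):
    """Return the simplest description that picks out target in the scene."""
    # 1. The type alone is enough if no other object shares it.
    rivals = [o for o in scene if o is not target and o["type"] == target["type"]]
    if not rivals:
        return f"the {target['type']}"
    # 2. Otherwise try colour plus type.
    rivals = [o for o in rivals if o["colour"] == target["colour"]]
    if not rivals:
        return f"the {target['colour']} {target['type']}"
    # 3. Only now fall back on a spatial reference to a landmark of another type.
    #    Simplification: the landmark is assumed to be unambiguous itself.
    for landmark in scene:
        if landmark["type"] == target["type"]:
            continue
        rel = relation(target, landmark)
        if rel and all(relation(o, landmark) != rel for o in rivals):
            return f"the {target['colour']} {target['type']} {rel} the {landmark['type']}"
    return None  # no distinguishing description found

# Hypothetical scene: x grows from left to right in the robot's view.
phone = {"type": "phone", "colour": "black", "x": 0.5}
mug_a = {"type": "mug", "colour": "red", "x": 0.2}
mug_b = {"type": "mug", "colour": "red", "x": 0.8}
mug_c = {"type": "mug", "colour": "blue", "x": 0.6}
scene = [phone, mug_a, mug_b, mug_c]

descriptions = [describe(o, scene) for o in scene]
```

The same `relation` model can be run in the other direction to interpret an utterance: filter the scene for objects whose type, colour and relation to the landmark match the words heard.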

Planning high and low level actions

Suppose you tell a robot to "Put the fork to the left of the plate, and the knife to the right". To plan this activity the robot has to reason both at the level of qualitative spatial reference ("left of") and at the continuous level, that is, where precisely it should put the fork. In our approach we use a mapping between separate qualitative and continuous models of space. This allows the robot to look at a scene and extract both the precise spatial positions of the objects and the resulting qualitative spatial model, for example that the fork is behind the plate. When carrying out a task the robot plans at the high level first (for example, it plans to pick up the fork and put it down to the left of the plate), and then uses the mapping to the precise model of space to pick a precise location in which to place the fork. We then use a probabilistic road map planner to generate the precise trajectory for the robot arm, avoiding obstacles on the way.
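A minimal sketch of such a two-way mapping is given below, under stated assumptions: objects are reduced to 2D points on the table, x grows to the right and y grows away from the robot, and the margin, clearance and grid step are hypothetical values. It is not the project's spatial model, and the probabilistic road map stage is not shown.

```python
import math

def qualitative(a, b, margin=0.05):
    """Continuous to qualitative: the relation of point a to point b."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    if abs(dx) >= abs(dy):
        if dx < -margin:
            return "left_of"
        if dx > margin:
            return "right_of"
    else:
        if dy > margin:
            return "behind"
        if dy < -margin:
            return "in_front_of"
    return "at"

def place(relation, landmark, obstacles, clearance=0.08, step=0.02, reach=0.3):
    """Qualitative to continuous: the grid position nearest the landmark that
    satisfies the relation and keeps a clearance from every object."""
    n = int(round(reach / step))
    best = None
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            p = (landmark[0] + i * step, landmark[1] + j * step)
            if qualitative(p, landmark) != relation:
                continue
            if any(math.dist(p, o) < clearance for o in obstacles + [landmark]):
                continue
            if best is None or math.dist(p, landmark) < math.dist(best, landmark):
                best = p
    return best

# Hypothetical scene: a plate with a glass already standing to its left.
plate, glass, fork = (0.5, 0.5), (0.4, 0.5), (0.5, 0.7)

current = qualitative(fork, plate)          # qualitative model of the scene
spot = place("left_of", plate, [glass])     # precise target for the high-level goal
```

The high-level plan only needs the symbol `left_of`; the precise coordinates in `spot` are chosen afterwards and would then be handed to a trajectory planner.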

Architectures for robot cognition

How should we put together the pieces of an intelligent system? Our approach is to group processes that share representations, with the processes in each group communicating through a shared working memory. Such a group, together with its memory, is called a sub-architecture. The complete system is composed of a number of these. In AI terms this is very similar to what is known as a distributed blackboard architecture. The ability of each sub-architecture, or even the processes within a sub-architecture, to run concurrently, often on different computers, is central to our approach. We have found that this concurrency enables the robot to process information from utterances at the same time as looking at the scene and reasoning about spatial relations. This model of cognition creates many challenges, not least of which is the challenge of engineering such large distributed real-time systems. Our demonstration systems currently contain about 35 basic components, distributed over seven sub-architectures. This concurrency also means that the architecture must have techniques for managing the way information and knowledge flow around the system. Our work has resulted not just in the conceptual architecture and demonstrable systems, but also in a software toolkit that enables the relatively rapid engineering of such cognitive systems. In the third year of the project we showed experimentally that employing a multiple workspace model, where components are grouped according to their need for shared data, results in advantages in processing speed and response to change.
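The idea of components grouped around a shared working memory can be sketched in a few lines. The class and function names below are hypothetical and are not the API of the project's toolkit; callbacks are invoked synchronously to keep the example deterministic, whereas the real components run concurrently and often on different machines.

```python
import threading
from collections import defaultdict

class WorkingMemory:
    """Shared store for one sub-architecture; components subscribe to entry types."""

    def __init__(self, name):
        self.name = name
        self._entries = {}
        self._listeners = defaultdict(list)
        self._lock = threading.Lock()
        self._next_id = 0

    def subscribe(self, entry_type, callback):
        """Register a component to be told about new entries of a given type."""
        self._listeners[entry_type].append(callback)

    def add(self, entry_type, data):
        """Store an entry and notify the subscribed components."""
        with self._lock:
            entry_id = f"{self.name}:{self._next_id}"
            self._next_id += 1
            self._entries[entry_id] = (entry_type, data)
            listeners = list(self._listeners[entry_type])
        for callback in listeners:  # called outside the lock so callbacks may write
            callback(entry_id, data)
        return entry_id

    def get(self, entry_id):
        with self._lock:
            return self._entries[entry_id]

# Two hypothetical sub-architectures, each with its own working memory.
vision = WorkingMemory("vision")
binding = WorkingMemory("binding")

def classify_colour(entry_id, region):
    """Vision component: reacts to new regions and writes a colour entry."""
    colour = "red" if region["hue"] < 30 else "other"
    vision.add("colour", {"region": entry_id, "colour": colour})

def forward_to_binding(entry_id, fact):
    """Only selected results cross into another sub-architecture."""
    binding.add("proxy", fact)

vision.subscribe("region", classify_colour)
vision.subscribe("colour", forward_to_binding)

region_id = vision.add("region", {"hue": 10})
```

Components never call each other directly: they only read and write memory entries, which is what lets them be moved between processes or machines and what keeps most traffic local to one sub-architecture.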

Cross-modal learning of visual qualities

How can we teach a robot the meaning of the word "red"? We take a simple approach in which information from the camera about the most recently observed object is associated with the correct parts of the utterances describing the object. This results in the ability to teach the robot by showing it objects and describing them. If I present the robot with an object and tell it "This is a small yellow thing", it will update the associations between simple visual properties of the object (such as its bounding perimeter, hue, saturation and intensity) and the qualities "yellow" and "small" in the communication system. After a small number of examples (tens of objects) the learning system can learn descriptions for objects that are quite reliable. In the third year of the project we showed how unlearning can be incorporated to deal with overgeneralisation in reference.
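One simple way to picture this kind of incremental association is shown below. It is a sketch under stated assumptions and not the project's learning algorithm: features are assumed to be scaled to the range 0 to 1, hue is treated as an ordinary number rather than a circular one, each word is tied to the single feature that varies least across its examples, and unlearning is not modelled. The class name, feature names and tolerance are hypothetical.

```python
class QualityLearner:
    """Associates words with visual features using running statistics."""

    def __init__(self):
        self.stats = {}  # word -> feature -> (count, mean, sum of squared deviations)

    def teach(self, words, features):
        """Update every taught word with the features of the object just shown."""
        for word in words:
            per_word = self.stats.setdefault(word, {})
            for name, x in features.items():
                n, mean, m2 = per_word.get(name, (0, 0.0, 0.0))
                n += 1
                delta = x - mean
                mean += delta / n            # Welford's incremental update
                m2 += delta * (x - mean)
                per_word[name] = (n, mean, m2)

    def best_feature(self, word):
        """The feature whose value varies least across examples of the word."""
        per_word = self.stats[word]
        return min(per_word, key=lambda f: per_word[f][2] / per_word[f][0])

    def describe(self, features, tolerance=0.1):
        """Words whose best feature is close to the new object's value."""
        words = []
        for word, per_word in self.stats.items():
            feature = self.best_feature(word)
            if abs(features[feature] - per_word[feature][1]) < tolerance:
                words.append(word)
        return sorted(words)

learner = QualityLearner()
# Hypothetical tutoring session: "This is a small yellow thing", and so on.
learner.teach(["small", "yellow"], {"hue": 0.16, "size": 0.10})
learner.teach(["large", "yellow"], {"hue": 0.17, "size": 0.90})
learner.teach(["small", "red"], {"hue": 0.00, "size": 0.12})
learner.teach(["large", "red"], {"hue": 0.01, "size": 0.85})

words = learner.describe({"hue": 0.165, "size": 0.11})
```

Because "yellow" was heard with both small and large objects, its size statistics spread out while its hue statistics stay tight, so the learner ties the word to hue without ever being told which feature matters.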

Learning and Recognition of Intentional Human Actions

In order to understand or imitate human actions a robot needs to build a model of them. We have developed a method for representing and recognising actions that enables the robot to learn by watching example video clips of known actions, and then to classify new action sequences that it sees. This will be used in later work to allow the robot to watch a human performing some activity with objects, and then to reproduce that activity. This will require that the robot is able to identify the intention of an action, so that it achieves the purpose rather than just slavishly copying the human.

Planning of sensory processing

Since visual scenes are so complex, and therefore beyond complete visual analysis, robots must tailor their visual processing to the task in hand. For a flexible robot this means that it must decide on the fly which visual processing to perform. One way to do this is to use planning. In this piece of work we used continual planning with assertions as a way of generating plans for visual processing of a scene so that the robot can answer queries. The work so far is only a demonstration of the principle, so the visual operators and plans are both quite simple. If a plan fails because a step of the visual processing does not return what is expected, then replanning is triggered. In the next period we will look at a decision-theoretic approach to planning which takes into account the degree of unreliability of visual processing.
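The plan, execute, monitor and replan cycle can be sketched as follows. This illustrates only the replanning loop: it does not implement continual planning with assertions, and the operator names, their preconditions and effects, and the rule of never retrying a failed operator are hypothetical choices made for the example.

```python
from collections import deque

# Hypothetical visual operators: name -> (preconditions, effects).
OPERATORS = {
    "segment": (set(), {"regions"}),
    "colour_histogram": ({"regions"}, {"colour"}),
    "colour_by_template": ({"regions"}, {"colour"}),
    "shape_match": ({"regions"}, {"shape"}),
}

def plan(state, goal, banned=frozenset()):
    """Breadth-first search for a shortest operator sequence reaching the goal."""
    start = frozenset(state)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        facts, steps = queue.popleft()
        if goal <= facts:
            return steps
        for name, (pre, eff) in OPERATORS.items():
            if name in banned or not pre <= facts:
                continue
            nxt = frozenset(facts | eff)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [name]))
    return None  # the goal cannot be reached with the allowed operators

def run(goal, execute):
    """Execute a plan step by step; replan when a step does not deliver."""
    state, banned, trace = set(), set(), []
    steps = plan(state, goal)
    while steps:
        name = steps.pop(0)
        trace.append(name)
        if execute(name):
            state |= OPERATORS[name][1]
        else:
            banned.add(name)  # simplification: never retry a failed operator
            steps = plan(state, goal, banned)
            if steps is None:
                return False, trace
    return goal <= state, trace

# The query "what colour is it?" becomes the goal fact "colour". Here the
# histogram operator is made to fail, which triggers replanning.
ok, trace = run({"colour"}, execute=lambda name: name != "colour_histogram")
```

Replanning starts from the facts already established, so the segmentation result is reused and only the failed colour step is replaced by the alternative operator.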


Last modified: 8.1.2009 15:54:23