Evaluating The Evaluator

© Richard Bolstad

Is There An Objective In The House?

For people who advocate the use of objectives, NLP Trainers have shown a remarkable lack of interest in whether their own NLP Trainings actually reach, or even have, any measurable objectives. The requirements for NLP certification in all the major international certifying bodies (IANLP, INLPTA, SNLP etc) tend simply to list content areas and specify times to be spent studying them. In redressing that, our aims here are rather modest. We will equip you with the core definitions used in standard assessment and evaluation theory, and make sense of them with examples from NLP training.

Standards, Performance Criteria and Range

Our organisation is in the process of having Practitioner and Master Practitioner level trainings registered as Certificates with the New Zealand Qualifications Authority. To do so we first needed to translate our content areas into what are called “Unit Standards” and “Performance Criteria” (Walklin, 1991, p 1-7). These are the educational equivalent of well-formed outcomes. The idea of standards based training, which we generally agree with, is to focus not on what happens in training itself (how many hours the learner sat in a classroom, how many pages of notes they were given, and so on) but on what they can do after they have completed the training. “What they can do” includes:

  • The desired knowledge. This is the “what” in the 4MAT model (Bolstad, 1997). (eg being able to explain what “Rapport” is)
  • The desired skills. This is the “How” and the “What if?” in the 4MAT. (eg being able to establish rapport with a person)
  • The desired attitudes. This is the 4MAT’s “Why?”. (eg being willing to use rapport skills, and considering them useful).

Luckily, these three areas are linked together. Using rapport skills effectively requires knowing what rapport is. In general, Social Psychology research shows that the more someone uses a certain skill, the more they come to value it (Myers, 1983, p 44-69).

The notions of standards and performance criteria are part of an approach to assessment that is called criterion referenced assessment (Stock et alia, 1987, p 23-25). In criterion referenced assessment, you are checking that trainees meet your criteria (eg that they’re “good enough” to be called an NLP Practitioner). The other, more traditional approach to assessment was norm referenced assessment. In norm referenced assessment, the teacher’s aim was to check who was “best”. If you are having brain surgery, you don’t want a norm referenced surgeon; trust me! You don’t want to know that your surgeon was the best in their class (after all, it could have been a bad year or a hopeless medical school). You want to know that they can do the stuff they’re planning to; that they met the criteria. As far as I know, all NLP certification assessment used so far is criterion referenced.
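
To make the contrast concrete, here is a minimal sketch in Python. The names, scores and pass mark are all invented for illustration; this shows the two logics, not any actual certification scheme:

```python
# A minimal sketch of criterion vs norm referenced evaluation.
# All names, scores and the pass mark are invented for illustration.
scores = {"Ana": 82, "Ben": 74, "Cat": 91, "Dev": 68}

def criterion_referenced(scores, pass_mark=75):
    """Pass everyone who meets the criterion, however many that is."""
    return {name: score >= pass_mark for name, score in scores.items()}

def norm_referenced(scores, top_fraction=0.5):
    """'Pass' the best performers relative to the group, whatever their level."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = set(ranked[:max(1, int(len(ranked) * top_fraction))])
    return {name: name in best for name in scores}

print(criterion_referenced(scores))  # Ana and Cat pass: they met the criterion
print(norm_referenced(scores))       # Ana and Cat "pass": they merely ranked highest
```

Note that with a weak cohort the norm referenced version still passes the top half, even if nobody met the criterion; that is exactly the “bad year” problem with the surgeon.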

The idea of setting standards is that if someone really learns something, then they will be able to show you that they can use that learning consistently and reliably in future situations. What counts is not how much written or spoken information they were given or how much time they spent practising some skill. What counts is if they can demonstrate the desired skills, knowledge and attitudes reliably afterwards when asked to. Their behaviour will meet a “standard”. For example, one standard in NLP Practitioner training might be:

Trainees are able to establish rapport.

Now, you’ll probably ask “Well, how do we know whether they’ve met the standard?” We need to be a little more sensory specific. Performance Criteria are sensory specific ways to measure a standard. For example, to build rapport, we might say that the trainee needs to calibrate the behaviours of the other person (such as their breathing rate), pace those behaviours, lead those behaviours, and re-calibrate to check if leading is occurring. If we were watching the trainee after they learned rapport, we could ask ourselves:

-Was the person’s behaviour calibrated?
-Was the calibrated behaviour paced?
-Was the trainee’s behaviour then adjusted (leading)?
-Was the person’s behaviour re-calibrated to check that leading is occurring?

These are performance criteria. To get even more precise, we might define the words we use by referring to some standard definition, eg

For definitions of rapport, pacing, leading and calibration, see Bolstad/Hamblett, 1998  

Also we could define the “range” of the word “behaviour” (which behaviours we want them to be able to calibrate, pace and lead, at a minimum) eg:

Range of “behaviour” is: physical posture, gestures, breathing rate, voice tone, speed of speech, key words, sensory predicates.
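
If it helps to see these three pieces in one place, here is a minimal sketch in Python of a standard with its performance criteria and range, checked against what an observer actually saw. The structure and names are our own illustration, not a prescribed format:

```python
from dataclasses import dataclass, field

@dataclass
class Standard:
    statement: str          # the standard itself
    criteria: list          # sensory-specific performance criteria
    range: list = field(default_factory=list)  # minimum behaviours covered

rapport = Standard(
    statement="Trainees are able to establish rapport.",
    criteria=[
        "calibrated the other person's behaviour",
        "paced the calibrated behaviour",
        "adjusted own behaviour (leading)",
        "re-calibrated to check that leading occurred",
    ],
    range=["physical posture", "gestures", "breathing rate", "voice tone",
           "speed of speech", "key words", "sensory predicates"],
)

def meets_standard(standard, observed):
    """The standard is met only when every performance criterion was observed."""
    return all(criterion in observed for criterion in standard.criteria)

# An observer records what they actually saw:
seen = {"calibrated the other person's behaviour", "paced the calibrated behaviour"}
print(meets_standard(rapport, seen))  # False: leading and re-calibration not observed
```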

You cannot effectively assess people’s behaviour unless you first know what your objective was! It’s fairly obvious, in NLP terms, that having such objectives for our training will deliver better results than simply having a list of what we cover that says “5. Rapport skills”. It’s also clear in NLP terms that our trainees will benefit by knowing the standards against which they are being assessed. After all, one of the things we teach is that it helps to know what your goals are. Amazingly, very few NLP trainees get this information.

Validity and Reliability

Validity and Reliability are two ways of checking on your assessments (Walklin, 1991, p 10). When we go to check that trainees have met these performance criteria, we want to make sure that the method we use to check with (the “assessment”) actually checks for the thing we meant it to. This is called “Validity”. For example, if we decided to assess the ability to establish rapport by asking if someone could write a definition of rapport, or by noticing if the person they sit down with smiles at them, these would probably not be valid assessments. NLP assessments that we have reviewed are frequently low on validity. What is called in some courses a “written integration” may in fact not test for or encourage integration at all, but merely test the ability to copy isolated answers from a manual. If we really wanted Practitioners to integrate their learning, we could set performance criteria for this, and design a valid method of checking that integration is occurring.

We also want to make sure that the method we use will work reliably, meaning that the assessment gets the same result each time it is used. For example, let’s say we ask our Training Assistants to sit in on pairs doing an exercise at the training and check that people are able to establish rapport. We may find that one Assistant sits closer in than the others, frowns a lot, and creates a climate where the trainees get more anxious, and stop using their rapport skills. Another Assistant may sit further away, smile, and demonstrate the use of rapport skills herself as she checks. The two Assistants may thus get quite different results when they assess the same trainee, meaning that the assessment is not “reliable”. Much of the assessment on our trainings is done by Assistants in this format, so we pay attention to ensuring that Assistants assess in a consistent way, and that trainees are assessed by more than one Assistant.
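
Reliability can itself be checked. The simplest statistic is percent agreement between two assessors rating the same trainees; the data below is invented for illustration:

```python
# Ten trainees assessed by two Assistants on the same criterion
# ("established rapport"); 1 = met, 0 = not met. The ratings are invented.
assistant_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
assistant_b = [1, 0, 0, 1, 1, 1, 1, 1, 0, 0]

agreement = sum(a == b for a, b in zip(assistant_a, assistant_b)) / len(assistant_a)
print(f"Percent agreement: {agreement:.0%}")  # 70%: low enough to question reliability
```

More careful statistics such as Cohen’s kappa correct for the agreement raters would reach by chance, but even raw percent agreement will reveal the problem with our frowning Assistant.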

Assessment and Evaluation

In traditional educational theory (Stock et alia, 1987, p 17, and p 31-32), “Assessment” refers to checking each performance criterion in the way we’ve been discussing it. “Evaluation” refers to the judgements we make based on that assessment. Checking that someone can breathe in time with another person is an assessment. Deciding whether someone qualifies as an NLP Practitioner is an evaluation.

Assessment itself can be either formative (happening as the training is going on) or summative (happening at the end of a training). The idea of summative assessment is to be able to check that the trainee can do “everything” you wanted them to, at the same time if need be. It’s an impressive aim, but in practice it’s pretty much an impossible dream (despite the fact that schools keep trying to do it!). Summative assessment leaves you with some difficult choices in evaluation. Can you really test the whole Practitioner course, and if not, how do you choose which samples to include in the test? Do you “fail” someone if they miss pieces, or do you tell them to review set pieces? Formative assessments enable you to make changes to the training as you go along, in order to re-teach sections that the assessment shows were not learned.

Assessment Tools

One of the most common ways of categorising tests is to call them “Objective” or “Subjective” (Stock et alia, 1987, p 85-100). Objective tests are those where the correct response is precisely determined. Obviously, if the answer is either correct or not, the test is more likely to be reliable (meaning it will work the same way with every trainee; see above). However, the test may not be valid! (It may not be testing what you intended.) There are a lot of very sophisticated ways to find out if a test, or an item in a test, is valid (Stock et alia, 1987, p 91-120). Often these involve analysing the results of using the tests. One important way of checking the validity of an objective assessment is to use a “taxonomy” (classification) of educational tests, which will be described below.

Objective Tests

For now, note that examples of objective tests include:

  • Multiple choice or multiple response items. e.g.

    The establishment of rapport is a result of:
  1. calibration
  2. matching behaviour
  3. leading
  4. both 1 and 2
  5. all of 1, 2 and 3

Newcomers to assessment sometimes generate “objective items” which are not so valid. One common error with multiple choice is to make the correct answer obvious by its precision, as in the following case:

    A strategy, in NLP terminology, is:

  1. a sensory system
  2. a positive anchor
  3. the order and sequence of internal representations leading to an outcome
  4. a submodality change
  5. a change process
  • Sequencing or matching questions. e.g. Link the following predicate words to their sensory system using numbers. (Visual and perspective are already linked as an example for you)

          Visual (1)               consider
          Auditory                 perspective (1)
          Kinesthetic              stink
          Olfactory                resonate
          Gustatory                bitter
          Auditory digital         grasp

  • True/False. e.g. The presence of leading demonstrates the establishment of rapport.

  • Recall/completion exercises. e.g. On the following diagram show the position of Visual Recall in a normally organised right handed person.
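
All of these item types share the property that each has exactly one keyed response, so marking can be automated and is therefore highly reliable (whatever its validity). As a minimal Python sketch, with invented item names and an invented answer key:

```python
# A minimal marking sketch. The item names, answer key and responses
# are invented; the trainer supplies the actual keyed answers.
answer_key = {"item1": "4", "item2": "true", "item3": "kinesthetic"}
responses  = {"item1": "4", "item2": "false", "item3": "kinesthetic"}

def mark(answer_key, responses):
    """Objective items score identically no matter who (or what) marks them."""
    correct = sum(responses.get(item) == answer for item, answer in answer_key.items())
    return correct / len(answer_key)

print(f"Score: {mark(answer_key, responses):.0%}")  # 67%
```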

Subjective Tests

Subjective tests require the assessor to make more judgements about the quality of the trainee’s response, and are thus more subject to biases such as the “halo” effect (the sense that if most of it is correct, probably it all is). Following are some Subjective testing methods.

  • Essays and Essay questions. e.g. 1) A client tells you she suffers from panic attacks before school tests.
    A) Describe the process of collapsing anchors.
    B) Discuss the pros and cons of using collapsing anchors with test anxiety.

Such questions can be made “more objective” by the establishment of clear criteria by the trainer. For example, in A) above, the criteria might be:
– refers to the creation of two anchors (present state and desired state, though this terminology is not required)
– describes the simultaneous firing of the two anchors.

  • Diaries and logs of work completed. For example: Keep a record (100 words per entry) of the results of work with your next ten clients.
  • Case studies. For example: Complete a one hour NLP change session with a client and write a 1000 word report on it using the RESOLVE model for NLP therapy.
  • Interviews. Asking questions such as: How would you help someone with a phobia of movie theatres?
  • Appraisal forms. A checklist or rating scale such as the one an assessor uses when assessing the competence of an applicant for a driver’s licence (a sketch of one possible form follows this list).

Checklists are often applied when observing or in a simulation (see below). The more sensory specific they are, the more reliable they become.

  • Simulations. The trainer sets up conditions as close to reality as possible, but able to be assessed. For example, Practitioner trainees could be asked to set a well-formed outcome with an actor trained to violate conditions of well-formedness in a reliable way (saying what they don’t want instead of stating the outcome positively and so on). The trainer would observe each trainee’s response.
  • Observation of a skill. A task is set and either the results are assessed (eg, in the case of observing someone establishing rapport, one might check if leading occurred) or the process is assessed (eg in the case of establishing rapport, one might check if the person paced behaviour and tested for leading). Most NLP trainers would agree that they want the process to be correct, rather than the result, provided that the trainee has a “game plan” for what to do when the results are not as expected. Observation is the most popular form of assessment on NLP trainings. The most important point to re-emphasise about its use is that for observation to be reliable and valid, it needs to be done in a set format and with an awareness of the performance criteria being observed for.
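
Returning to appraisal forms, here is a sketch of how one might be structured for the rapport standard. The items and the three point rating scale are our own invention for illustration; a real form would mirror your own performance criteria:

```python
# An appraisal form as data. The items and the rating scale are
# illustrative assumptions, not a prescribed format.
RATING_SCALE = {0: "not observed", 1: "attempted", 2: "done reliably"}

FORM_ITEMS = [
    "matched the client's posture",
    "matched the client's breathing rate",
    "used the client's sensory predicates",
    "shifted own behaviour and checked for leading",
]

def competent(ratings, pass_level=2):
    """Competent only if every item on the form was done reliably."""
    return all(rating >= pass_level for rating in ratings)

observed = [2, 2, 1, 2]    # one assessor's ratings for one trainee
print(competent(observed)) # False: predicate matching was only attempted
```
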
Taxonomies of Learning

Finally in this overview of assessment, we want to describe some important ways to classify assessment processes. In 1956, Benjamin Bloom (De Landsheere, 1990, p 179-181) developed a taxonomy (classification) of Cognitive Educational Objectives. Bloom pointed out that when we say we want trainees to learn something, we may mean one of six things:

  1. Sometimes we just want to check that a trainee has remembered a fact or theoretical statement (Bloom calls this Knowledge). Eg “List seven types of behaviour which can be matched to achieve rapport.”
  2. Sometimes we want to check that a trainee can see the relationship between facts or theories, describe a fact or theory in a new way, or realise what is implied by a fact or theory (Bloom calls this Comprehension). Eg “When working with a client, what implications does the concept of rapport have for our choice of sensory representational systems?”
  3. Sometimes we want to check that a trainee can apply a fact or theory in a particular situation (Bloom calls this Application). Eg “You notice that your new client disagrees with each of the first five sentences you say to him. He seems to enjoy disagreeing. How would you build rapport with him?”
  4. Sometimes we want to check that a trainee can chunk down on a fact or theory (Bloom calls this Analysis). Eg “What differences and what similarities are there in the use of direct and cross-over matching to achieve rapport?”
  5. Sometimes we want to check that a trainee can chunk up on a fact or theory (Bloom calls this Synthesis). Eg “How does rapport help to build human relationships?”
  6. Sometimes we want to check that a trainee can evaluate a fact or theory based either on its own internal consistency or based on external standards (Bloom calls this Evaluation). Eg “When is rapport desirable and when is it not? Explain in each case why you reach this conclusion.”

Bloom’s model is only a model. However, it does reveal certain interesting anomalies in our teaching of NLP. Often NLP written assessments are Knowledge heavy. Certain subjects are often assessed with only one classification of question (eg the Metamodel is often assessed entirely by Application).
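
One simple audit is to tag each item in a written assessment with its Bloom level and count the tags. A minimal sketch, with an invented item bank:

```python
from collections import Counter

# Each written item tagged with its Bloom level; the tags are invented.
item_levels = ["knowledge", "knowledge", "knowledge", "comprehension",
               "knowledge", "application", "knowledge"]

print(Counter(item_levels))
# Counter({'knowledge': 5, 'comprehension': 1, 'application': 1})
# A "Knowledge heavy" paper: analysis, synthesis and evaluation are never tested.
```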

There are many other taxonomies of learning. One covered by us elsewhere is the 4MAT (Bolstad, 1997), in which learning is thought of in terms of answering the four questions Why?, What?, How?, and What if? V. Gerlach and A. Sullivan developed a taxonomy for use with the learning of active skills, discriminating between being able to “Construct” (producing a product meeting specifications given to them) and to “Demonstrate” (performing a set of behaviours without the specifications, and meeting preset standards). After all, knowing that your trainees can run the NLP phobia process by following their notes is different to them being able to do it without notes, and different again from being able to modify the process in midstream to cope with unexpected or additional factors. D. R. Krathwohl (De Landsheere, 1990, p 185) developed a taxonomy for assessing attitudes. At the lowest level it lists willingness to receive an experience (eg being willing to experience rapport), followed by being willing to respond (eg being willing to create rapport), then valuing (eg considering rapport as useful), organisation of the new value into one’s value system (installing the value of rapport in one’s hierarchy of values for relationships), and characterisation, where the value is integrated into one’s sense of identity (eg experiencing oneself as a person who is good at rapport).

Levels of Learning and Assessment

Another taxonomy is Gregory Bateson’s levels of learning (Dilts, 1996, p 226-241). Bateson described four levels of learning. Bloom’s entire taxonomy fits into his first two levels:

  • Learning I (Stimulus-Response style learning, such as a rat learning by reinforcement to approach a red object and avoid a blue one).
  • Learning II (Learning about the stimulus-response process. Learning to recognise the wider context in which the stimulus-response learning has meaning, such as a rat learning that it’s valuable to check out strange objects because they may be positively reinforcing)
  • Learning III (Learning how to learn. This is something that a rat doesn’t do. But you can, and that’s what this article is about).
  • Learning IV (Learning about who it is that is learning. Spiritual learning).

Bateson describes the difference between Learning I and Learning II thus: “…you can reinforce a rat (positively or negatively) when he investigates a particular strange object, and he will appropriately learn to approach it or avoid it. But the very purpose of exploration is to get information about which objects should be approached or avoided. The discovery that a given object is dangerous is therefore a success in the business of getting information. The success will not discourage the rat from future exploration of other strange objects.”

Discussing Bateson’s work, Robert Dilts emphasises the importance of considering the meta-messages our teaching gives students. If we only assess certain levels of learning, our students will meta-learn that they only need to rote-learn. There are other meta-messages communicated by assessment. In their book Dynamic Learning, Dilts and Epstein discuss the notion of dynamic assessment (1995, p 325-344). They say that if we find that a trainee cannot remember the name of the metamodel patterns, we have only assessed the specific content of her memory. If we at the same time investigate how she does not remember the metamodel patterns, we will enable her to learn how to learn them. Dilts and Epstein (1995, p 332) say “With Dynamic Assessment the way we assess whether somebody is having trouble spelling automatically tells us what to do to improve the spelling…. The whole point is, the way that I tell that the child has a problem and the way that I treat the problem need to be related. The assessment and the intervention have to go together. So that if I assess that the child is having a problem spelling because his or her eyes are going down and to her left, I can tell the child, “Put your eyes up and to the left and you will spell better” –and she will.”

Taking Charge Of Learning

John Stock and colleagues (Stock et alia, 1987, p 201-205) refer to the same dilemma which is mentioned by Dilts and Epstein. They say that the combination of assessment and adjustment of learning is evaluation (which we defined earlier as the judgements we make as a result of assessment). They say “Evaluation is the process of gathering information about what is presently happening and comparing the results of that information-gathering with a view to what should be happening.” They emphasise that another important metamessage about assessment is given by our decision about who is doing the assessing. They suggest that if the trainer takes charge of all evaluation, they are limiting the logical level of learning, by preventing the trainees from learning how to learn! Trainees who are assessed only from outside become increasingly passive and accepting of outside evaluation. They are like rats being taught a metamessage that it is not useful to explore new objects, but only to use the ones the trainers provide. In NLP terms, trainer controlled assessment inhibits the trainee from “being at cause” in their learning.

There are many ways in which assessment can be delivered into the hands of learners. Learners can be given the specific performance criteria for a task before commencing it, and asked to assess how well they met those criteria and what they need to do to adjust afterwards. Learners can be invited to mark each other’s written tests based on objective answers or on performance criteria. Trainees can be placed in groups where each person takes turns being an observer and assessor of the skills taught, and then delivers assessment to another person (thus “going meta” to the level of learning the skill itself). These methods also lead to an integration of the evaluation of learning and the evaluation of the training. John Stock and his colleagues (1987, p 204) say “Self evaluation means that information will be gathered about: The achievement of the trainees; not just knowledge and skills but also attitudes or qualities. The process of the training; the methods, resources and style of the trainer as well as the context of the training, namely the training organisation – and that the information will be gathered by the individuals to whom the information refers.”

Finally, it is important to notice that learning to use rapport skills in a training is not the same as learning to use rapport skills in one’s relationship with clients. Joseph O’Connor and John Seymour (1994, p 202-203) describe four stages of evaluation of organisational training:

  1. Live evaluation of the learning as it is happening, in order to adjust course. This is what we previously termed formative assessment.
  2. End of training evaluation of the learning and the training itself. This is summative assessment.
  3. Transfer evaluation of how well the trainees have taken the skills back to their workplace.
  4. Organisational evaluation of whether the training has actually helped the organisational goals it was set up to meet.

These last two types of evaluation broaden the notion of training again, suggesting that training is only a means to an end. The real question is whether that end has been achieved, and how to achieve it.

The Risk Of Escalating Standards

As NLP Trainers learn about and apply these distinctions in evaluation, the standard of evaluation, we believe, will improve. This is different to escalating the requirements for certification.

When trainers first begin actually evaluating their NLP Practitioners carefully, they frequently discover, with a shock, how little of what they assumed was taught is actually being learned! They then often start thinking about how they could “make sure” that, before these people are released, they learn more. This is a well recognised phenomenon in professional training. In training both teachers and nurses, for example, we have explored with new trainers the shock they face when they realise just how little new graduates in the field actually know! Then, there is the shock they experience when they realise, on thinking back, just how little they themselves actually knew when they first graduated from their own basic course. 

In fact, most of what we know about NLP, we learned after our original Practitioner training. Practitioners do not need to have our level of knowledge, based as it is on the deepening of understanding we got from the Master Practitioner course, from ongoing use of the techniques, from ongoing reading, and from actually working out how to teach it to someone else. Yes; new Practitioners will be “raw”; they will have major misunderstandings in some areas; they will have oversimplified views of what NLP is. This is normal. But the effects of NLP Practitioner training do not end at the moment your graduates leave the training room on the last day. They continue as Practitioners review in their minds and make sense of what has happened. For your course to succeed, your graduates only need to be Practitioners, not Master Practitioners, and not Trainers.  

The attempt by new trainers (in any professional field) to escalate standards results in the gradual lengthening of trainings, or in their becoming overburdened by theory and requirements. From 1955 to 1961 a group of highly respected authorities including Leslie Le Cron, William Kroger MD and Milton Erickson MD organised a travelling teaching group called “Hypnosis Seminars”, teaching a three day course in hypnosis for dentists, doctors and psychologists. Their opinion was that basic hypnosis is easy to learn, though it may take years of practice to become expert. They “did not believe it essential that its students have intensive background study in basic psychology, psychopathology, or psychodynamics. They held that the average physician, psychologist or dentist could, with this minimal three day training, be trusted to use hypnotic techniques in his practice or research in a primarily beneficial way.” (John Watkins, 1965, p 56). These experts understood that basic trainings are basic trainings. Escalating the level of theory taught in a training, or even the amount of practice, does not necessarily improve results. Consider the field of counselling and psychotherapy. Katherine Mair (1992, p 150) reports that “Hattie et al (1984) reviewed forty-three studies in which “professionals”, defined as those who had undergone a formal clinical training in psychology, psychiatry, social work or nursing, were compared with “paraprofessionals”, educated people with no clinical training, for effectiveness in carrying out a variety of psychotherapeutic treatments. They came to the unpalatable conclusion that the paraprofessionals were, on average, rather more effective.”

Summary

We have covered a large selection of distinctions about the process of assessment and evaluation in NLP training. We discussed the usefulness of setting standards and sensory specific performance criteria to check them by. We pointed out that assessment needs to be valid (related to the criteria being assessed) and reliable (repeatable). Assessment, combined with judgements about how to change course in order to meet the desired criteria, is called evaluation. There are a vast number of ways to assess. Those that have one specific correct response (such as multiple choice, true/false, matching, and diagram completion items) are called objective. Those that are open to interpretation (like essay questions, diaries, case studies, simulations and observed exercises) are called subjective.

Whichever type of assessment is being used, it is important to clarify the logical level of learning which is desired, because what is assessed tends to influence what you get. In terms of cognitive learning, for example, learning can be considered at the levels of knowledge, comprehension, application, analysis, synthesis, and evaluation. Learning can be thought of as rote learning (Learning I), meta-learning (Learning II), learning how to learn (Learning III) and spiritual learning (Learning IV), using Bateson’s levels of learning. Assessment can also be designed so that learners themselves participate in it, and so that the very process of assessment gives the directions for increasing success. Finally, learning can be assessed as you go (formative assessment), at the end of training (summative assessment), in transfer to settings outside the training, or in terms of the wider goals that the training was designed to reach. The aim of learning about assessment is not to escalate the standards of training; it is to ensure those standards are being met.

The aim of this essay is not to set down rules about how you should assess. It is to allow you to step a logical level above your assessment and evaluation: to evaluate the evaluator, so that both your teaching and your trainees’ learning is enhanced.

Bibliography:
  • Bolstad, R. “Teaching NLP: How To Be Consciously Unconsciously Skilled” in Anchor Point, Vol 11, No 7, p 3-9 and Vol 11, No 8, p 3-12, July 1997 and August 1997
  • De Landsheere, V. “Taxonomies of Educational Objectives” in Walberg, H.J. and Haertel, G.D. ed The International Encyclopedia of Educational Evaluation, Pergamon, Oxford, 1990
  • Dilts, R. B. and Epstein, T.A. Dynamic Learning, Meta Publications, Capitola, California, 1995
  • Dilts, R. B. Visionary Leadership Skills, Meta Publications, Capitola, California, 1996
  • Mair, K. “The Myth of Therapist Expertise” in Dryden, W. and Feltham, C. ed Psychotherapy and its Discontents, Open University Press, Buckingham, 1992
  • Myers, D. G. Social Psychology, McGraw-Hill, Auckland, 1983
  • O’Connor, J. and Seymour, J. Training with NLP, Thorsons, London, 1994
  • Stock, J., Macleod, J., Holland, J., Davies, A., Jennings, P., Cross, M., Richards, J., and Garbett, J. Training Technology Programme: Volume 6: Assessment and Evaluation in Training, Parthenon, Carnforth, England, 1987
  • Walklin, L. The Assessment of Performance and Competence, Stanley Thornes Ltd, London, 1991
  • Watkins, J. “Hypnosis in the United States” in Marcuse, F.L. ed Hypnosis Throughout The World, Thomas, Springfield, 1965

Richard Bolstad is an NLP Trainer and can be reached at learn@transformations.org.nz