We Have New Teacher Evaluations. Are We Rating Teachers Any Differently?

The short answer is: not really.  In the past five years, states have invested time and money in improving the quality of their teacher evaluation systems.  Many states were moved to do this work by Race to the Top or to qualify for federal waivers from No Child Left Behind.  The goal was to design and implement evaluations that were more effective at identifying low-performing teachers (given the conclusion that high-quality teachers make a difference).  Identifying low-performing teachers was essential either to provide them additional support so they could improve or to replace them.  However, Kraft and Gilmour found that our new evaluation systems are much better at discriminating between "great" and "good" teachers (the top end of the evaluation scale), but do not identify more low-performing teachers. 

Kraft and Gilmour interviewed principals in one large district in the Northeast to find out why principals tend not to rate teachers low (even when they report that they have many low performers in their buildings).  The reasons include:

  • Fear that the replacement might be worse.
  • Belief that a low rating might demoralize a teacher, who may then no longer want to improve.
  • Amount of work (particularly time constraints) it takes to work with a teacher "In Need of Improvement".
  • Personal discomfort giving a low rating.

As the authors note, failure to give low ratings is a product of "conscious choices by evaluators as they navigate implementation challenges, competing interests, unintended consequences, and perverse incentives."  In other words, the evaluation (e.g., rubric, rating system) might be effective at identifying low-performing teachers if applied dispassionately.  However, evaluations are not applied dispassionately; they are applied by leaders who are working hard to ensure that schools meet student needs.  We should be careful not to depend too heavily on teacher evaluations as a tool to improve teaching and learning.  

Dashboard vs. Accountability System (What's the Difference)

I noticed a couple of months ago on Twitter a few retweets about an article in The Standard, the journal of the National Association of State Boards of Education (NASBE), discussing dashboards.  Robert Rothman tweeted a link to an article titled "Accountability for What Matters".  I was definitely intrigued, so I read the article.  I sent a quick tweet back at Mr. Rothman saying I thought the article was interesting but seemed to conflate the meanings of accountability systems, performance management, and dashboards.  He didn't respond, so I thought I would clarify what I mean in a blog entry. 

First, let me unpack what Rothman means by "dashboard."  Rothman writes:

With their newfound authority under the Every Student Succeeds Act (ESSA), states will be using a broader range of indicators of school performance and displaying them in ways that give school communities, parents, and district and state officials a clearer picture of how a school actually is doing.

These new systems, often called “data dashboards,” function the way a car’s dashboard does—by displaying multiple measures that affect how a school is performing.

Rothman uses the term "dashboard" to refer both to a broader set of indicators beyond student assessment results (e.g., climate information) and to a visual display of those data.  Rothman does not distinguish between new accountability systems and their display (aka dashboards).  In fact, under the section "Why Dashboards?" Rothman says that "current accountability systems" have problems and "dashboards can alleviate some of the problems."  By contrasting "current accountability systems" with "dashboards," Rothman has collapsed two separate concepts into one idea.  This causes confusion.

So, what do these terms mean?  First, as Morgan Polikoff clearly lays out (Educational Researcher, 2013), there are two primary streams of thinking that support accountability in education.  The first is that the accountability system creates incentives that focus attention on behaviors that matter and result in student outcomes.  The second is the educational consumer approach, which broadly states that better information about school performance means better choices by consumers (parents and children).  Thus, an accountability system is designed to facilitate accountability to someone, an organization, or a group.  It includes measures that allow decision making (e.g., identifying a priority school or a school in need of improvement).  During the NCLB period the primary accountability system deployed by state agencies was AYP, which mostly depended on performance on state tests (in some cases growth was included).  Some states had additional accountability systems that included measures beyond test scores.  For example, Colorado included graduation and dropout rates in its School Performance Framework (SPF). 

Second, according to Stephen Few, dashboards are a "visual display of the most important information needed to achieve one or more objectives which fits entirely on a single computer screen so it can be monitored at a glance."  Few's definition includes four features of dashboards: (1) visual displays, (2) most important information needed to achieve an objective, (3) fit on a single screen, and (4) monitored at a glance. 

In short, "dashboards" (or "data dashboards") will not solve our accountability problem.  Dashboards are only a visual display.  We must first design new accountability systems that meet the criteria set out by Morgan Polikoff.  We might design a dashboard to display accountability information, but this is a separate effort after the new accountability system has been designed. 

What Rothman may be trying to get at is that we need to improve our performance management focus.  I described performance management in more detail in a previous blog post.  In short, what Rothman might be arguing for is identifying leading and lagging indicators and developing technology that allows for ongoing monitoring of these indicators.  Unfortunately, mixing terms that have specific meanings only confuses the reader.  

Hard Work, Not Hype

One of the few blogs that I follow on a regular basis is Stephen Few's Perceptual Edge.  Stephen Few is the author of several books on designing effective dashboards and is widely regarded as one of the most thoughtful and thought-provoking authors in the data visualization sector.  If you are remotely interested in data visualization you already know of Few and probably follow his work. 

However, what I want to reference today is not Few's work on visualization, but a poignant post that Few published in early January this year.  In this post, "There Are No Shortcuts to a Bright Future", Few laments that we are too enamored of magical solutions and that we must "get down to the hard work of real problem solving."  Few is disappointed that hype has replaced hard work and that the drive for recognition has overcome the satisfaction of a job well done. 

For the last few years we have seen this same ethos become dominant in education.  Schools, policy-makers, and entrepreneurs are increasingly selling new and "innovative" technologies or teaching environments.  They are seeking and winning awards for "innovation".  However, as Saul Kaplan, author of Business Model Innovation Factory, wrote, "it is not innovation until value is delivered."  It feels like "value" in education is increasingly about buzzwords and hype, not about whether kids leave better prepared for college or whether fewer students drop out.  

Few's post reminded me that there are no shortcuts to a better future.  Whether we are targeting Marginal Gains or Transformation, we must invest time and energy.  Our success should be judged by results, not hype.  

CCSS Assessments Are Good

It is probably still an open question, but on February 11, 2016, the Fordham Institute (an education policy think tank) released a report on next generation assessments (i.e., ACT Aspire, MCAS, SBAC, and PARCC) arguing that they meet the criteria for quality assessments.  The Fordham Institute contracted with two principal investigators to answer the following questions:

  • Do these tests reflect strong content?
  • Are they rigorous?
  • What are their strengths and areas for improvement?

Using the CCSSO Criteria for Procuring and Evaluating High-Quality Assessments, the investigators evaluated the summative assessments and concluded:

  • Overall, PARCC and Smarter Balanced assessments had the strongest matches to the CCSSO Criteria.
  • ACT Aspire and MCAS both did well regarding the quality of their items and the depth of knowledge they assessed.
  • Still, panelists found that ACT Aspire and MCAS did not adequately assess—or may not assess at all—some of the priority content reflected in the Common Core standards in both ELA/Literacy and mathematics.

The investigators argue that these next generation assessments are a significant improvement over previous assessments:

For too many years, state assessments have generally focused on low-level skills and have given parents and the public false signals about students’ readiness for postsecondary education and the workforce. They often weren’t very helpful to educators or policymakers either. States’ adoption of college and career readiness standards has been a bold step in the right direction. Using high-quality assessments of these standards will require courage: these tests are tougher, sometimes cost more, and require more testing time than the previous generation of state tests. Will states be willing to make the tradeoffs? (Executive Summary p. 24).

These assessments meet the criteria set forth by CCSSO.  According to the report there are improvements that could be made, but, as the Fordham Institute press release notes, these are the "kind of tests that many teachers have asked state officials to build for years." 

I think this is a good first step to ensure that the conversation about quality assessments is ongoing. 

Over at the Nonpartisan Education Blog there is an article by Richard Phelps sharply criticizing the Fordham report as lacking research rigor and as biased because Fordham has advocated for the Common Core.  The article is worth reading, but keep in mind that Phelps has partnered with opponents of the Common Core on several articles, including some of the standards' most vocal critics, Sandra Stotsky and James Milgram.  In addition, Phelps has co-authored a paper arguing that PARCC actually stunts student growth.  So, Phelps is not without his own bias.  

What do teachers think of your evaluation system?

Across the country, school districts are implementing new teacher evaluation systems intended to increase the frequency and quality of the feedback that teachers receive.  The belief is that with more observations and better-quality feedback, teachers will improve.  In fact, the amount of effort schools invest in developing and implementing new teacher evaluations has been immense.  One study found that principals spend over 5% of their time on classroom observations, plus much additional time writing evaluations and meeting with teachers.  But how do we know if teachers value our feedback or where we might improve? 

Despite the heavy investment in new evaluations, most schools are not evaluating whether teachers value the feedback they receive or what makes them more likely to value certain feedback.  In November 2015 the Regional Education Lab at Marzano Research released the Examining Evaluator Feedback Survey.  This survey is designed to collect teachers' perceptions of the feedback they receive from their primary evaluator.  The survey was developed using an iterative process that included researchers and practitioners involved in the Educator Effectiveness Research Alliance.  The survey and supporting documentation are available here.

I highly recommend that schools and districts use this survey as part of their plan to evaluate the quality of their teacher evaluation system.  To support systems that use this survey, I have developed some dashboards that make it much easier to organize and analyze the data.  Below is a screenshot.  Check out all the dashboards here.

I'd love your thoughts on how this could be improved.  

Contact me if you are interested in using this survey and these dashboards to collect feedback from your teachers.  

What Makes a Good Accountability System?

As we enter the new era of the Every Student Succeeds Act (ESSA), states have greater flexibility in designing new accountability systems.  This blog entry considers the characteristics of a quality accountability system.  We should keep these characteristics in mind as we develop new systems and ensure that our state education departments are disciplined in their development of new, more useful accountability systems. 

In a 2013 article in Educational Researcher, Morgan Polikoff and colleagues stated that there are two broad theories that support accountability in education: (1) that incentives in the accountability system will direct schools toward behaviors that improve student outcomes (principal-agent theory) and (2) that accountability information helps consumers (parents and children) make better choices (experiential goods).  Polikoff and colleagues argued that any accountability system that is going to achieve these goals needs to meet four criteria: (1) construct validity, (2) reliability, (3) fairness, and (4) transparency. 

Construct validity means that the performance measures used (mostly test scores in the NCLB age) adequately cover the desired student outcomes and that the inferences made on the basis of those measures are appropriate.  Of course, under NCLB there were numerous problems with construct validity because proficiency levels (the primary measure used) did not estimate the contribution of the school and teacher, and growth was difficult to measure (even though it was considered important).  In our new accountability systems we need to stay disciplined in our selection of measures to ensure that they align with our desired student outcomes and that we can make appropriate inferences (e.g., that the school contributed to the outcome).  Our desired outcomes may go beyond reading and math (as under NCLB) and include other skills or competencies.  The accountability system should be designed to have construct validity relative to those desired outcomes. 

Reliability refers to the consistency of classification.  Accountability systems should be reasonably reliable in how they classify schools.

Fairness refers to whether classifications are primarily due to factors beyond the control of the school.  In short, this means that the system should not unfairly identify schools based on demographics.  The accountability system in New Hampshire's waiver unfairly penalizes schools and districts that are large or have high proportions of special education students.  What ends up happening is that some schools have special education students counted multiple times.  In fact, in the calculation of Focus Schools (those with the largest gaps), most had special education students counted twice.  On the other hand, most of the schools with the smallest gaps (deemed high performing) did not have special education students counted at all.   

Lastly, transparency is whether the goal setting is clear and the performance measures are easily understood.  States must make clear how performance measures are being used to classify schools and make the data available to the public for inspection. 

As states begin developing new accountability systems, we need to make sure these systems meet these minimum characteristics.  Stakeholders, including parents, educators, and the public, need to hold state departments of education accountable for developing new systems that meet these characteristics and result in improved student outcomes and better information for consumers.

This Program is Evidence-Based. Then Why Doesn't It Work?

This program works.  I guarantee it.  Maybe.  It depends, of course.

A recent post from Lisbeth Schorr urges a change to our understanding of what counts as “evidence”.  Schorr writes that for too long what has counted as “evidence-based” are programs that have been tested in randomized control trials and shown to have a positive impact.  Schorr argues that, while this type of research is valuable, it also leads to a sort of "silver bullet syndrome" in which we spend our time trying to find the perfect program.  Interventions are likely to get differing results under differing conditions, and when we commit too much attention to whether a program is "proven" in another context we lose sight of the need to monitor whether it is working in our local context.  Schorr concludes that a focus on programs that “actually work” (or are "evidence-based") is keeping us from getting better results. 

Instead of focusing on finding the "silver bullet" solution, more energy should be devoted to approaches that can expand our understanding of how interventions behave in the local context and how we can improve results.  Changing our concept of “evidence” is not a matter of lowering expectations for “proof”, but rather of recognizing that our job is to understand local complexities and achieve positive results for students.  A process that develops improved “practice-based evidence” is what we need in education. 

Schorr argues that a focus on the behavior of interventions in the local context encourages greater innovation by acknowledging that local conditions affect the results we achieve.  Instead of assuming that interventions that worked elsewhere will work locally, we should be running rapid tests of innovations to determine which variables impact success.  One way we can do this is by using improvement science and implementing networked improvement communities (NICs).  NICs, initially conceived in the 1960s, have begun to spring up in education in recent years, mostly in connection with the work of the Carnegie Foundation for the Advancement of Teaching in Stanford, CA.  Carnegie's six core principles guide the work of NICs:  

  1. Make the work problem-specific and user-centered.
  2. Variation in performance is the core problem to address.
  3. See the system that produces the current outcomes. 
  4. We cannot improve at scale what we cannot measure. 
  5. Anchor practice improvement in disciplined inquiry.
  6. Accelerate improvements through NICs.

As we change our concept of "evidence" and acknowledge that local conditions matter, we must also adopt a disciplined approach to using that "evidence".  Implementing Carnegie's core principles (or similar approaches from analogous organizations like the Institute for Healthcare Improvement) is one path to improving the performance of our systems and improving outcomes for kids.  

The One Percent Solution

"A culture of discipline is not a principle of business; it is a principle of greatness."
Jim Collins in Good to Great and the Social Sectors

Dave Brailsford is an enormously successful professional cycling and UK Olympic cycling coach (I owe James Clear for exposing me to this story - http://jamesclear.com/marginal-gains).  In 2010 Brailsford was selected to be the general manager of the new British-based professional cycling team, Sky.  Brailsford started his tenure at the top of Sky by making a daring claim: that the team would win the Tour de France with a British rider within five years, without doping.  This declaration was remarkable because Sky had no experience competing in the grueling and notoriously punishing grand tours, no British rider had ever won the Tour de France, and illegal doping seemed a given if a team planned to win on the biggest stage.  Despite the odds being stacked against them, Sky went on to win the Tour de France three times in the next five years (2012, 2013, and 2015) with two different British riders.

How did Brailsford manage such a remarkable accomplishment? 

Brailsford built a winning team around the idea that you could achieve optimal performance by focusing on the “little things”.  Brailsford coined the phrase “aggregation of marginal gains” to describe this approach to winning.  He said, “You can achieve optimal performance by the aggregation of marginal gains.  It means finding the one percent improvement in everything you do.”  These little improvements of one percent mean nothing by themselves, but when added together over time they can make a massive difference.  The chart below shows the idea (chart from - http://www.problogger.net/archives/2014/07/18/10-ways-to-exponentially-grow-your-traffic-in-30-days-2/).
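The compounding arithmetic behind "marginal gains" is easy to sketch yourself.  Here is my own stylized illustration (not Brailsford's actual numbers): a 1% improvement compounded once per day for a year, versus a 1% daily decline.

```python
# A stylized sketch of "aggregation of marginal gains" (my illustration,
# not Brailsford's math): compound a small daily change for a full year.
def compound(rate, periods):
    """Return the cumulative multiplier after `periods` changes of `rate`."""
    return (1 + rate) ** periods

better = compound(0.01, 365)   # get 1% better every day for a year
worse = compound(-0.01, 365)   # get 1% worse every day for a year

print(f"1% better every day for a year: {better:.1f}x")  # roughly 37.8x
print(f"1% worse every day for a year:  {worse:.2f}x")   # roughly 0.03x
```

The asymmetry is the point: tiny improvements are invisible day to day, but the multiplier grows exponentially rather than linearly.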


When he said “everything you do,” he meant it.  For example, the first thing Team Sky does is teach riders how to wash their hands.  It sounds silly, but Brailsford and his team started talking to surgeons about the best way to keep riders healthy, and the doctors kept arguing that good hand hygiene was key to good health.  A rider who does not get sick is a rider who is able to train.  And Brailsford believed those additional training days would “aggregate over time”.

So, what does this mean for us in education?  The idea I want you to connect to is that if we make small improvements in all areas of the system, they can have massive "aggregated gains".  Brailsford is fond of saying that focusing on the small stuff that others overlook is what sets his team apart. 

So, if we adopted Brailsford's approach what should we improve by one percent?  Take a minute and think about everything you could improve in your office, school, or classroom by one percent.  Make a list on a piece of scratch paper.  I asked a group of school district leaders to do this recently and in only a few minutes we created a list that spanned three pieces of chart paper.  Here is a partial list:

  • Announcements
  • Lining up
  • Assemblies
  • Relationships
  • Organization of office
  • Asking the right question
  • Unemployment training
  • Washing hands
  • Better toilet paper
  • Transitions in classroom
  • Collaboration
  • Policy communication
  • Parent communication
  • Providing feedback
  • Use of data
  • Tardiness
  • Data accessibility
  • Meeting roles
  • Including more stakeholders
  • Discipline
  • Repetitive work (duplication)
  • Hiring
  • Budgeting
  • Calendars
  • Use of email
  • Support of assistants
  • Leadership

Some of these suggestions are small (e.g., announcements and lining up), while others are huge (e.g., hiring).  However, it is reasonable to believe that improving even small processes could have a huge benefit.  Consider for a moment the potential gains from improving how students line up.  A couple of years ago, as part of some work I was doing in a district, we did a time study in a number of classrooms.  The observations showed that nearly 30% of classroom time was spent in "transitions."  Added up over a year, that could be 300-400 hours.  Imagine that we improved the process of lining up and saved 1% of the time currently spent in transitions.  Over a year, that time saved would add up to 3-4 more hours of teaching and learning.  
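The back-of-the-envelope arithmetic works out as follows.  This is a sketch using the round numbers above; the 180-day school year and 6 instructional hours per day are figures I am assuming for illustration:

```python
# Back-of-the-envelope sketch of the transition-time arithmetic.
# The 180-day year and 6 instructional hours/day are assumed round numbers.
school_days = 180
hours_per_day = 6
transition_share = 0.30   # ~30% of classroom time observed in transitions

annual_hours = school_days * hours_per_day          # 1080 hours of class time
transition_hours = annual_hours * transition_share  # ~324 hours in transitions
recovered = transition_hours * 0.01                 # a 1% improvement

print(f"Time in transitions per year: {transition_hours:.0f} hours")
print(f"Recovered by a 1% improvement: {recovered:.1f} hours")
```

With these assumptions, transitions consume roughly 324 hours a year, squarely in the 300-400 hour range, and a mere 1% improvement returns about 3 hours of instruction.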

What do you think you could improve just a little?  Try it out.  Send me a note and tell me how it goes!

Know Your Learning Target

The March 2011 Educational Leadership has an article by Moss, Brookhart, and Long titled “Know Your Learning Target” that argues that students who know their learning goal are “empowered, self-regulating, motivated, and intentional learners.”  With a high quality learning target a student should be able to answer the following questions:

  1. What will I be able to do when I’ve finished the lesson?
  2. What idea, topic, or subject is important for me to learn and understand so that I can do this?
  3. How will I show that I can do this, and how well will I have to do it?

The reality is that “knowing your learning target” can do more than “create empowered, self-regulating, motivated, and intentional learners” when merged with specific goal-setting behaviors known to have an impact on student achievement.  Hattie (2009) found that a number of meta-analyses showed a strong positive impact from goal setting (d = 0.56).  Hattie (2009) cited Locke and Latham’s (1990) seminal book “A Theory of Goal Setting and Task Performance” in arguing that a goal must include:

  1. Clarity (see also Martin 2006)
  2. Challenge (see also Martin 2006)
  3. Commitment (see also Klein, Wesson, Hollenbeck, and Alge 1999)
  4. Feedback

In other words, if the learning targets that students are seeking to achieve are clear, challenging, and include a commitment and cycles of feedback the student is likely to learn even more. 

Klein, Wesson, Hollenbeck, & Alge (1999).  Goal commitment and the goal-setting process: Conceptual clarification and empirical synthesis.  Journal of Applied Psychology, 84(6), 885-896.

Martin, A.J. (2006).  Personal bests (PBs): A proposed multidimensional model and empirical analysis.  British Journal of Educational Psychology, 76, 803-825.