Peer reviewer incentives and anonymity

Last time, I sketched a model of the peer review process as an extensive-form game. The model described the review process as a noisy measurement: the paper has some quality, and the review measures that quality plus some bias and some variance. With greater effort, the review’s variance can be lowered. The game I described was one-shot, about a single paper going through the process.
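
For concreteness, here is a minimal sketch of that measurement model in Python. The numbers and the way effort shrinks the noise are my own illustrative stand-ins, not the specification from the earlier post.

```python
import numpy as np

def referee_report(quality, bias, effort, rng):
    """One review: the paper's true quality, plus the referee's bias, plus
    noise whose spread shrinks as effort rises. The 1/(1 + effort) form is
    a stand-in chosen for illustration."""
    noise_sd = 1.0 / (1.0 + effort)
    return quality + bias + rng.normal(0.0, noise_sd)

rng = np.random.default_rng(0)
print(referee_report(quality=0.7, bias=0.1, effort=0.0, rng=rng))  # cursory skim
print(referee_report(quality=0.7, bias=0.1, effort=5.0, rng=rng))  # careful read
```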

I didn’t describe the reviewer’s incentive to exert effort to carefully evaluate a paper, because within the one-shot game, there is none. To get the referee to exert nonzero effort, there has to be another step inserted into the game:

  • Based on the referee’s observed effort level, the editor, author, or reader rewards or punishes the referee.

This post will discuss some of the possible ways to implement this step.
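
Before getting to those options, a toy version of the referee’s decision makes the point concrete. The linear reward and quadratic cost below are my own illustrative assumptions, not the payoffs from the original model; the point is simply that when no reward is attached to observed effort, the payoff-maximizing effort is zero.

```python
import numpy as np

def referee_payoff(effort, reward_rate):
    """Toy payoff: a linear reward for observed effort minus a convex cost.
    Both functional forms are illustrative assumptions."""
    return reward_rate * effort - effort ** 2

efforts = np.linspace(0.0, 2.0, 201)
for reward_rate in (0.0, 1.0):  # 0.0 reproduces the original one-shot game
    best = efforts[np.argmax(referee_payoff(efforts, reward_rate))]
    print(f"reward rate {reward_rate}: payoff-maximizing effort = {best:.2f}")
```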

My big conclusion is that anonymity in peer review is more of a barrier than a help. Having reviewers sign their names opens up the possibility of publishing the reviews, which turns a peer reviewer into a public discussant of the paper, and turns the review itself into a petit publication. Journals in the 1900s couldn’t do this because of space limitations, but in a world where online appendices are plentiful, this can be a good way to reward reviewers for putting real effort into helping readers and editors understand and evaluate the paper.

Quality

The problem is not getting the referee to agree to review the paper and deliver an accept-or-reject decision; in fact, the game I wrote up last time assumes that the referee agrees to review every paper the editor sends her. Rather, the problem is getting the referee to exert as high a level of reviewing effort as possible.

It’s easy to do a cheap review by checking down a list of litmus tests. Check the literature: did the author cite me? Check the intro: is the paper in my preferred paradigm? Check the methods: are my methodological pet peeves addressed? A referee could do this in ten minutes, sound reasonably intelligent, and do a disservice to both author and editor. What induces the referee to go beyond skimming for her own name and really engage with the author?

Goodwill

Goodwill is generally sustainable only in a small group. Social norms can maintain order and elicit effort in a group of five people, but depending on the same norms to extract effort from a group of a thousand is a recipe for eventual failure. [A proper proof of this would involve a repeated Prisoner’s Dilemma and checking the conditions for trembling-hand equilibria. There are probably a dozen other ways to formally show that goodwill doesn’t scale.]

The peer review system originated in the 1600s, with the Royal Society of London, an organization that started with about forty dues-paying members. The number of academics in the present day is… significantly greater than forty, and the referee and author can often expect that they will never meet in person. Be sure to include in that set of potential referees disgruntled graduate students who are on their way to never becoming serious academics and industry members who are knowledgeable but on the periphery of the academic community.

Academia—even any one field—is now huge. For every incentive below, we have to ask: does it scale? Goodwill simply does not.

Cash

Some journals offer around $100 to referees. You’ll be hard-pressed to find anybody who seriously cares. For a professor who could read a paper quickly, $100 is roundoff error relative to her salary; for a grad student, $100 may not cover the full day(s) it will take to learn everything needed. There have to be people out there for whom $100 is a real incentive to do a review, but the question of whether a mechanism scales is not whether such people exist, but whether they are so common that we could build the core of academia around them.

For a flat fee to induce high effort (not just enough effort to score the $100 and run), we would need a marketplace in which reviewers expect higher effort now to bring more of these fixed payments in the future. This works for plumbers, because there is a marketplace of plumbers competing for a steady stream of jobs. I don’t think there exists a comparable corps of postdocs who compete to earn a salary from $100 paper reviews (and who thus have no time to do any original research…).

Interest in the paper

Another rationale is that we all want to stay abreast of our field, so when an editor asks us to read a new paper we’ll be eager for the insider view of cutting-edge work. This first assumes that the editor found the right reviewer (which is often not the case; I’ve been on both ends of that mismatch).

Given the set of the 99 other papers that came out this month and the paper the editor asked me to review, the odds are roughly one in a hundred that my top priority will be the paper sent by the editor. That is, the story that referees are motivated by interest in the paper still depends on having some other motivation, like goodwill, pushing this random paper to the top of the stack.

And, once again, we seek high-effort reviews. When a referee reads the first two pages and decides that the paper is not really worth her time, we need some other force pushing her to finish the paper and articulate in a reasonable way why the paper was not interesting to her, who might be interested in it, and how it could be better.

Reputation

The editor knows who the reviewer is, and can punish her for a bad review. How would the editor do so?

One way to punish a bad referee report relies on the fact that the reviewer will one day also be an author, and the editor could be more stringent when reading submissions from a bad reviewer. With a bit of editor-to-editor gossip, it may not even matter whether the was-reviewer/now-author is submitting to the same journal or to another in the same field.

This is a realistic means of developing reasonable incentives for reviewers. One could even argue that it scales. It is also insidious and unethical. The purpose of the peer review system is to evaluate a paper based only on its merits as a contribution to the literature, not the personal qualities of the author. Obscure authors should have the same chance the superstars have, and uncooperative jerks deserve the same chance the agreeable have.

The uncooperative can also be punished in professional societies or even hiring decisions. A smart editor, then, will select reviewers that are active in the same societies or departments that the editor is active in, so that the editor can punish or reward. The implication here is that the reviewers are more likely to be active insiders to the field, which can bring with it a host of other incentive problems: if we pick a reviewer because he has a position to protect, why wouldn’t the reviewer then be more likely to reject any paper that threatens that position? Because the referee is not anonymous to the editor and they are in the same network, would the referee be inclined to suppress opinions the editor might disagree with?

Regardless of anonymity, a referee who is part of a network of connected authors has every incentive to promote the works of in-network authors and recommend that the work of out-of-network authors be discarded. That is, the reviewers most vulnerable to punishment via their established networks may also be the ones most biased toward what the field looked like a decade ago and most resistant to new ideas.

Anonymity

The reviewer is allegedly anonymous, preventing retaliation from the author for a negative review or post-review favors for a positive review. That is, the argument for anonymous reviews is that onymous reviews will be biased upward.

First, this is not truly a problem. If your bathroom scale is always off by two kilos, you just have to remember to always subtract two from the reading. The people who read applications with recommendation letters are aware that 100% of recommendation letters are positive, and read accordingly.

Because the framework from last time was a simple Bayesian aggregation of signals from Normal distributions, we can make a few formal statements about referee bias.

Theorem: If the mean of the added error term in referee reports is a known, fixed amount, then the final results are equivalent to those one would get from an error term with mean zero.

Theorem: The same holds if the mean of the added error term is a fixed, known percentage of the true paper utility (e.g., the referee always bumps up the score by 10%).

Theorem: Let there be a cutoff K, such that if a referee evaluates a paper to have utility below K, the referee reports K. Assume that K is below the editor’s cutoff for paper acceptance. Then the editor’s decisions are equivalent to those in the game with no cutoff.
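
A quick simulation makes the first theorem concrete. The quality distribution, noise level, bias, and acceptance cutoff below are illustrative numbers of my own, not parameters from the model; the point is that an editor who knows the bias can simply subtract it and end up making the same decisions she would make in a zero-bias world.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true_quality = rng.normal(0.0, 1.0, n)   # papers' true utilities (assumed scale)
noise = rng.normal(0.0, 0.5, n)          # referee measurement error
known_bias = 2.0                         # referee always adds two points
accept_cutoff = 1.0                      # editor's acceptance threshold

biased_reports = true_quality + known_bias + noise
unbiased_reports = true_quality + noise

# An editor who knows the bias subtracts it before deciding.
decisions_debiased = (biased_reports - known_bias) > accept_cutoff
decisions_unbiased = unbiased_reports > accept_cutoff

agreement = np.mean(decisions_debiased == decisions_unbiased)
print(f"decisions agree on {agreement:.1%} of papers")  # expect 100.0%
```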

Bias becomes a problem when it is inconsistent and unpredictable, and it is not at all clear that reviews are biased any more consistently when anonymous than when onymous.

However, anonymity creates problems. The first well-known and well-documented problem, the GIF (Greater Internet F*ckwad) Theory, is expressed in equation form as

(normal person) + (anonymity) + (audience) = (total f*ckwad).

I and every single academic I have ever talked to about the matter have received at least one rabid report from a referee who clearly did not pay serious attention to the submitted paper and who neatly demonstrated the GIF theory. Those referees might have been partisans of a competing paradigm, disinclined to exert the real effort required to read an academic paper, or just naturally belligerent. It is costly for editors to throw out these reviews and start the process over with a new referee, so they are generally disinclined to do so (though I did once have an editor throw out a super-negative report for being clearly biased).

Second, anonymity throws out many of the reputational incentives that could have been brought to bear on the central problem of devising incentives for the referee to exert costly effort.

I open all of my reports with the sentence “My name is Ben Klemens”, and then give a sentence or two explaining my experience and why I think I am a reasonable choice for a referee. I write a lot more rejections than acceptances, and having my name be the first thing the author reads in my report means that I can’t be dismissive and can’t write a review based on a cursory reading. My rule for all Internet correspondence is that I don’t say anything to somebody via a keyboard that I wouldn’t say directly to his or her face, and the same should hold for a referee report. For me, forcing myself to not be anonymous largely solves the problem of making sure that I exert due diligence in reviewing the paper.

One could throw the doors open even wider by publishing the reviews as a supplement to the paper. We already have a model for this, in the form of discussants at conference panels. The bias story above still holds: discussants tend to bias their remarks toward politeness, but we in the audience can usually pick up subtle (or not-so-subtle) signals that the discussant is not happy with a paper.

For those of you who go out to see live pop music, you have another model for this: the opening act.

All of academia runs on publication counts, so for any author another publication is a good. Onymous publication of the referee reports thus gives referees a solid incentive to put real effort into the review. It forces referees to bias their writing toward the civil, which is irrelevant in the model and a good thing for us as human beings in a society. It also forces referees to be forthright about major problems, because if readers pick up on a problem that the referee/discussant ignored, it reflects badly on the discussant.

Most papers are rejected in the end, so most referee reports would never be published under a discussant model. However, if there is some possibility that the paper will be published, then there is some possibility that the discussant’s review will be published, and the discussant therefore has some incentive to still exert effort in writing a review.

Publication of the discussion gives much more information to the reader. If the paper was published only after a cursory check that the figures are formatted correctly, or on the strength of a glowing review by the author’s department chair, that is a flag for the reader to approach the paper with healthy skepticism. High-effort reviews are a real signal of quality, and help the reader get more out of the paper.

The recommendation to publish reviews as discussions alongside the main paper does not mean that we are opening up papers to blog-style comments from anybody with a keyboard and a grudge. The editor selected the reviewers, and could ostensibly edit the reviews the way he edits the main paper. In fact, a typical referee report, with its list of minor errors to be fixed and other comments of interest only to the author and editor, would probably be more useful to readers after a pre-publication trim by the editor.

Under this scheme, the referee has a stronger incentive to over-endorse borderline papers, because the referee scores a publication iff the main paper gets published. However, this can only go so far: if there are clear errors in the paper that the referee missed, the referee will look bad for being so positive about a bad paper.

Also, referees have a responsibility to help the authors of papers that have zero chance of publication to understand what they are missing and how they can make their next paper better. Because referee reports for such papers have zero chance of publication, we are again relying on goodwill to get the referee to be helpful.

As a final benefit, if reviews are published onymously, we can expect that the number of rejections of the form “this paper failed to cite me” will plummet.

Conclusion

The refereeing and editing process does add value for readers. A high quality bar means that readers can exert less effort in reading a paper, trusting that fatal errors and useless results have already been through a first weed-out. But the process wherein referees send detailed reports to the editor, and the editor then boils them down to an accept-or-reject decision, loses a great deal of information. If readers have more of the information from the referee, they can evaluate the paper still more efficiently.

The central problem of designing the peer review process is in working out incentives for referees to exert effort. The reasons referees have to exert effort are typically loose, involving vague hints that punishment is possible, or ethical arguments that good review is the right thing to do. As academia has blossomed into a worldwide, million-person endeavor, we’ve watched these loose incentives, and thus the peer review process itself, fail to scale. Making the process anonymous solves a bias problem that doesn’t exist, but meets one of the conditions of the GIF theory and blocks avenues for creating incentives for reviewer effort. Onymously publishing the reviews as supplements to the paper can give referees an incentive to exert real effort in the review process, and makes the process more efficient for readers.
