A model of peer review

Science, a peer-reviewed journal, recently published an article lambasting the quality of peer review in many journals that are not Science. The article described itself as “the first global snapshot of peer review across the open-access scientific enterprise”, and found peer review in that context to be lacking.

As one who leans toward the theoretical and the methodological, I naturally wonder what model underlies the claim that “peer review across the open-access scientific enterprise” would be of low quality. My understanding is that “open-access” is defined to include any journal that does not charge subscription fees, but instead allows readers free access via the Web. So we need some sort of model that explains why the lack of reader fees would lead to a consistently lower quality of referee effort.

Generally speaking, the discussion about scientific peer review tends to be…lacking in scientific rigor. Those who have written on the matter, including some involved in open access journals, all seem to agree that a claim that open access would induce lower referee effort makes little sense. It is basically impossible to write such a claim down as a real model.

So in this and the next column, I attempt to fill the gap and provide a theoretical framework for describing a one-paper peer review process. I get halfway: I stop short of the general-equilibrium model covering the entire publication market. I also don’t specify the cost functions that one would need to complete the model, because they wouldn’t make sense in a partial equilibrium model (i.e., there’s no point in a specific functional form for the cost function without other goods and a budget constraint).

Nonetheless, we already run into problems with this simple model. The central enigma is this: what incentive does a referee have to exert real effort in reviewing a paper?

After the break, I will give many more details of the game, but here are the headline implications of the partial model so far, which don’t yet address the central enigma:

  • The top-tier journal does not necessarily have the best papers. This is because the lower-tier journals have papers that have gone through more extensive review.
  • More reviews can raise reader confidence that a paper is good. However, the paper is published after only a handful of reviews. Stepping out of the game, situations where dozens or hundreds read the paper before publication would do much to diminish both false positives and false negatives in the publication decision.
  • Readers are more likely to read journals that maintain a high standard.
  • Readers are also more likely to read journals where the referee exerted a high level of effort in reviewing the papers, and can also read those papers with less effort. The problem of trusting a false paper is mitigated, because careful reviews produce fewer false positives. However, referee effort is not observable.

All of this is still under the assumption that referees have an incentive to put real effort into the review process, an assumption I’ll discuss further next time.

After the jump, the rest of this entry will go into more precise detail (~3,000 words) about the game and some of its implications.

The game

In this section, I define the actors, the procedure, and the payoffs in the game. The actors, and their genders assigned using a random number generator:

  • The author [F]
  • The reviewer [F]
  • The editors of two journals: the Journal of Studies One (JSO) and the Journal of Studies Two (JST) [both M]
  • The reader [F]

Before describing the procedure of the game, I have to describe a production process, herein called “reading the paper”, which takes as input some level of costly effort, and produces as output a single measure of the utility/quality of the paper and a list of errors found in the paper.

  • For concreteness, let us set the true utility of the paper to be between 0 and 100 utils (where utils are the made-up unit economists use to measure utility; given a univariate measure, I use “utility” and “quality” interchangeably).
  • All papers have a positive number of errors. Errors may be factual blunders, badly-explained parts, omitted relevant citations, or included irrelevant citations. Fixing errors raises the utility of the paper.
  • The output measure equals the true utility of the paper plus a Normally-distributed error.
  • The mean of the error will be assumed to be zero for now.
  • The spread of the measure will shrink as effort rises; let us set the standard deviation equal to 100/(1 + effort). Truly extensive effort will thus reveal the true value of the paper with near-certainty.
  • The set of errors caught in the paper is a fraction of the total number of errors, where greater effort finds more errors.

I like the fact that, although the true utility of the paper is between 0 and 100, the Normal distribution of errors is not bounded. A zero-effort review will be a draw from a Normal distribution with mean equal to paper quality and standard deviation equal to 100, making any value between about -100 and 200 plausible. Having read many referee reports, the possibility that a reviewer sends back an impossibly high or low rating seems realistic to me.
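To make the production process concrete, here is a minimal simulation sketch in Python. The quality range and the standard deviation of 100/(1 + effort) come straight from the list above; the fraction of errors caught, effort/(1 + effort), is my own illustrative assumption, since the model only says that greater effort finds more errors.

```python
import numpy as np

rng = np.random.default_rng(42)

def read_paper(true_quality, n_errors, effort):
    """One run of the 'reading the paper' production process.
    Returns (measured quality, number of errors found)."""
    # Measured quality = true quality + Normal noise, sd = 100/(1 + effort).
    sd = 100.0 / (1.0 + effort)
    measured = true_quality + rng.normal(0.0, sd)

    # The model only says greater effort finds more errors; the specific
    # fraction effort/(1 + effort) is an illustrative assumption.
    frac_caught = effort / (1.0 + effort)
    errors_found = rng.binomial(n_errors, frac_caught)
    return measured, errors_found

# A zero-effort read is a draw with sd = 100 and finds nothing; a careful
# read (effort = 20) has sd under 5 and catches most of the errors.
print(read_paper(true_quality=60, n_errors=12, effort=0))
print(read_paper(true_quality=60, n_errors=12, effort=20))
```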

Here in the real world, the cost of reading a paper is high enough that it is impossible for any actor to read more than a small fraction of the papers submitted for publication. That will be the only cost in the game. From the 1600s to the 1900s, it was publication and dissemination that was the real cost, and we would have written a different model for publication during that period. Even in the present day, a journal of sufficiently high quality that we care to pay it attention will have an editorial staff who fixes typos and gets the margins right, but I am comfortable abstracting away from that. And publishing a PDF on the world wide web is cheap enough that we can consider it to be free.

By assuming that the real cost is editor, referee, and reader time, this model predicts no difference in peer review quality across access types, and I invite those who believe there is a difference to extend this or any other model to explain how reader fee mechanisms affect referee effort levels.

The procedure for the peer-review game:

  • The author writes a paper. As above, it has a certain unobserved utility to readers.
  • The author submits it for review to the JSO.
  • The JSO editor sends the paper to a reviewer.
  • The reviewer exerts effort to read the paper, and reports her measure to the editor and the errors found to the author.
  • The editor makes a binary decision to publish or not based on whether the reported measure is greater than a preselected cutoff. The cutoff is public knowledge (in the form of the journal’s reputation).
  • If rejected, the author revises the paper based on comments from the reviewer and submits to the JST, which repeats the above steps.
  • If published, the reader decides whether to exert effort to read the paper.
  • The reader cites the paper if the mean posterior estimate of the utility is greater than a cutoff set by the reader.
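Here is a minimal sketch of that procedure in code, under two assumptions the game leaves open: reviewer effort is fixed at one level for both journals, and fixing the errors the first reviewer found raises the paper’s quality by a fixed amount (the revision_gain number is purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(7)

def review(quality, effort):
    """A single review: true quality plus Normal noise with sd = 100/(1 + effort)."""
    return quality + rng.normal(0.0, 100.0 / (1.0 + effort))

def run_game(quality, jso_cutoff=70, jst_cutoff=50, effort=9, revision_gain=5):
    """One pass through the submission procedure. revision_gain is an
    illustrative assumption for how much fixing the reviewer's reported
    errors helps. Returns (journal, final quality); journal is None if
    the paper is rejected by both."""
    if review(quality, effort) > jso_cutoff:
        return "JSO", quality
    quality = min(100.0, quality + revision_gain)   # revise using JSO comments
    if review(quality, effort) > jst_cutoff:
        return "JST", quality
    return None, quality

# Where do papers of a given true quality tend to land?
outcomes = [run_game(quality=65)[0] for _ in range(10_000)]
print({journal: outcomes.count(journal) for journal in ("JSO", "JST", None)})
```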

Finally, we need to define incentives for the players. As above, I have chosen not to give specific cost or utility functions, because they won’t make sense in the context of a one-shot game.

  • For all actors, time and effort are scarce.
  • The author wants the paper to be cited.
  • The reviewer wants… we’re not sure yet. See the next entry. But we will here assume that the reviewer has enough incentive to produce a reasonably informative report (say, one with standard deviation less than ten).
  • The editors want papers in the journal to be read and cited.
  • The reader wants to efficiently learn new things and build her own research on the shoulders of reliable work.

Even with the acknowledged holes in the model, we can already start to consider what the actors will do.

The author

The author’s strategic options are limited because the game form states that she will submit to the JSO first and then, if necessary, the JST. The real-world problem is much more complex: the top-tier journals have both stated and unstated editorial preferences, and frequently reject papers that are correct and well-done but don’t seem important in the eyes of the editors. There is greater competition for space in the top-tier journals because everybody submits to them. So an author who wants greater odds of quicker publication might skip the JSO and go straight to the JST. This is especially true in the social sciences, where a half-year wait for a cursory review is not unusual.

As written, the author has two choices in allocating effort: the first is the effort expended in initially writing the paper, and the second is the effort expended revising after review. The constraints on how much effort is expended would be based on out-of-game considerations regarding hours in the day, the author’s innate abilities, and effort allocated to other papers elsewhere in the pipeline.

If we assume that fixing errors is costless once they are pointed out, and that fixing errors improves the quality of the paper, then it makes sense to wait for reviewers to find bugs in the paper.

Assertion: One could devise specific payoffs for this game such that the strategy of writing a low-effort paper, submitting it for review, then rewriting based on the reviewer’s comments is optimal. I will call this the “post-review rewrite strategy”.

The post-review rewrite strategy is not unheard of in academia, and is especially reasonable for academics who are socially disconnected from the given field and would otherwise be unable to obtain high-quality commentary.
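As a sanity check on the assertion, here is one toy payoff specification, with every number an illustrative assumption, under which the post-review rewrite strategy beats writing a careful first draft. Since the author only cares about being cited, not about which journal the paper lands in, the cheaper strategy wins.

```python
# All numbers below are illustrative assumptions, not part of the model.
JSO_CUTOFF, JST_CUTOFF, READER_CUTOFF = 75, 50, 60
CITATION_VALUE = 100.0
FIX_GAIN = 15        # fixing the errors the reviewer found is free and adds quality

def payoff(initial_quality, writing_cost):
    """Assume reviews are accurate (high reviewer effort), so acceptance
    and citation depend on true quality alone."""
    quality = initial_quality
    if quality > JSO_CUTOFF:
        published = True                     # accepted at the JSO
    else:
        quality += FIX_GAIN                  # rejected: fix errors, resubmit
        published = quality > JST_CUTOFF
    cited = published and quality > READER_CUTOFF
    return CITATION_VALUE * cited - writing_cost

print("careful first draft:", payoff(initial_quality=80, writing_cost=40))  # 60.0
print("post-review rewrite:", payoff(initial_quality=55, writing_cost=10))  # 90.0
```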

The editor

Real-world editors have the difficult task of selecting appropriate reviewers and getting them to take the job seriously and reply in time. If there were one reviewer who tended to exert high effort and one who tended toward low effort, the editor would have to choose which papers to send to which reviewers. The game here abstracts those considerations away.

Also, selecting the cutoff is nontrivial and requires more information than we have in this game, like page counts, an estimate of the overall distribution of paper quality in the field, et cetera.

We might expect that the editor of the JST has a lower cutoff than the editor of the JSO, because the JSO editor has already skimmed off the highest-quality papers, but because the author has the option to use the post-review rewrite strategy, even this is not given. If the post-review rewrite is common, the JST will be a uniformly better journal.

Following that strategic thread a little further, if the JSO editor has some control over which referee a paper gets, he may want to route papers he is confident will be rejected to low-effort referees, because a careful review otherwise just helps out his competitor at the JST.

Theorem: As more people read and evaluate a paper before the publication decision, the odds of both false acceptance and false rejection fall.

This is a theorem and not an assertion because we already have enough structure to prove it. Simply apply Bayesian updating using the error form above.
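For a concrete illustration of how the math works out, suppose the editor pooled n independent reads by averaging them (a pooling rule the game doesn’t actually specify). The averaged measure’s standard deviation shrinks like 1/sqrt(n), so a paper sitting on the wrong side of the cutoff is misclassified less and less often:

```python
import numpy as np
from scipy.stats import norm

def misclassification_rate(quality, cutoff, n_reads, effort=4):
    """Probability of a wrong publish/reject call if n_reads independent
    reads are pooled by averaging (the pooling rule is my assumption).
    Each read has sd = 100/(1 + effort); the average has that sd / sqrt(n)."""
    sd_mean = (100.0 / (1.0 + effort)) / np.sqrt(n_reads)
    p_accept = 1.0 - norm.cdf(cutoff, loc=quality, scale=sd_mean)
    # False acceptance if the paper is really below the cutoff,
    # false rejection if it is really above.
    return p_accept if quality < cutoff else 1.0 - p_accept

# A quality-55 paper facing a cutoff of 60 is falsely accepted less often,
# and a quality-65 paper is falsely rejected less often, as reads pile up.
for n in (1, 3, 10, 100):
    print(n, round(misclassification_rate(55, 60, n), 3),
             round(misclassification_rate(65, 60, n), 3))
```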

More eyeballs also improve the odds of high quality, because there are more chances for feedback and for catching errors large and small. The best papers are often those that were presented at seminars, read by colleagues, and otherwise vetted before the formal journal peer review process.

The reader

Any combination of reader strategies—read only JSO, read only JST, read both, read neither—could be sustained by some cost function. But even within the game as written, we can say a lot about what readers will do.

Assertion: Given that the reader has chosen to read a journal, as the cutoff for publication in that journal rises, the reader will exert less effort in reading each paper.

The explanation for the assertion is that publication indicates that there has already been one read of the paper, itself unobserved, that found a utility greater than the cutoff. As the cutoff rises, the expected value of that unobserved measurement rises. The reader’s goal is to determine whether the expected value of the paper is greater than her own cutoff, and she can get by with a noisier measure of her own when there is already an indication of a higher measurement.

That is good because our goal is efficient dissemination, so if a hundred readers can let their guard down a little because the editor maintained a high standard, then that’s a clear victory for peer review. This is bad because it means that the occasional schlock that gets into a top journal will be less likely to be discovered as bad. In the pop press, where readers are unable to put in the effort to evaluate a paper, we see this taken to the extreme, where anything in a top journal is taken as gospel and anything not in a well-regarded peer reviewed journal is marked as suspect. That’s over-amplifying a signal that could have been from three or four people who each put an hour into reading the paper before publication.

You may have read about the arsenic-based life embarrassment in Science, the journal that published the article throwing stones at other journals’ peer review mechanisms. As the linked blog points out, being published in a peer-reviewed journal means that one editor and between one and three reviewers thought that the paper was worth publishing. If there was lab work or data analysis, the odds are frankly very low that the reviewer really evaluated the quality of any of it.

The model here doesn’t really take learning from the paper into account, as the reader simply wants to cite good papers and not cite bad papers. One could construct a cost function for reading the paper such that, if the editor’s cutoff is larger than the reader’s cutoff, then the reader doesn’t read the paper at all, but cites the paper anyway.

A student wrote to Richard Feynman, pointing out an error in his Lectures on Physics that she discovered because she repeated his mistake on an exam and was marked off. Feynman replied:

[…] You should, in science, believe logic and arguments, carefully drawn, and not authorities. […] I goofed. And you goofed, too, for believing me.

Assertion: If reviewer effort is observable (which it isn’t in this game), then as reviewer effort rises, the reader will exert less effort in reading the paper.

This assertion (whose proof would be much like the last one) tells us that reviewer effort can substitute for reader effort. This would also improve efficiency, as one reviewer who puts ten hours into reading a paper could save a thousand readers an hour each.
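To put a number on the substitution, suppose the reader combines her own read with the reviewer’s (observable, in this variant) by the usual precision-weighting of two independent Normal measurements, so precisions add. The effort she needs to pin the paper down to a target standard deviation (5 below, an arbitrary illustration) falls as the reviewer’s effort rises:

```python
def reader_effort_needed(target_sd, reviewer_effort):
    """Reader effort required so that the precision-weighted combination of
    her read and the reviewer's read has sd <= target_sd. Both reads have
    sd = 100/(1 + effort), and precisions (1/sd^2) add."""
    target_precision = 1.0 / target_sd**2
    reviewer_precision = ((1.0 + reviewer_effort) / 100.0) ** 2
    needed = target_precision - reviewer_precision
    if needed <= 0:
        return 0.0                        # the review alone already suffices
    reader_sd = needed ** -0.5
    return 100.0 / reader_sd - 1.0        # invert sd = 100/(1 + effort)

# A careless review (effort 2) leaves the reader doing most of the work;
# a careful one (effort 25) lets her skip the close read entirely.
for reviewer_effort in (2, 10, 25):
    print(reviewer_effort, round(reader_effort_needed(5.0, reviewer_effort), 1))
```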

Theorem: All else equal, readers will be more likely to read papers in a journal with a higher cutoff.

The reader makes a cost/benefit calculation before reading: if (benefit to citing a good paper)*P(posterior paper quality is above reader cutoff) > (cost of eval), then read; else it’s probably not worth it. The theorem comes directly from the fact that P(posterior paper quality is above reader cutoff) rises as the editor’s cutoff rises. The proof is once again a not-difficult application of the Bayesian updating formula.
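A quick way to see the direction of the effect is to simulate that probability directly, under an assumption the game doesn’t make: paper quality drawn uniformly from [0, 100], with one review at a fixed effort level. The probability that a published paper clears the reader’s cutoff rises with the editor’s cutoff, and the read/don’t-read rule follows:

```python
import numpy as np

rng = np.random.default_rng(3)

def p_good_given_published(editor_cutoff, reader_cutoff=60,
                           reviewer_effort=9, n_sims=200_000):
    """P(quality > reader_cutoff | the lone review exceeded editor_cutoff),
    assuming quality ~ Uniform(0, 100), a prior the model doesn't specify,
    and one review with sd = 100/(1 + reviewer_effort)."""
    quality = rng.uniform(0, 100, n_sims)
    review = quality + rng.normal(0, 100.0 / (1.0 + reviewer_effort), n_sims)
    published = review > editor_cutoff
    return (quality[published] > reader_cutoff).mean()

def worth_reading(benefit, cost, editor_cutoff):
    """The reader's rule from the text: read when the expected benefit of
    finding a citable paper outweighs the cost of evaluating it."""
    return benefit * p_good_given_published(editor_cutoff) > cost

for cutoff in (40, 60, 80):
    print(cutoff, round(p_good_given_published(cutoff), 3))
```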

So, yes, reputation matters, and journals with a stringent cutoff are more worth reading. Reviewer effort is not necessarily in any way related to the editor’s cutoff, but we can add another theorem about reviewer effort.

Theorem: Assume (contrary to the game as described) that readers can somehow observe review effort. All else equal, readers will be more likely to read papers in a journal with higher-effort reviews.

The proof involves going through the same math about Bayesian updating with Normal distributions.

So even without explicit cost functions and ignoring the reviewer’s incentive problem, we already have some results:

  • The top-tier journal does not necessarily have the best papers, because papers in lower-tier journals have undergone more phases of review.
  • A paper being published means that its quality was evaluated to be high enough by one or two reviewers. I didn’t write out a distribution for paper quality, but both false positives and false negatives will be common in any reasonably-calibrated model.
  • Readers will exert less effort in evaluating publications with high standards, which gives editors a good incentive to keep standards high, and which creates efficiencies. But the actual signal is not very strong, and in the real world the weak signal may be over-stressed.
  • If we could observe referee effort, then we could use that to pick which journals are worth reading, and could exert less effort in reading them. Too bad that’s impossible.