[TYPES] Two-tier reviewing process

Andrew Myers andru at cs.cornell.edu
Sat Jan 16 17:37:49 EST 2010


By popular request, the following is a more detailed description of the
reviewing process that Dave Evans and I used for IEEE Security and 
Privacy 2009.

The reviewing process used by Oakland 2009 was adapted from a two-tier
process used successfully by a few conferences in previous years. It was
pioneered by Tom Anderson for SIGCOMM 2006, and used subsequently
by SOSP 2007 and OSDI 2008.

Unlike most conference review processes, we had a two-tier PC, and
three rounds of reviewing. I believe that this structure helped us
make more informed decisions, led to better discussions at the PC
meeting, gave authors more feedback, and resulted in a better product
overall. We had 77 days to review 253 submissions. This may sound like
a lot of time, but reviewing for Oakland stretches across Christmas
and other winter holidays.

The PC of 50 people was divided 25/25 into 'heavy' and 'light' groups.
Despite the names, these PC members did similar amounts of work.  The
heavy members did a few more reviews and attended the PC meeting;
the light members participated in electronic discussion before the
meeting. Dividing the PC in half meant that we had a smaller group
at the PC meeting and more effective discussions than in previous
years. A two-tier PC also helped us recruit some PC members who
preferred not to travel. We did not distinguish between heavy and
light members in any external documents such as the proceedings. I
think this helped us recruit light members.

Reviewing proceeded in three rounds, seen pictorially at
http://www.cs.cornell.edu/andru/oakland09/reviewing-slides.pdf.  We
started round 1 with 249 credible papers. Each paper received one
heavy and one light reviewer. Reviewers had 35 days to complete up to
12 reviews. Based on these initial reviews, the chairs rejected 36
papers and marked 33 papers as probable rejects.

In round 2, we had 180 papers considered fully live, each of which
received an additional heavy and light review. Papers considered
probable rejects were assigned just one additional reviewer. Round 2
started just after Christmas, and reviewers had 20 days to complete
up to 12 reviews. After round 2, we had 3-4 reviews per live
paper. Papers whose reviews were all negative were rejected at this
point, with some electronic discussion to make sure everyone involved
agreed.

By round 3, we were down to 68 papers, most of which were pretty
good.  Each live paper now received one additional heavy review,
ensuring that there were three reviewers present at the PC meeting for
each discussed paper.  Reviewers received up to five papers to review,
in ten days.  Based on these reviews and more electronic discussion,
we rejected four more papers.  All papers with some support at this
point made it to the PC meeting.  The chairs actively worked to
resolve papers through electronic discussion, which was important in
achieving closure.

The PC meeting was a day and a half long, and resulted in 26 of
the 68 papers being chosen for the program. Each paper was assigned
a lead reviewer ahead of time. The lead reviewer presented not
only their own view, but also those of the light reviewers who
were not present. Where possible, we chose lead reviewers who were
positive and confident about their reviews. At some points, we had
breakout sessions for small groups of reviewers to discuss papers in
parallel. However, no paper was accepted without the whole PC hearing the
reasons for acceptance. This seems important for a broad conference like
Oakland (or POPL).

One benefit of multiple rounds of reviewing was that we could do a
better job of assigning reviewers in later rounds, for three reasons:
first, the reviews helped us understand what the key issues were;
second, we asked reviewers explicitly for suggestions; third, we
could identify the problematic papers where all the reviews were
low-confidence and do hole-filling.  We also asked external experts to
help review papers where we didn't have enough expertise in-house.

In the end, all papers received between 2 and 8 reviews, and accepted
papers received between 5 and 8 reviews. The multiround structure
meant that reviewing effort was concentrated on the stronger papers,
and authors of accepted papers got more feedback, and often more
expert feedback, than they had in previous years. The reviewing load
was slightly higher than in previous years for heavy reviewers (~23
reviews each) and slightly lower for light reviewers (~20 each).
Keeping load mostly
constant was possible because we had a larger PC than in the past.
The two-tier structure meant that despite a larger PC, we could have a
smaller PC meeting.
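
As a rough sanity check on those loads, here is a back-of-the-envelope
sketch in Python (purely illustrative; it assumes reviews were spread
evenly over the 25 heavy and 25 light members, and ignores external
reviews and extra reviews on contested papers, which account for the
gap from the actual ~23/~20):

    # Evenly-split model of per-reviewer load across the three rounds.
    HEAVY = LIGHT = 25

    # Round 1: 249 papers, one heavy and one light review each.
    r1 = 249 / HEAVY                 # ~10 reviews per member

    # Round 2: 180 live papers get one more heavy and one more light
    # review; 33 probable rejects get a single extra review (assume
    # those extras split evenly between the two groups).
    r2 = (180 + 33 / 2) / HEAVY      # ~8 reviews per member

    # Round 3: 68 live papers, one additional heavy review each.
    r3 = 68 / HEAVY                  # ~3 reviews per heavy member

    print(round(r1 + r2 + r3))       # ~21 for heavy members
    print(round(r1 + r2))            # ~18 for light members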

Filtering out weak papers early helped keep the reviewing load
manageable.  Papers were rejected after round 1 only when they had
two confident, strongly negative reviews. The chairs did this in
consultation with each other. Papers with very negative reviews
but without high confidence, or confident reviews that were not as
negative, were considered probable rejects and assigned a third review
in round 2. If that review was positive, the paper received three
reviews in round 3 instead of the usual one, ensuring that it made
it to the PC meeting (this only happened in a couple of cases).  PC
members did not report any concerns to us that good papers might have
been filtered out early.

Assigning the right reviewers in round 1 makes both filtering and
assignment of additional reviewers more effective. To be able to
assign round-1 reviewers efficiently, it is important for the chairs
to get as much information from the PC as possible about which papers
they would like to review and which topics they are expert in.

A final issue we put thought into was the rating scale. While the
rating scale might not seem that important, in past years the Oakland
committee had found that a badly designed rating scale could cause
problems.  The four-point Identify the Champion scale (A-D) used by
many PL conferences works fine for single-round reviewing.  But for
multiple rounds with early filtering, it's helpful to distinguish the
papers that are truly weak from the ones that merely don't make the
grade. Therefore, ratings came from the following scale:

    1: Strong reject. Will argue strongly to reject.
    2: Reject. Will argue to reject. (Identify the Champion's D)
    3: Weak reject. Will not argue to reject. (C)
    4: Weak accept. Will not argue to accept. (B)
    5: Accept. Will argue to accept. (A)
    6: Strong accept. Will argue strongly to accept.
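
For concreteness, the scale could be encoded along the following lines
(a hypothetical Python sketch; the extreme points 1 and 6 have no
direct Identify the Champion equivalent and are folded into D and A
here only for comparison):

    from enum import IntEnum

    class Rating(IntEnum):
        # Six-point scale with no neutral midpoint.
        STRONG_REJECT = 1   # will argue strongly to reject
        REJECT        = 2   # will argue to reject (ItC's D)
        WEAK_REJECT   = 3   # will not argue to reject (C)
        WEAK_ACCEPT   = 4   # will not argue to accept (B)
        ACCEPT        = 5   # will argue to accept (A)
        STRONG_ACCEPT = 6   # will argue strongly to accept

    # Comparison with the four-point Identify the Champion scale;
    # folding 1 and 6 into D and A is just for the mapping.
    ITC_GRADE = {1: 'D', 2: 'D', 3: 'C', 4: 'B', 5: 'A', 6: 'A'}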

As in Identify the Champion, giving the ratings meaningful semantics
helped ensure consistency across reviewers.  Papers that received 1's
and 2's were easy to filter out after round 1; we rejected papers
with confident 1/1 or 1/2 ratings, and some 2/2's. Having the extreme
ratings of 1 and 6 also seemed to give reviewers a little more excuse
to use 2 and 5 as ratings, staking out stronger positions than they
might have otherwise. The absence of a middle 'neutral' point usefully
forced reviewers to lean one way or the other.
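
A minimal sketch of the round-1 triage implied by this scale, again
hypothetical (in practice the chairs made these calls by hand, and
borderline combinations such as confident 2/2s were decided case by
case):

    # Each review is a (rating, confident) pair, with rating on the
    # 1-6 scale above. Only the clear-cut cases are captured here.
    def triage_round1(reviews):
        ratings   = sorted(r for r, _ in reviews)
        confident = all(c for _, c in reviews)
        negative  = all(r <= 2 for r in ratings)   # only 1s and 2s

        if negative and confident and ratings != [2, 2]:
            return 'reject'            # confident 1/1 or 1/2
        if negative:
            return 'probable reject'   # gets one extra review in round 2
        return 'live'                  # gets two extra reviews in round 2

    print(triage_round1([(1, True), (2, True)]))    # reject
    print(triage_round1([(1, False), (2, True)]))   # probable reject
    print(triage_round1([(3, True), (2, True)]))    # live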

Overall, this reviewing process probably involved somewhat more total
work for the chairs than a conventional reviewing process, but it was
also spread out more over the reviewing period. Problems could be
identified and addressed much earlier. Total work for PC members was
comparable to a conventional process.  Some PC members appreciated
that the multiple intermediate deadlines prevented a last-minute rush
to get reviews done, and that the average quality of reviewed papers
was higher.

Hope this helps,

-- Andrew


