[TYPES] Two-tier reviewing process

Michael Hicks mwh at cs.umd.edu
Sat Jan 23 16:10:40 EST 2010


A quick follow-up on Andrew's post.

I attended the discussion at POPL on Wednesday (the 20th), and it  
seemed to focus on how/whether to accept more papers.  I observe that  
the original proposal posted to this list (which was soundly rejected  
by straw poll at the meeting) was aimed at improving the review  
process, making it fairer and more effective.  Simon's amended proposal  
for accepting more papers starts by saying we should have a "quality  
bar", but does not suggest how to set that bar, and then assumes that,  
whatever the method, more papers would (or should) be accepted.

I find the lack of discussion on the integrity/quality of the review  
process a bit surprising, and unfortunate.  My sense was that the  
agreement to relax current standards for paper acceptance is based  
somewhat on a resignation that reviewing, whatever the process, is  
destined to be flawed.  To the contrary, I think the review process  
can be improved, and is worth improving.  If we improve the process,  
we will feel much more confident about the papers we accept, and I  
suspect we will be more confident about accepting more of them.

To this end, I think that a tiered process can really help.  To add a  
bit more information about it (Andrew's full description is included  
below):

At the POPL discussion, one goal that was raised was to increase the  
number of "expert" reviews per paper.  People are dissatisfied when  
their paper is rejected by self-proclaimed non-experts.  I believe  
that Jens pointed out that 77% of papers at this year's POPL had one  
"X" review.  I asked Andrew what the corresponding metric was for the  
Oakland tiered review process, and he said:

 > A little hard to compare. We had a 4-point confidence scale, where 3
 > was "confident" and 4 was "this is my area".
 >
 > All papers had at least one 3 or 4 reviewer. 88% had two or more 3
 > or 4 reviewers. A little over half the papers had a 4 reviewer.

I was happy with this year's POPL program, so I agree we could probably  
keep accepting more papers (that is, more than 39).  But I think it  
would be great to improve the review process, too.

-Mike

On Jan 16, 2010, at 11:37 PM, Andrew Myers wrote:

> [ The Types Forum, http://lists.seas.upenn.edu/mailman/listinfo/types-list ]
>
> By popular request, the following is a more detailed description of
> the reviewing process that Dave Evans and I used for IEEE Security
> and Privacy 2009.
>
> The reviewing process used by Oakland 2009 was adapted from a two-tier
> process used successfully by a few conferences in previous years.  It
> was pioneered by Tom Anderson for SIGCOMM 2006, and used subsequently
> by SOSP 2007 and OSDI 2008.
>
> Unlike most conference review processes, we had a two-tier PC, and
> three rounds of reviewing. I believe that this structure helped us
> make more informed decisions, led to better discussions at the PC
> meeting, gave authors more feedback, and resulted in a better product
> overall. We had 77 days to review 253 submissions. This may sound like
> a lot of time, but reviewing for Oakland stretches across Christmas
> and other winter holidays.
>
> The PC of 50 people was divided 25/25 into 'heavy' and 'light' groups.
> Despite the names, these PC members did similar amounts of work.  The
> heavy members did a few more reviews and attended the PC meeting;
> the light members participated in electronic discussion before the
> meeting. Dividing the PC in half meant that we had a smaller group
> at the PC meeting and more effective discussions than in previous
> years. A two-tier PC also helped us recruit some PC members who
> preferred not to travel. We did not distinguish between heavy and
> light members in any external documents such as the proceedings. I
> think this helped us recruit light members.
>
> Reviewing proceeded in three rounds, seen pictorially at
> http://www.cs.cornell.edu/andru/oakland09/reviewing-slides.pdf.  We
> started round 1 with 249 credible papers. Each paper received one
> heavy and one light reviewer. Reviewers had 35 days to complete up to
> 12 reviews. Based on these initial reviews, the chairs rejected 36
> papers and marked 33 papers as probable rejects.
>
> In round 2, we had 180 papers considered fully live, each of which
> received an additional heavy and light review. Papers considered
> probable rejects were assigned just one additional reviewer. Round 2
> started just after Christmas, and reviewers had 20 days to complete
> up to 12 reviews. After round 2, we had 3-4 reviews per live
> paper. Papers whose reviews were all negative were rejected at this
> point, with some electronic discussion to make sure everyone involved
> agreed.
>
> By round 3, we were down to 68 papers, most of which were pretty
> good.  Each live paper now received one additional heavy review,
> ensuring that there were three reviewers present at the PC meeting for
> each discussed paper.  Reviewers received up to five papers to review,
> in ten days.  Based on these reviews and more electronic discussion,
> we rejected four more papers.  All papers with some support at this
> point made it to the PC meeting.  The chairs actively worked to
> resolve papers through electronic discussion, which was important in
> achieving closure.
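>
> Putting the three rounds side by side (a rough sketch using only the
> counts above; the per-paper review counts are minimums, since some
> papers picked up extra reviews along the way, and the names here are
> just for illustration):
>
>     # Three-round funnel, restated from the counts above (a sketch,
>     # not the actual assignment data).
>     funnel = [
>         # (round, days, papers fully live, min new reviews per live paper)
>         ("round 1", 35, 249, 2),   # one heavy + one light review
>         ("round 2", 20, 180, 2),   # plus one review each for 33 probable rejects
>         ("round 3", 10,  68, 1),   # one more heavy review
>     ]
>     for name, days, papers, per_paper in funnel:
>         print(f"{name}: {papers} papers, {days} days, "
>               f">= {papers * per_paper} new reviews")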
>
> The PC meeting was a day and a half long, and resulted in 26 of
> the 68 papers being chosen for the program. Each paper was assigned
> a lead reviewer ahead of time. The lead reviewer presented not
> only their own view, but also those of the light reviewers who
> were not present. Where possible, we chose lead reviewers who were
> positive and confident about their reviews. At some points, we had
> breakout sessions for small groups of reviewers to discuss papers in
> parallel. However, no paper was accepted without the whole PC hearing
> the reasons for acceptance. This seems important for a broad
> conference like Oakland (or POPL).
>
> One benefit of multiple rounds of reviewing was that we could do a
> better job of assigning reviewers in later rounds, for three reasons:
> first, the reviews helped us understand what the key issues were;
> second, we asked reviewers explicitly for suggestions; third, we
> could identify problematic papers where all the reviews were
> low-confidence and do hole-filling.  We also asked external experts to
> help review papers where we didn't have enough expertise in-house.
>
> In the end, all papers received between 2 and 8 reviews, and accepted
> papers received between 5 and 8 reviews. The multiround structure
> meant that reviewing effort was concentrated on the stronger papers,
> and authors of accepted papers got more feedback, and often more
> expert feedback, than they had in previous years. The reviewing load
> increased slightly over previous years for heavy reviewers (~23
> reviews each), but decreased slightly (~20 each) for light reviewers.
> Keeping load mostly
> constant was possible because we had a larger PC than in the past.
> The two-tier structure meant that despite a larger PC, we could have a
> smaller PC meeting.
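>
> As a rough lower bound on those loads, derived only from the counts
> above (a sketch: hole-filling, external reviews, and extra reviews for
> contentious papers push the real averages higher, and assigning the
> 33 probable-reject reviews to light members is an assumption here):
>
>     heavy = light = 0
>     heavy += 249; light += 249   # round 1: one heavy + one light per paper
>     heavy += 180; light += 180   # round 2: two more reviews per live paper
>     light += 33                  # round 2: one review per probable reject (assumed light)
>     heavy += 68                  # round 3: one more heavy review per live paper
>     print(heavy / 25, light / 25)   # roughly 20 and 18 reviews per member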
>
> Filtering out weak papers early helped keep the reviewing load
> manageable.  Papers were rejected after round 1 only when they had
> two confident, strongly negative reviews. The chairs did this in
> consultation with each other. Papers with very negative reviews
> but without high confidence, or confident reviews that were not as
> negative, were considered probable rejects and assigned a third review
> in round 2. If that review was positive, the paper received three
> reviews in round 3 instead of the usual one, ensuring that it made
> it to the PC meeting (this only happened in a couple of cases).  PC
> members did not report any concerns to us that good papers might have
> been filtered out early.
>
> Assigning the right reviewers in round 1 makes both filtering and
> assignment of additional reviewers more effective. To be able to
> assign round-1 reviewers efficiently, it is important for the chairs
> to get as much information from the PC as possible about what papers
> they would like to review and about which topics they are expert in.
>
> A final issue we put thought into was the rating scale. While the
> rating scale might not seem that important, in past years the Oakland
> committee had found that a badly designed rating scale could cause
> problems.  The four-point Identify the Champion scale (A-D) used by
> many PL conferences works fine for single-round reviewing.  But for
> multiple rounds with early filtering, it's helpful to distinguish the
> papers that are truly weak from the ones that merely don't make the
> grade. Therefore, ratings came from the following scale:
>
>    1: Strong reject. Will argue strongly to reject.
>    2: Reject. Will argue to reject. (Identify the Champion's D)
>    3: Weak reject. Will not argue to reject. (C)
>    4: Weak accept. Will not argue to accept. (B)
>    5: Accept. Will argue to accept. (A)
>    6: Strong accept. Will argue strongly to accept.
>
> As in Identify the Champion, giving the ratings meaningful semantics
> helped ensure consistency across reviewers.  Papers that received 1's
> and 2's were easy to filter out after round 1; we rejected papers
> with confident 1/1 or 1/2 ratings, and some 2/2's. Having the extreme
> ratings of 1 and 6 also seemed to give reviewers a little more excuse
> to use 2 and 5 as ratings, staking out stronger positions than they
> might have otherwise. The absence of a middle 'neutral' point usefully
> forced reviewers to lean one way or the other.
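>
> As an illustrative sketch of how the scale fed into round-1 filtering
> (the 2/2 cases were judgment calls, so treating them all as probable
> rejects below is a simplification, not the exact rule we applied, and
> the function name is just for illustration):
>
>     def round1_outcome(r1, r2, both_confident):
>         # r1, r2 are the two round-1 ratings on the 1-6 scale above.
>         lo, hi = sorted((r1, r2))
>         if both_confident and (lo, hi) in {(1, 1), (1, 2)}:
>             return "reject"            # two confident, strongly negative reviews
>         if hi <= 2:
>             return "probable reject"   # negative, but not confidently/strongly enough
>         return "live"                  # gets two more reviews in round 2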
>
> Overall, this reviewing process probably involved somewhat more total
> work for the chairs than a conventional reviewing process, but it was
> also spread out more over the reviewing period. Problems could be
> identified and addressed much earlier. Total work for PC members was
> comparable to a conventional process.  Some PC members appreciated
> that the multiple intermediate deadlines prevented a last-minute rush
> to get reviews done, and that the average quality of reviewed papers
> was higher.
>
> Hope this helps,
>
> -- Andrew
