I think in many ways a test suite would be more generally useful, but part of what I personally find challenging is simply determining what it even means to "handle a response."
A reference implementation would solve this.
I know of a few efforts in this regard, but most of them seem to run into an iron triangle around the three criteria above:
they manage one or two of the three, never all three together.
So far I have developed some heuristics for evaluating the implementations I come across.