My bet for the overall system architecture:
The cores in each chiplet are logically equidistant, that is, they form a single CCX. The L3 is 32MB and equally reachable from all cores, with lines hashed to the cache slices by the lowest bits of the line address. It is not radically faster than the Zen L3, and might actually be a bit slower: 30 cycles is already very fast for an 8MB cache, and while shrinking transistors help speed, quadrupling the size is going to hurt it.
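To make the slice hashing concrete, here is a rough sketch of picking an L3 slice from the low bits of the line address. The 64-byte line size and the count of 8 slices are my assumptions for illustration, not anything from a leak, and the function name is made up.

```python
LINE_BYTES = 64   # assumed cache line size
NUM_SLICES = 8    # assumed slice count (one per core); must be a power of two


def l3_slice_for_address(phys_addr, num_slices=NUM_SLICES, line_bytes=LINE_BYTES):
    """Return the slice index chosen by the lowest bits of the line address."""
    line_addr = phys_addr // line_bytes      # drop the offset-within-line bits
    return line_addr & (num_slices - 1)      # keep the lowest log2(num_slices) bits


# Consecutive lines land on consecutive slices and wrap around after the last one.
slices = [l3_slice_for_address(i * LINE_BYTES) for i in range(10)]
print(slices)
```

With this kind of interleave a streaming access pattern spreads evenly over the slices, which is why every core sees the same average distance to the L3.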
There is a huge memory-side L4 (that is, it only holds data from the memory channels connected to the die). It is either fully inclusive of all the data read from the attached memory channels, or it has enough extra tags to at least hold tag entries for all cache lines currently held from those areas of RAM. This allows it to manage all the coherency for multi-socket systems. The size of that L4 is probably 512MB, because the very first leak that mentioned 8+1 dies said that.
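A quick back-of-the-envelope check on the tag requirement, as a sketch only. It assumes 64-byte lines and a two-socket system (both my assumptions), takes the 32MB L3 and 8 chiplets per socket from the guess above, and ignores whatever the L1/L2 caches hold on top of the L3.

```python
LINE_BYTES = 64          # assumed cache line size
MIB = 1024 * 1024


def lines_in(capacity_bytes, line_bytes=LINE_BYTES):
    """Number of cache lines that fit in a cache of the given capacity."""
    return capacity_bytes // line_bytes


l4_lines = lines_in(512 * MIB)             # lines the guessed 512MB L4 can hold
l3_lines_per_chiplet = lines_in(32 * MIB)  # lines in one chiplet's 32MB L3
chiplets_per_socket = 8                    # from the 8+1 die layout
sockets = 2                                # assumption: dual-socket system

# Worst case for one IO die: every L3 in the system is filled with lines
# that all come from the memory attached to that single die.
worst_case_l3_lines = l3_lines_per_chiplet * chiplets_per_socket * sockets

print(l4_lines, worst_case_l3_lines)
```

Under these assumptions the two numbers come out equal, so a 512MB L4 would have just enough lines to track every L3 in a two-socket system even in the worst case.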
The fact that the chiplets are paired up is a red herring, probably done for compatibility with existing cooling solutions or something similar. There is no data connection between those dies, except through the IO die.