Independent filtering and confusing consequences
I am hoping for some ideas or feedback. Maybe I am not seeing this clearly enough, but here is what I noticed today when looking at the results from Neha's CD4+ dataset: in the PDF that compares the permuted and real numbers (the number of connections that survive), a somewhat strange and non-intuitive phenomenon currently happens:
- When filtering with TF-peak FDR of 0.2, 20 connections survive
- When filtering with TF-peak FDR of 0.3, only 3 connections survive
This sounds like a bug, but the actual reason is different:
- All filters are identical except the TF-peak FDR. The difference arises because the BH adjustment of the peak-gene raw p-values is done as one of the last filters. This is a kind of independent filtering that is quite typical in similar areas (DESeq, for example), so that the adjusted p-values are not as affected by the sheer number of rows in the beginning. With a TF-peak FDR of 0.3, more rows survive up to that point, but the BH adjustment then runs over more rows, which essentially results in higher adjusted p-values for the peak-gene links, and these are then filtered out. For the 0.2 variant, fewer rows enter the adjustment, so more survive in the end.
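To make the effect concrete, here is a minimal toy sketch in Python (not the actual package code). The p-values, the 0.1 peak-gene threshold, and the function names `bh_adjust` and `count_survivors` are all made up for illustration. It only shows that running BH over a larger set of rows can leave fewer survivors than running it over a stricter subset.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted p-values are capped at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def count_survivors(pvalues, fdr_threshold):
    """Number of rows whose BH-adjusted p-value passes the threshold."""
    return sum(q <= fdr_threshold for q in bh_adjust(pvalues))


# Hypothetical peak-gene raw p-values left after a strict TF-peak filter.
strict_set = [0.01, 0.02, 0.03, 0.04]

# A looser TF-peak filter keeps the same rows plus 16 unremarkable ones.
extra_rows = [0.5 + 0.025 * i for i in range(16)]
loose_set = strict_set + extra_rows

n_strict = count_survivors(strict_set, fdr_threshold=0.1)
n_loose = count_survivors(loose_set, fdr_threshold=0.1)
print(n_strict, n_loose)
```

In this toy case the four small p-values get an adjusted value of about 0.04 when only 4 rows are adjusted, but 0.2 when 20 rows are adjusted, so the looser upstream filter ends with fewer connections.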
What to do? So far this was only visible for the permuted versions because the numbers are so small there; I have not seen this effect for the real data. Still, it looks weird at first.
Happy for suggestions. If I apply this filter to the same set of connections at the beginning, almost nothing will survive, because adjusting millions of rows... you know.