GC Correction for TF-peak links and additional criteria for foreground and background
As discussed, an additional feature will be implemented that constructs a GC-corrected background in addition to the GC-unaware one we use now. The FDRs for the TF-peak links will then also come in two versions.
Initial questions for the specific implementation and edge cases we have to deal with:
- Do we take the GC content of the peaks or of the specific TFBS? Both should be possible, as we originally assign a 1 or 0 to a TF-peak pair based on the predicted TFBS. If we use the GC content of the peak, it might not correlate well with the GC content of the actual TFBS, especially for larger peaks. If we use the GC content of the TFBS, a peak will usually have multiple GC contents, namely one for each TFBS that overlaps the peak. I have not thought it through fully, but I guess the TFBS-specific one should be preferred? I am not yet sure what it means for the implementation or the logic of our FDR procedure, though, when one peak has multiple GC contents.
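A minimal sketch of why the choice matters, using plain sequence strings (in the real implementation the sequences would come from a genome FASTA, e.g. via pysam or Biostrings). A large, overall AT-rich peak can contain a small GC-rich TFBS, so the two GC values can diverge strongly:

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Toy example: a 50 bp "peak" that is AT-rich overall but contains a
# GC-rich 10 bp "TFBS" in its center.
peak_seq = "AT" * 10 + "GCGCGCCGGC" + "AT" * 10
tfbs_seq = peak_seq[20:30]

print(gc_content(peak_seq))  # 0.2  (peak-level GC)
print(gc_content(tfbs_seq))  # 1.0  (TFBS-level GC)
```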
- What do we do if the GC-specific distribution of the foreground cannot be matched by the background? For example, 80% of the foreground is GC-rich, while the background does not contain many GC-rich regions. At which point do we have to flag the background curve because it does NOT match the foreground well? Shall we introduce a new metric, scaled from 0 to 1, that captures how well the GC matching of the background worked, so we get an idea of how well this works overall? How would such a metric be constructed?
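One possible construction for such a 0-to-1 metric (an assumption on my side, not a decided design): one minus the total variation distance between the foreground and background GC-bin distributions.

```python
def gc_match_score(fg_fracs, bg_fracs):
    """1 minus the total variation distance between the foreground and
    background GC-bin distributions: 1.0 = perfect match, 0.0 = disjoint.
    Both inputs are per-bin fractions that each sum to 1."""
    return 1 - 0.5 * sum(abs(f - b) for f, b in zip(fg_fracs, bg_fracs))

# Three GC bins (low / mid / high); 80% of the foreground is GC-rich
# but the background is mostly GC-poor:
fg = [0.10, 0.10, 0.80]
bg = [0.40, 0.40, 0.20]
print(round(gc_match_score(fg, bg), 2))  # 0.4
```

A score near 1 would mean the matched background mirrors the foreground well; a flagging threshold (e.g. below 0.5) would still need to be chosen empirically.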
- Sizes of background and foreground: Shall we require minimum sizes for foreground and background so that we have some confidence in the distributions that form the basis of our FDR procedure? Taking SP1 as an example, is a background of 3k enough for us given a foreground of 20k, and what do we do if either foreground or background is too small? I guess setting the result to NA would be best, as we should not have (false) confidence when the numbers are too low. This point is actually independent of the GC correction.
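The NA idea could look like this; the threshold values below are placeholders, not decided numbers:

```python
import math

# Hypothetical minimum sizes; the actual thresholds are still to be decided.
MIN_FOREGROUND = 1_000
MIN_BACKGROUND = 5_000

def fdr_or_na(n_foreground: int, n_background: int, fdr: float) -> float:
    """Report the FDR only when both sets are large enough, NaN (= NA) otherwise."""
    if n_foreground < MIN_FOREGROUND or n_background < MIN_BACKGROUND:
        return math.nan
    return fdr

# SP1-like case from above: 20k foreground but only 3k background -> NA.
print(fdr_or_na(20_000, 3_000, 0.05))  # nan
```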
- How do we balance between maximizing the number of peaks in the background and mimicking the foreground GC distribution as closely as possible? Which quantitative rules do we want to apply? Example: 10k foreground regions, 15k background regions. 80% of the foreground is GC-rich (80% GC and above), while only 2k background regions are GC-rich. We can then either (1) use all 2k GC-rich background regions, so that they also make up 80% relatively speaking, and fill the remaining 20% with a matching that is as good as possible for the other GC bins. That still means we will not have more than about 2.5k (= 2k / 0.8) background regions in total, due to the constraint that the GC-rich regions should be 80% of all regions, so we throw away many background regions. Or (2) we use more background regions, but then we cannot maintain the 80% fraction. How do we balance this? We need specific rules. Across datasets, we will for sure run into all of these edge cases, so I want to have a good idea of them already while implementing.
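Option (1) can be sketched as binned sampling, assuming GC bins are precomputed labels (function and variable names here are hypothetical). The scarcest bin caps the total background size, which makes the trade-off explicit:

```python
import random
from collections import Counter

def match_background(fg_gc_bins, bg_regions_by_bin, seed=0):
    """Largest background whose GC-bin composition mirrors the foreground.

    fg_gc_bins: list of GC-bin labels, one per foreground region.
    bg_regions_by_bin: dict mapping bin label -> list of candidate bg regions.
    """
    rng = random.Random(seed)
    fg_counts = Counter(fg_gc_bins)
    n_fg = sum(fg_counts.values())
    # Largest total size N such that every bin b can supply its share
    # frac_b * N from the available background pool (the scarcest bin caps N):
    n_total = min(len(bg_regions_by_bin[b]) * n_fg // c
                  for b, c in fg_counts.items())
    picked = []
    for b, c in fg_counts.items():
        picked += rng.sample(bg_regions_by_bin[b], c * n_total // n_fg)
    return picked

# The example from above: 10k foreground (80% GC-rich), 15k background of
# which only 2k are GC-rich -> the matched background is capped at ~2.5k.
fg = ["GC-rich"] * 8_000 + ["GC-poor"] * 2_000
bg = {"GC-rich": list(range(2_000)), "GC-poor": list(range(13_000))}
print(len(match_background(fg, bg)))  # 2500
```

Option (2) would then correspond to relaxing the per-bin cap, e.g. allowing each bin's background fraction to deviate from the foreground fraction by some tolerance; the quantitative rule for that tolerance is exactly what still needs to be decided.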
I will start with this asap after the Moritz data analysis, so it might only happen in the new year.