Suggestions

Addressing spurious correlations

This is a fairly vague suggestion, since I’m not strong enough in stats to know how to fix these problems. However, I’ve noticed a few categories of obviously spurious correlations between my symptoms and factors.

  1. High-frequency variables. Things that happen almost every day (e.g., symptoms that are there by default for disorders like ADHD; daily medications) will end up having extremely high correlations, but this is obviously unhelpful. What matters more in these situations is: which factors are notably different from usual on (or before) the days when the high-frequency outcome variables are reduced?

  2. Not taking timing into account. If I have a few weeks of increased symptoms, and the symptom spike started a day or two before a new factor was introduced or increased, it’s OBVIOUSLY a coincidence (assuming the symptoms can’t have caused the factor). But if I don’t remember which came first, I have to go back and compare both variables day by day to see whether the correlation is coincidental. It would be better if the correlation could be flagged or discounted in some way to indicate that one of the two started before the other. I’ve even had this happen with symptoms that started over a week before the new factor was introduced: for instance, when I left school and didn’t have a job, I was obviously more depressed; I started a new medication a few weeks after that, and the medication ended up being correlated with the depression, even though it clearly couldn’t have caused it.

  3. Sample size mismatch. Obviously if a new or rare factor is introduced, and then there is a sudden change in a rare or stable outcome variable, this would be important to note. For instance, if I rarely have migraines, and I rarely have caffeine, but one day I drink caffeine and the next day I get a migraine, that’s very useful information. However, things that happen rarely should NOT be shown to have very high correlations with outcome variables that occur and/or change very frequently; and similarly, things that happen frequently shouldn’t be strongly linked to outcome variables that occur rarely. I know I can look at the number in parentheses to see how many times the factor occurred, but it’s still confusing to see an antibiotic that I took for a week at the top of the list just because it happened to coincide with an increase in depression, when my depression fluctuates all the time. It’s pretty unlikely that something I took for a week is going to make a big difference in a symptom that changes a lot normally. Speaking of which…

  4. Not accounting for long-term patterns and fluctuations. Sometimes I’ll start a new drug for a few weeks, and that will happen to coincide with an increase or decrease in a symptom that frequently fluctuates already, with changes that can sometimes last for a week or two. These coincidences will show up as correlations when it’s obvious that the factors are unrelated (for reasons specific to each case). It would be much more helpful to know when a factor is connected to an unusual pattern of symptom fluctuation. For example, is it consistently high/low for several days in a row when it usually goes up and down day-to-day? Is it consistently high/low across the course of a single day when it usually fluctuates hour-to-hour?
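To make the timing point (item 2) a bit more concrete for anyone implementing this, here is a rough sketch in Python of an onset-order check. The function names and the simple “first day above baseline” definition of onset are purely illustrative assumptions, not a proposal for the exact logic:

```python
def first_onset(series, baseline=0):
    """Return the index of the first day a daily series rises above
    its baseline, or None if it never does."""
    for day, value in enumerate(series):
        if value > baseline:
            return day
    return None

def onset_order_flag(factor, symptom):
    """Flag a factor/symptom correlation as suspect when the symptom
    clearly started before the factor did (so the factor cannot have
    caused it). Both inputs are day-by-day numeric series."""
    f_start = first_onset(factor)
    s_start = first_onset(symptom)
    if f_start is None or s_start is None:
        return "insufficient data"
    if s_start < f_start:
        return "symptom preceded factor"
    return "order consistent"
```

In the depression-before-medication example above, the symptom series would rise weeks before the factor series does, so the check would return the “symptom preceded factor” flag and the app could discount or annotate that correlation accordingly.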

So yeah… sorry I don’t have better suggestions for how to prevent these, but I figured I’d point out the reasons I tend to find correlations unhelpful in case any of them could be addressed statistically. Obviously I don’t think it’s fair to rule any of these out completely, since they may indicate real relationships in certain circumstances, but I think they should at least be marked or mitigated in some way proportionate to the span of time between the start of the variables, the amount of normal fluctuation in each, the degree of sample size mismatch, etc.
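For the sample-size mismatch and fluctuation points (items 3 and 4), one very rough way to “mark or mitigate proportionately” might look like the Python sketch below. All names and the specific formulas are made-up illustrations, not the exact math the developers should use:

```python
import statistics

def mismatch_penalty(factor_days, symptom_change_days):
    """Penalty in [0, 1] that shrinks toward 0 when a rare factor is
    compared against a frequently changing symptom (or vice versa).
    1.0 means the occurrence counts are balanced."""
    lo, hi = sorted([factor_days, symptom_change_days])
    return lo / hi if hi else 0.0

def unusualness(symptom, window_start, window_end):
    """How unusual the symptom's level during a factor's window is,
    measured as a z-score against the symptom's own normal
    day-to-day fluctuation outside that window."""
    outside = symptom[:window_start] + symptom[window_end:]
    inside = symptom[window_start:window_end]
    spread = statistics.stdev(outside)
    if spread == 0:
        return 0.0
    return abs(statistics.mean(inside) - statistics.mean(outside)) / spread
```

With something like this, a one-week antibiotic course compared against months of constantly fluctuating depression would get a small `mismatch_penalty`, and a symptom shift during the antibiotic window that stays within the symptom’s normal ups and downs would get a low `unusualness` score, so the correlation could be pushed down the list or flagged instead of shown at the top.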

2 votes

Tagged as Suggestion

Suggested 13 December 2022 by user Anna Sedlacek