We want the theorycrafting we encounter to be correct, but is that all we should want? This post explores another virtue of good theorycrafting, trustworthiness.
It is hard to put faith in a result, no matter how precise or appealing, if we have no idea how a person arrived at it. We want theorycrafters to describe their methods so that we can have some confidence that their results are reasonably obtained rather than the product of chance or mistake. Real-world scientists have the same expectation in their fields and, quite understandably, have put a lot of time into articulating what makes a result trustworthy. They use the following terms:
Valid
This is a rather expansive concept for a five-letter word. Broadly speaking, validity concerns how well a test has been constructed. It can be broken down into internal validity, which regards whether the proposed cause is really leading to the proposed effect, and external validity, which regards whether or not the test is representative of the phenomenon in question. There are also the ideas of construct validity and content validity, which relate to whether the test is measuring what the tester thinks it is measuring.
What the concern of validity amounts to, then, is the desire to have a test be in all ways a good and reasonable test of the phenomenon in question. If a test fails on just one dimension of validity, the whole thing can be ruined. For example, even if you otherwise perform the absolute best target dummy testing possible but fail on construct validity by comparing a Careful Aim build with a non-CA build, the whole thing may have to be thrown out. (Careful Aim is problematic for most testing on target dummies because, owing to the dummies always being over 90% health, it skews results in favor of the CA build.)
Because validity is such a huge concern, good theorycrafting should describe the methods and math used in enough detail that the reader is assured that the test is appropriate and properly executed. This can require a long description for some tests, but these days we have some significant shortcuts at our disposal. For example, linking to a femaledwarf profile can remove the need for a lot of the descriptions of settings. Bloggers and forum regulars also have the advantage that they can link to previous posts explaining their methods if they are repeating an old test for a new tier.
Reliable
We want a result to persist through repeated testing. Put another way, we want a result to be generally true rather than true of only one run of the test where some fluke or trick of RNG produced an outlier. This is why good theorycrafting often involves many iterations on a simulator (such as SimulationCraft) or extended periods of time on a target dummy; the goal is to get any spikes or troughs in results to average out over lengthy testing. Thirty seconds on a target dummy is no way to test something, considering how far your results may stray from the “true” average due to crits and procs.
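To make the point about test length concrete, here is a minimal sketch in Python. It is a toy model and not a simulation of any real class or of SimulationCraft: it assumes one shot per second, a made-up base hit of 1000 damage, a 30% crit chance and double damage on crits. All of those numbers, and the function names, are invented for illustration. It repeats a test of a given length many times and measures how much the resulting DPS figures disagree with each other.

```python
import random
import statistics

def simulate_dps(duration_s, rng, hit=1000.0, crit_chance=0.3, crit_mult=2.0):
    """Toy model (hypothetical numbers): one shot per second, each shot crits independently."""
    total = 0.0
    for _ in range(duration_s):
        total += hit * crit_mult if rng.random() < crit_chance else hit
    return total / duration_s

def spread_of_tests(duration_s, trials=200, seed=1):
    """Standard deviation of measured DPS across repeated tests of the same length."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = [simulate_dps(duration_s, rng) for _ in range(trials)]
    return statistics.stdev(results)

short_spread = spread_of_tests(30)    # thirty-second tests
long_spread = spread_of_tests(1800)   # thirty-minute tests
print(f"30 s tests disagree by about {short_spread:.0f} DPS")
print(f"30 min tests disagree by about {long_spread:.0f} DPS")
```

In this toy model the spread shrinks roughly with the square root of the test length, so a test sixty times longer should be a little under eight times more consistent. Real rotations with procs and cooldowns will behave less neatly, but the direction of the effect is the same.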
Replicable
This is the idea that another tester should be able to reproduce your results. Replicating an experiment gives scientists confidence that a result was not the product of a particular tester’s idiosyncrasies and is instead valid in a general sense. Independent attempts at replication helped overturn a widely reported 2011 measurement that seemed to show neutrinos traveling faster than light, and a lack of replicability was likewise the downfall of a famous result in the search for cold fusion.
Replication additionally allows for conversation and advancement within a scientific or theorycrafting community. A test being replicable allows testers to repeat and build off each other’s work, achieving better results than a single person working alone would be able to attain.
Together, these virtues of good testing combine toward this conclusion: good theorycrafting involves explaining your work well enough that other people can have faith in it as being generally true of the phenomenon in question and can replicate the test if they want to.
Unfortunately, thorough explanations of methods and math are not as common as we might like in the WoW community. It takes time to write them out, it makes for a potentially boring TL;DR (too long, didn’t read) wall of text in a post, it detracts from the flow of a post, and it opens up the tester to scrutiny that he or she might not want to be under. Frostheim is a prominent example of someone who does not overtly show his work. I suspect, given his audience of hunters who probably just want quick answers, that he does not want to bog down his guides and posts with long explanations of methods. It also certainly saves him significant amounts of time, given the number of guides he maintains. But it also deprives the hunter community of some confidence in his results; we have to take him on faith rather than being able to affirm that his methods are sound. It also precludes the community-improving dialogues that might result from explicit methods and the exposure that many hunters might want for starting their own theorycrafting.
A partial solution to the inelegance of explanations of methods and math, at least for bloggers, is the use of appendixes. Testing explanations could be added to the end of a post to offer the details to the curious without getting in the way of the main post’s message.
In advocating for transparency in theorycrafting I am not trying to lecture from on high. I am not a perfect theorycrafter by any means, and I make mistakes with a depressing regularity. Indeed, all theorycrafters make mistakes. That is why theorycrafters explaining themselves with an eye to validity, reliability and replicability is so important. We are all human and prone to error, and it is only through an enterprise of transparency and communal scrutiny that we will achieve the best results.
Thanks for another interesting article.
One other comment on validity is that results often appear to have spurious accuracy. So a result will be exactly, say, 37521. Not 37520 or 37522.
This isn’t really correct given DPS isn’t an absolute value but a collection of values subject to variation due to rng and situational circumstances.
Ideally the results of theorycrafting should be a mean DPS number plus some indications of the range of values around this mean. This range may well vary, for example a build that is highly dependent on crits may have a wider range than one that is not.
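As a rough illustration of reporting a mean with a range instead of a single figure, here is a small Python sketch. The per-run DPS numbers are made up for the example, the function name is hypothetical, and the 95% interval uses a plain normal approximation, which is an assumption that gets shakier with only a handful of runs.

```python
import math
import statistics

def summarize(dps_samples):
    """Return the mean, the sample standard deviation, and an approximate 95% interval for the mean."""
    n = len(dps_samples)
    mean = statistics.mean(dps_samples)
    stdev = statistics.stdev(dps_samples)
    # Normal approximation; with few runs a t-based multiplier would give a wider interval.
    margin = 1.96 * stdev / math.sqrt(n)
    return mean, stdev, (mean - margin, mean + margin)

# Made-up per-run DPS results, for illustration only.
runs = [37100, 37900, 37400, 38200, 36900, 37600, 37800, 37300]
mean, stdev, (low, high) = summarize(runs)
print(f"{mean:.0f} DPS, varying by about {stdev:.0f} per run "
      f"(95% interval for the mean: {low:.0f} to {high:.0f})")
```

Reported this way, a crit-dependent build would show a visibly larger per-run variation than a steadier one, even if the two means were close.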
Some of the things you mention are the same reasons I don’t use femaledwarf. The spreadsheet model is by definition invalid because of the use of averaging and assumptions to make it run quickly. The source is not available so I can’t verify its methods. I wish theorycrafters would focus their efforts on simulationcraft instead, as it provides transparency, an authentic model of the game, and many more tools such as the calculation of error ranges so you can tell whether your results are statistically significant when making comparisons, plots of how stat values change over a range, and extremely detailed reports that make it easier to analyze what’s going on.
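To show what an error range buys you when comparing two builds, here is a minimal Python sketch. It is not how SimulationCraft computes its figures; it is a generic check under a normal approximation, with hypothetical function names and made-up sample data. It asks whether the gap between two mean DPS values is bigger than the noise in the runs can explain.

```python
import math
import statistics

def difference_with_error(build_a, build_b):
    """Difference in mean DPS (a minus b) and the standard error of that difference."""
    diff = statistics.mean(build_a) - statistics.mean(build_b)
    se = math.sqrt(statistics.variance(build_a) / len(build_a)
                   + statistics.variance(build_b) / len(build_b))
    return diff, se

def looks_significant(build_a, build_b, z=1.96):
    """True if the gap exceeds roughly a 95% noise band (normal approximation, an assumption)."""
    diff, se = difference_with_error(build_a, build_b)
    return abs(diff) > z * se

# Made-up per-run DPS results for two hypothetical builds.
build_a = [37100, 37900, 37400, 38200, 36900, 37600]
build_b = [37300, 37700, 37000, 38000, 37500, 37200]
diff, se = difference_with_error(build_a, build_b)
print(f"difference: {diff:.0f} DPS, standard error: {se:.0f}")
print("distinguishable" if looks_significant(build_a, build_b) else "within the noise")
```

With samples like these, the build with the higher average cannot honestly be called better; more runs would be needed to separate the two.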
sadly, this all comes to an end when MoP comes along… 15 lousy choices and all pet talents gone… merely a small matrix of equations to deal with… matrix algebra… maybe some min/max optimization… but nothing as complicated as today…. *sigh*
I’ve experimented with a couple of different ways of presenting information, and by leaps and bounds the TL;DR factor smacks everything else in the head. The *vast* majority of people looking for guides online do not want to read the theorycrafting explanation (which is usually many times longer than the advice). At this point you have conflicting pulls: the good science part wants to explain the process, but the good blogger part wants to provide the information in a format the readers want.
Ultimately the bloggers that follow the vastly more popular option will be the only ones with an audience (which is every hunter blogger I know — I’ve never read a rotation or talent guide that explained the theorycraft for every choice, ever).
That said, I occasionally show the math or science either as a glimpse into how theorycrafting works, or when refuting a particularly popular sentiment. And I maintain the policy that if someone disagrees and shows their math, then I’ll break out my math and we can discuss it, which has indeed led to proving I had overlooked something significant and made me revise things.
So while I occasionally exchange dull, detailed theorycrafting emails with Zeherah (especially during beta) where we discuss technique and mathematical formulas, doing that on my blog would just drive people to look elsewhere for their answers. The appendix idea is interesting, but also problematic (don’t want another 3k words separating your post from the comments, don’t want to clutter your RSS feed with stuff 98% of readers don’t want), not to mention you’re then spending 3 times as much time on writing stuff most people don’t want to and won’t read.
I think overall the problem isn’t an issue of what’s the best science, so much as a problem of what ultimately makes for a good site.
I agree! With both of you.
As someone with the tendency to fully detail where my theorycrafting results originate and the method in which they are determined, I am all for transparent and full disclosure. Hence, my MM guide on EJ is a giant catalog. At the same time, I am also aware of how the message can get lost in the words and have gotten good feedback on that.
That is why I am now the proponent of a guide having two “views”. One is the actual guide, detailing the suggestions and various options along with some high-level explanation and reasoning. That way the reader can get what they need relatively quickly.
The guide can then have links to other documents that provide the detailed logic and process for determining the suggestions for those who are interested in seeing the theorycrafting. Updates to the MM guide have been trying to do that a little.
Next time around, if I write the guide on EJ again, I will follow this new approach, limiting the guide to just the suggestions and high-level rationale, with links to the theorycrafting and other details for folks who are interested.
The ability to back up claims with objective, valid evidence is what science is all about — it’s not so important whether the evidence is presented in an appendix or via request. (The format and the intended audience often dictate which method is preferable.)
Documenting all the methodology and justification is a ton of work, so all of us who recognize that definitely appreciate the diligence of Frostheim, Whitefyst, and others who maintain it. :)