What is A/B Testing? – Bayesian Approach vs Null Hypothesis


  • General lifecycle of the development journey
  • Basic approach of A/B testing
    • Also A/B/C testing…
  • What can be tested? What is noise?
  • What are the null hypothesis and the p-value?
  • What are the Bayesian approach and the prior?
  • Player and time independence assumptions

As game developers, we usually try to publish our games as fast as possible. First comes a development phase that ends with a beta version, also called the “Soft Launch”. We then test the beta and continue development until we reach the “Launch” title. After that, the development journey continues as a service: dev + test, release; dev + test, release; …

            This development journey is made up of our good and bad design choices, which we can only evaluate through several proxy metrics. Whatever design choices we make, our efforts should bring in enough (or hopefully more than enough) acquisition.

            Mr. Steven Collins from Swrve highlights one important point: a large share of our revenue, about 50% of it, comes in the first 30 days after installation. Therefore, if we want good revenue, we need to make good design choices, iterate, and “test”.

            The general lifecycle of a game is built on three main steps: Understand, Test Hypotheses, Take Action; then repeat.

Understand: We use metrics to analyze and understand our users’ behavior.

Test Hypotheses: We then test our reasonable hypotheses.

Take Action: Finally, we decide whether to adopt or drop each hypothesis.

            A/B testing follows a similar approach:

  1. We split our players into different groups
  2. Each group plays a different version of our game
  3. We measure and analyze the test results
  4. If there is a clear difference, we pick the winning version
  5. Then we deploy the winning version to all of our players
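
The first step, splitting players into groups, can be sketched as follows. This is only a minimal illustration, not the method from the talk: the function name `assign_variant`, the experiment name, and the player IDs are hypothetical, and hashing the player ID is just one common way to get a stable, roughly even split.

```python
import hashlib


def assign_variant(player_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a player to one variant of an experiment.

    Hypothetical helper: the names and the hashing scheme are assumptions.
    """
    # Hash the experiment name together with the player id, so the same player
    # always gets the same variant within one experiment but can land in a
    # different group in another experiment.
    key = f"{experiment}:{player_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Example: split 10,000 made-up player ids for a made-up experiment.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"player-{i}", "tutorial_length")] += 1
print(counts)
```

Passing `variants=("A", "B", "C")` gives the A/B/C case with the same code.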

We have different servers: one is our actual game server and the other is for A/B testing. The A or B version is published to the users we want to test. This raises a big question: what can we test? The answer is simple: everything, even the positions of buttons in the UI. Let’s discuss one example of A/B testing:

You have a tutorial at the beginning of a new game. This tutorial has 10 stages; however, people tend to quit the tutorial at stage 5. This should give you a clue that something is wrong with the tutorial: maybe it is too hard, too boring, too long, etc. You can prepare two variations of this tutorial: one with the same content cut down to 5 stages, and a second with easier parts and 10 stages. Then you analyze the results again to try to understand the problem.

The game economy is a topic of its own, but you can use A/B testing in the in-game store too. You can offer the same package at the same final price (say $5) but display it with different discounts (say A: a 50% discount that ends at $5; B: a 20% discount that ends at $5). Some discount rates can look like bait and put players off. You can also test different package prices (A: 100 gems for $5; B: 150 gems for $5) to see their effects. To avoid breaking fairness between players, you can test offers with the same unit price and see how differently presented prices affect players’ decisions (A: 100 gems for $5; B: 150 gems for $7.50).

Figure 1: VIPs repeat their purchases at $70. Graph from [1]

            The bottom part of the graph shows first purchases. You can see that some players keep buying the $70 packages. With A/B testing you can change a $15 price to $16 and $14, then see how your users react to those changes. Do not forget that even such small changes can have a huge impact on players’ choices.

            Let’s discuss call-to-action choices. The biggest mistake can be asking players whether they want to rate the game immediately after installation. “Do you want to share this screenshot on Twitter?” and “Do you want to watch a video to double your reward?” are just two examples of actions that can be tested. The key point: never assume you already have the best idea about where to put a call-to-action; you only have several opinions about it. Try them all to see when and where each works best for your users.

            Noise and faulty analysis are your two enemies in A/B testing. You cannot say version A is better than version B after two days; a week later, B may have dramatically overtaken A. You need a mathematical model to understand what your user data “actually” shows. What is hidden beneath the data? You need to work out when your data is showing accurate results.

            Conversion rate is another basic concept that needs to be considered. To explain it with an example: if 50% of your users come back and play your game on day 2, then your day 1 conversion rate is 50%. This can show how effective your “welcome” design is for your players. Let’s assume that you made a big change in an A/B test and your conversion rate for 70 players is now 10%. Is this bad? How should we approach this result?
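
Before turning to any formal test, it helps to see how uncertain a rate measured on 70 players is. The sketch below is an illustration only: it assumes 7 converted players (10% of 70) and uses the Wilson score interval, which is my choice of a standard interval for small samples and not something taken from the talk.

```python
from math import sqrt


def conversion_rate(converted: int, total: int) -> float:
    """Share of players who converted (e.g. came back on day 2)."""
    return converted / total


def wilson_interval(converted: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly 95% coverage."""
    p = converted / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumed example: 7 of 70 players converted (the 10% from the text).
low, high = wilson_interval(7, 70)
print(f"rate = {conversion_rate(7, 70):.1%}, interval = ({low:.1%}, {high:.1%})")
```

With so few players the interval is wide (roughly 5% to 19% here), which is why a single measured rate should not be read as the “true” conversion rate.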

            Null Hypothesis Testing: The null hypothesis assumes that the things being compared have no effect on each other. In other words, a modification to A has no consequence for B. Here is a good figure that illustrates the definition:

Figure 2: Figure from [2]

According to the null hypothesis, there is no relationship between the day 1 conversion rate and the day 2 conversion rate. Our job is to challenge this. At the start, we do not have enough information to reject it, so we keep it. If the result is far off the chart, like 0%, we can start to think the hypothesis is wrong.

Figure 3: Figure from [3]

            But what does this 30% conversion rate tell us? It only tells us that the null hypothesis was not sustainable; nothing more. The dark orange parts of the graph are the extreme regions of the distribution. If our conversion rate falls in these areas, we can reject the hypothesis. The p-value is the probability of seeing a result at least this extreme if the null hypothesis is true, so it is a number between 0 and 1. If it is less than 0.05, we treat the change as significant. In other words, under the null hypothesis the probability of seeing such a result is less than 5%. As you can see, p-values are hard for game designers to reason about intuitively; what we really need is something that tells us whether our design choice is good.
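
To make the p-value concrete, here is a small sketch using only the standard library. It assumes a 30% baseline conversion rate as the null hypothesis and 7 conversions out of 70 players as the observation; both numbers are my reading of the running example, and the exact binomial test is one reasonable choice of test, not necessarily the one used in the talk.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k conversions among n players at rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(successes: int, n: int, p0: float) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one, assuming the null hypothesis (true rate = p0) holds.
    """
    observed = binom_pmf(successes, n, p0)
    # Small relative tolerance so ties are not lost to floating-point error.
    threshold = observed * (1 + 1e-9)
    return sum(
        pk for pk in (binom_pmf(k, n, p0) for k in range(n + 1)) if pk <= threshold
    )


# Assumed example: null hypothesis says 30%, we observed 7 of 70 (10%).
p_value = two_sided_p_value(7, 70, 0.30)
print(f"p-value = {p_value:.6f}, reject at 5%: {p_value < 0.05}")
```

Note what the number answers: how surprising the data would be if nothing had changed. It does not say how likely it is that the new design is better, which is the question a designer actually cares about.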

            Another problem with this approach is noise: while using this model, as soon as the p-value drops below 5% you may say “yes, this is what I wanted”. But it may be nothing more than noise: a false positive.

            There is also the family-wise error problem: the more treatments we test, the higher the chance of getting at least one false positive.
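
The growth of the family-wise error rate can be shown with a few lines of arithmetic. This sketch assumes the tests are independent and each uses the same threshold; the Bonferroni correction at the end is a common remedy that I am adding for illustration, it is not mentioned in the talk.

```python
def family_wise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Stricter per-test threshold that keeps the family-wise rate near alpha."""
    return alpha / num_tests


for m in (1, 3, 10, 20):
    fwer = family_wise_error_rate(0.05, m)
    print(f"{m:>2} treatments: family-wise error = {fwer:.3f}, "
          f"Bonferroni threshold = {bonferroni_alpha(0.05, m):.4f}")
```

With 10 treatments at a 5% threshold, the chance of at least one false positive is already about 40%.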

            A better approach to A/B testing is the Bayesian approach. In the earlier example, we can treat the 30% conversion rate as our “retention” rate. We know that, without any changes, we can reach this rate, so we build our test on top of this knowledge. The Bayesian approach gives us the “probability” of a model given the data.

Figure 4: Tossing a coin. Figure from [4]

Let’s try to clarify this approach with a classic example: tossing a coin. A toss has two possible outcomes, heads and tails, each with probability 0.5. In the Bayesian approach, even for this basic example, we have noise at the beginning of the experiment, as you can see in Figure 4. Collecting more observations is the essential step for getting as close to the true value as possible.
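
The coin example can be reproduced with a short simulation. This is a sketch under my own assumptions: a fair coin, a uniform Beta(1, 1) starting belief, and a fixed random seed so the run is repeatable. It shows the estimate being noisy early on and tightening as tosses accumulate.

```python
import random


def running_estimates(num_tosses: int, seed: int = 0, true_p: float = 0.5):
    """Toss a simulated coin and track the Bayesian estimate of P(heads).

    Returns a list of (tosses_so_far, posterior_mean, posterior_std).
    Assumes a uniform Beta(1, 1) prior, which is an illustrative choice.
    """
    rng = random.Random(seed)
    heads = 0
    history = []
    for n in range(1, num_tosses + 1):
        heads += rng.random() < true_p
        # Posterior after n tosses is Beta(1 + heads, 1 + tails).
        a, b = 1 + heads, 1 + (n - heads)
        mean = a / (a + b)
        std = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
        history.append((n, mean, std))
    return history


history = running_estimates(10_000)
for n in (10, 100, 1_000, 10_000):
    _, mean, std = history[n - 1]
    print(f"after {n:>6} tosses: estimate = {mean:.3f} +/- {std:.3f}")
```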

            To be clear, we want the probability of the model given the data, not the probability of the data given the model. In other words, we want to be able to say “clearly” that our conversion rate is 30%. Mr. Collins gives a valuable example to make this point:

Figure 5: Figure from [5]

            The probability that it is cloudy given that it is raining is not the same as the probability that it is raining given that it is cloudy. The two are clearly different, and we want to find the right-hand side from the given left-hand side.
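
Bayes’ theorem is what lets us go from one of these probabilities to the other. The numbers below are invented purely for illustration (they are not from the talk); the point is only how different the two conditional probabilities can be.

```python
def bayes_posterior(p_data_given_model: float, p_model: float, p_data: float) -> float:
    """Bayes' theorem: P(model | data) = P(data | model) * P(model) / P(data)."""
    return p_data_given_model * p_model / p_data


# Made-up weather numbers, for illustration only.
p_cloudy_given_rain = 0.95  # when it rains, it is almost always cloudy
p_rain = 0.10               # it rains on 10% of days
p_cloudy = 0.40             # it is cloudy on 40% of days

p_rain_given_cloudy = bayes_posterior(p_cloudy_given_rain, p_rain, p_cloudy)
print(f"P(cloudy | rain) = {p_cloudy_given_rain:.2f}")
print(f"P(rain | cloudy) = {p_rain_given_cloudy:.4f}")
```

In A/B testing terms, “rain” plays the role of the model (for example, “the conversion rate is 30%”) and “cloudy” plays the role of the observed data.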

Figure 6: Figure from [6]

            We continue with more iterations, and every iteration gives us a better result. We repeat this until we reach “actionable” certainty. Every repetition brings more data and therefore more certainty.

            Setting the “prior” is another step to consider carefully. The prior is our starting point: what we believe, and how confident we are in it, before seeing any data. As the experiment goes on, our estimate moves closer to or further from the prior, which shows how accurate our initial setup was.
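
The effect of the prior can be sketched with a Beta distribution, a common way to model a conversion rate (my choice here, not stated in the source). The priors Beta(30, 70) and Beta(3, 7) are assumptions: both encode a belief of 30%, but the first is as confident as if we had already seen 100 players, the second only 10. The data, 7 conversions from 70 players, follows the running example.

```python
def update_beta(prior_a: float, prior_b: float, conversions: int, players: int):
    """Combine a Beta(prior_a, prior_b) prior with observed conversions."""
    return prior_a + conversions, prior_b + (players - conversions)


def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Two assumed priors, both centered on a 30% conversion rate.
priors = {"confident prior Beta(30, 70)": (30, 70), "weak prior Beta(3, 7)": (3, 7)}

for label, (a, b) in priors.items():
    post_a, post_b = update_beta(a, b, conversions=7, players=70)
    print(f"{label}: prior mean = {beta_mean(a, b):.3f}, "
          f"posterior mean = {beta_mean(post_a, post_b):.3f}")
```

The confident prior pulls the estimate toward 30% even after the data comes in, while the weak prior lets the observed 10% dominate. How far the estimate drifts from the prior is one signal of how good the initial setup was.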

            There is also a more advanced version of the A/B test, run with three versions of the game (A/B/C testing). In this case we get a volumetric graph, because we are dealing with three dimensions.

            Mr. Collins points out some general assumptions that we need to keep in mind:

  • Users are independent of each other
    • This assumption does not always hold, especially if you are running a test for multiplayer, team-based games.
  • The probability of conversion does not depend on time
    • This is not entirely accurate either; 8 o’clock in the morning will give you very different results from 1 PM on a Sunday.

Some benefits and features of the Bayesian approach are:

  • You can observe the results continuously on your graph.
  • The population size does not have to be fixed during the test.
  • We have the valuable notion of a prior, which lets us rely on previous experience and/or knowledge.
  • We get an actual probability for our hypothesis.
  • We can take the magnitude of the difference into account.
  • The approach can be adapted to many different situations.
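
The “probability” and “magnitude of difference” benefits can be sketched with a small Monte Carlo simulation. Everything specific here is an assumption for illustration: the uniform Beta(1, 1) priors, the player counts, the function name `compare_variants`, and the fixed seed. It is one simple way to do a Bayesian comparison, not necessarily the model used in the talk.

```python
import random


def compare_variants(conv_a: int, n_a: int, conv_b: int, n_b: int,
                     draws: int = 20_000, seed: int = 0):
    """Estimate P(rate_B > rate_A) and the average difference rate_B - rate_A.

    Samples both conversion rates from their Beta posteriors, assuming a
    uniform Beta(1, 1) prior for each variant.
    """
    rng = random.Random(seed)
    wins = 0
    diff_sum = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
        diff_sum += rate_b - rate_a
    return wins / draws, diff_sum / draws


# Made-up test results: A converts 30 of 100 players, B converts 45 of 100.
prob_b_better, expected_diff = compare_variants(30, 100, 45, 100)
print(f"P(B beats A) = {prob_b_better:.3f}, expected difference = {expected_diff:+.3f}")
```

The output reads directly as “how likely is B to be better, and by how much”, and the function can be re-run at any time as more players arrive, without fixing the population size in advance.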


[1], [3], [4], [5], [6]: from the video at https://www.youtube.com/watch?v=-OfmPhYXrxY&feature=youtu.be

[2]: https://www.thoughtco.com/thmb/ayMTs7HtvLoJHeWqqN7C7a-l9Oo=/1333×1000/smart/filters:no_upscale()/null-hypothesis-examples-609097_FINAL-100262e70b70426fb0633304eb2f49f4.png
