This is the group project of SDSC1001 – Introduction to Data Science. I did the project in my year 1 2020/21 Semester B.
Presentation Slides:
Course Instructor: Dr. Xinyue LI
Abstract: People have spent more time at home during the Covid-19 pandemic, and playing video games has become a common form of entertainment. As a result, the number of players has increased significantly. This paper investigates the correlation between selected factors and game hit rate using linear regression, and reviews machine learning algorithms for predicting game sales and success. It is hoped that these factors and algorithms can serve as a reference for predicting future game sales and success.
1. Background, Problems and Motivation
1.1 Background
During the Covid-19 pandemic, many countries set up disease prevention measures to limit virus transmission. For example, the United Kingdom imposed a coronavirus lockdown. People have had more time at home because of government policy and the health situation. Under these circumstances, more people spend time playing video games, and gaming companies are willing to put resources into introducing new games. Some researchers state that the number of concurrent active users exceeded 20 million during the pandemic, the highest on record for Steam, one of the video game distributors. As a result, the number of video game players has increased dramatically. King et al. (2020) also indicate that the pandemic has encouraged participation in gaming. The pandemic therefore seems to be a significant factor affecting the video game hit rate; however, other factors might affect it as well. Thus, this paper aims to examine whether such factors affect the video game hit rate, and further investigates which machine learning algorithm is appropriate for predicting video game sales and game success.
1.2 Definition
Game hit rate
- Total number of players
- It means the number of players starting from the first day of opening the game to now.
- Average concurrent players
- It means the average number of online players in a specific time.
- Peak concurrent players
- It means the highest number of online players in a specific time.
- Daily/Weekly/Monthly active players
- It means the number of players who logged in to the game at least once in a day, a week or a month.
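The hit-rate measures above can be made concrete with a small sketch. The function names, the sample counts and the login log are all hypothetical and only illustrate how each measure would be computed from raw observations.

```python
from statistics import mean


def concurrency_metrics(samples):
    """Return (average, peak) concurrent players for a list of online-player counts."""
    if not samples:
        raise ValueError("samples must not be empty")
    return mean(samples), max(samples)


def active_players(login_events, days):
    """Count distinct players who logged in at least once within the given days.

    login_events is an iterable of (player_id, day) pairs.
    """
    return len({player for player, day in login_events if day in days})


# Hypothetical online-player counts sampled during one period.
hourly_counts = [1200, 1500, 1800, 900]
average, peak = concurrency_metrics(hourly_counts)

# Hypothetical login log: (player id, day number).
logins = [("a", 1), ("b", 1), ("a", 2), ("c", 8)]
daily_active = active_players(logins, {1})                 # players seen on day 1
weekly_active = active_players(logins, set(range(1, 8)))   # players seen on days 1 to 7
```

A player who logs in several times in the same window is counted once, which is why a set of player ids is used.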
Video game
- Electronic games in which players control images on a video screen
Linear regression
- A method that models the relationship between a scalar response variable and one or more explanatory variables.
Y = aX + b ...
Y is the response (result) variable
X is the predictor variable
a and b are coefficients (a is the slope and b is the intercept)
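As a minimal sketch of how the coefficients in Y = aX + b can be estimated, the following uses ordinary least squares for a single predictor. The function name fit_line and the sample points are assumptions for illustration; the project itself ran its regressions in Excel.

```python
def fit_line(xs, ys):
    """Estimate slope a and intercept b of Y = aX + b by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of X, and sum of cross-deviations of X and Y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all X values are identical, slope is undefined")
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


# Hypothetical points that lie exactly on Y = 2X + 1.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```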
Support vector regression
- A machine learning algorithm derived from the Support Vector Machine, which was originally used for classification. It fits a line together with two boundary lines at a distance e on either side of it, and uses the fitted line to forecast a continuous variable.
Y = aX + b ...
Decision boundaries:
Y - (aX + b) = -e
Y - (aX + b) = +e
Model satisfaction:
-e < Y - (aX + b) < +e
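The model satisfaction condition can be checked point by point. The sketch below reads the condition as |Y - (aX + b)| < e, that is, the residual of a point must lie inside the band of width e around the fitted line. The function names, coefficients and points are hypothetical.

```python
def within_tube(x, y, a, b, eps):
    """True if the point (x, y) lies strictly inside the band around Y = aX + b."""
    return abs(y - (a * x + b)) < eps


def epsilon_insensitive_loss(x, y, a, b, eps):
    """Penalty used by support vector regression: zero inside the band, linear outside."""
    return max(0.0, abs(y - (a * x + b)) - eps)


# Hypothetical line Y = 2X + 1 with band half-width e = 0.5.
a, b, eps = 2.0, 1.0, 0.5
inside = within_tube(1.0, 3.2, a, b, eps)    # residual about 0.2, inside the band
outside = within_tube(2.0, 6.0, a, b, eps)   # residual 1.0, outside the band
loss_outside = epsilon_insensitive_loss(2.0, 6.0, a, b, eps)
```

Points inside the band contribute no loss, so only points outside it influence the fitted line.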
Random forest
- A supervised machine learning algorithm that builds a forest of several randomized decision trees. Each tree can be a classification or regression tree, and the forest combines the outputs of its trees into a class or a prediction.
Precondition: A training set S := (x1, y1), ..., (xn, yn), features F, and number of trees in forest B.
function RandomForest(S, F)
    H ← ∅
    for i ∈ 1, ..., B do
        S(i) ← A bootstrap sample from S
        h_i ← RandomizedTreeLearn(S(i), F)
        H ← H ∪ {h_i}
    end for
    return H
end function
function RandomizedTreeLearn(S, F)
    At each node:
        f ← very small subset of F
        Split on best feature in f
    return The learned tree
end function
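The pseudocode above can be sketched in plain Python for the regression case. This is a simplified illustration, not the implementation used in the cited studies: the depth limit, the squared-error split criterion, the averaging of tree outputs and all names and data are assumptions.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets, features):
    """Return the (feature, threshold) pair with the lowest squared error, or None."""
    best, best_err = None, float("inf")
    for f in features:
        # Every observed value except the largest is a candidate threshold,
        # so both sides of the split are always non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            err = sse(left) + sse(right)
            if err < best_err:
                best, best_err = (f, t), err
    return best


def randomized_tree_learn(rows, targets, subset_size, rng, max_depth=3):
    """Grow one tree; at each node only a random subset of features is considered."""
    if max_depth == 0 or len(set(targets)) == 1:
        return mean(targets)  # leaf: predict the mean target
    features = rng.sample(range(len(rows[0])), subset_size)
    split = best_split(rows, targets, features)
    if split is None:
        return mean(targets)
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    grow = lambda idx: randomized_tree_learn(
        [rows[i] for i in idx], [targets[i] for i in idx], subset_size, rng, max_depth - 1)
    return (f, t, grow(left), grow(right))


def predict_tree(tree, row):
    """Follow the splits until a leaf value is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def random_forest(rows, targets, n_trees, subset_size, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the training set."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(randomized_tree_learn(
            [rows[i] for i in idx], [targets[i] for i in idx], subset_size, rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])


# Hypothetical data: two informative features and one constant feature.
rows = [[x, 2 * x, 1] for x in range(10)]
targets = [10 if x < 5 else 20 for x in range(10)]
forest = random_forest(rows, targets, n_trees=25, subset_size=2)
low = predict_forest(forest, [0, 0, 1])
high = predict_forest(forest, [9, 18, 1])
```

For classification, the leaves would hold class labels and the forest would take a majority vote instead of an average.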
Decision tree
- A supervised machine learning algorithm that predicts a target variable. It is used in classification and prediction problems and works mainly through “if-then-else” rules.
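To illustrate the “if-then-else” structure, the sketch below stores a tiny hand-written tree as nested tuples and walks it for one sample. The feature name, the thresholds and the labels are hypothetical and are not learned from any dataset in this paper.

```python
# Each internal node is (feature, threshold, subtree_if_less_or_equal, subtree_if_greater).
# A leaf is simply a label. The thresholds and labels are illustrative only.
PRICE_TREE = ("price", 0, "free-to-play", ("price", 40, "budget", "premium"))


def classify(tree, sample):
    """Walk the tree: if the feature value is <= threshold go left, else go right."""
    while isinstance(tree, tuple):
        feature, threshold, if_le, if_gt = tree
        tree = if_le if sample[feature] <= threshold else if_gt
    return tree


label_free = classify(PRICE_TREE, {"price": 0})
label_budget = classify(PRICE_TREE, {"price": 10})
label_premium = classify(PRICE_TREE, {"price": 59})
```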
Internet popularity
- Number of people who accessed the Internet.
2. Objectives
- Correlation between price and number of players in a game
First, we assume that price is an important factor in the hit rate of a game. To validate this, we use linear regression to find the correlation between price and number of players. We test on Counter-Strike: Global Offensive (CS: GO) data because its distribution platform (Steam) provides a player-count API, so the public can easily access this information for investigation. We need two datasets: one contains the price history of CS: GO from August 2012 to March 2021, and the other is the player data provided by the Steam API. After downloading the two datasets, we had to clean the data to reduce noise and improve accuracy. We found some data loss in the price dataset: the prices from August 2012 to February 2013 were not recorded. In addition, the game became free to everybody in December 2018, so from then on the price is zero and no longer suitable for comparing price and number of players by linear regression. As a result, we can only compare the two datasets from March 2013 to November 2018. The linear regression was calculated in Excel. The full processed dataset can be found in the Appendix.
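The cleaning step described above can be sketched as a join of the two monthly datasets followed by two filters. The project did this work in Excel, so the function name, the "YYYY-MM" keys and the dictionary layout are assumptions; the 2018 values are taken from the Appendix, and the 2013-01 player count is a placeholder rather than real data.

```python
def build_regression_rows(players_by_month, price_by_month, free_from="2018-12"):
    """Join the two datasets on month and keep only months usable for regression.

    Months are "YYYY-MM" strings, so plain string comparison orders them correctly.
    """
    rows = []
    for month in sorted(players_by_month):
        price = price_by_month.get(month)
        if price is None:
            continue  # price was not recorded for this month
        if month >= free_from:
            continue  # game is free from this month on, price is always zero
        rows.append((month, price, players_by_month[month]))
    return rows


players = {
    "2013-01": 1000.0,      # placeholder value, not real data
    "2018-10": 325907.82,
    "2018-11": 310085.43,
    "2018-12": 395509.26,
}
prices = {"2018-10": 76, "2018-11": 99, "2018-12": 0}  # HKD; 2013-01 is missing

clean_rows = build_regression_rows(players, prices)
```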
- Correlation between internet popularity and video game sales
Our assumption is that video game sales are correlated with internet popularity. To verify this assumption, we used datasets of video game sales by country and of each country's internet popularity. These two datasets are provided by Zach. The video game sales dataset shows the game revenue estimated for 2020, and the internet popularity dataset shows the number of people in each country who accessed the internet using any device this year. These estimates are based on consumer research, transactional data, quarterly company reports, and census data. The revenues are based on consumer spending in each country and exclude hardware sales and tax. Our testing method is to use linear regression in Excel to test the correlation.
3. Main Results
- Correlation between price and number of players in a game
After the two datasets were combined and a linear regression was run in Excel, the output was as follows.
SUMMARY OUTPUT | |
Regression Statistics | |
Multiple R | 0.945345122 |
R Square | 0.864456354 |
Adjusted R Square | 0.874734625 |
Standard Error | 10051.16955 |
Observations | 69 |
As the table above shows, Multiple R is around 0.95, which indicates a strong linear relationship between the price and the number of players of the game. R Square is 0.86, which means that about 86% of the variance in the number of players is explained by the regression model.
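For reference, the statistics in an Excel regression summary can be reproduced with the standard formulas for a regression with one predictor. The sketch below is not the project's own calculation: the function name is an assumption, and the six (price, average players) pairs are only the June to November 2018 rows from the Appendix, so the resulting numbers differ from the 69-observation table.

```python
from math import sqrt


def regression_statistics(xs, ys):
    """Compute the summary statistics Excel reports for Y = aX + b (one predictor)."""
    n = len(xs)
    k = 1  # number of predictors
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # residual sum of squares
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total sum of squares
    r_square = max(0.0, 1 - sse / sst)
    return {
        "multiple_r": sqrt(r_square),
        "r_square": r_square,
        "adjusted_r_square": 1 - (1 - r_square) * (n - 1) / (n - k - 1),
        "standard_error": sqrt(sse / (n - k - 1)),
        "observations": n,
    }


# Price (HKD) and average players for June to November 2018, from the Appendix.
prices = [76, 76, 48, 32, 76, 99]
avg_players = [266862.24, 273307.26, 283531.31, 333163.99, 325907.82, 310085.43]
stats = regression_statistics(prices, avg_players)
```

Two identities follow from these formulas and are useful sanity checks when copying values out of a spreadsheet: Multiple R is the square root of R Square, and Adjusted R Square never exceeds R Square.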
- Correlation between internet popularity and video game sales
By combining the two datasets and running a linear regression analysis, we obtained the following results.
Equation | y=(0.000000051)x + 45.6 |
Multiple R | 0.844716 |
R square value | 0.713545 |
Adjusted R square | 0.677738 |
Standard Error | 0.0000000829 |
Observations | 20 |
We can see that the slope of the equation is positive and the R-square value is about 0.71. This means that internet popularity has a positive relationship with video game sales, and that 71% of the variance in video game sales can be explained by internet popularity.
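The fitted equation can be read as a simple prediction rule. The sketch below only evaluates the reported equation; the function name is hypothetical, and the units of x and y are those used in the original spreadsheet, which the text does not state, so the inputs here are illustrative.

```python
SLOPE = 0.000000051   # coefficient of x in the reported equation
INTERCEPT = 45.6      # constant term in the reported equation


def predicted_sales(internet_users):
    """Evaluate y = 0.000000051 * x + 45.6 for a given x."""
    return SLOPE * internet_users + INTERCEPT


baseline = predicted_sales(0)
# Effect of one million additional internet users on the predicted value.
increase_per_million = predicted_sales(1_000_000) - baseline
```

Because the slope is positive, every additional unit of x raises the predicted y, here by about 0.051 per one million units of x.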
- Accuracy of predictive algorithms for game sale
ML algorithm | Accuracy |
Random Forest | 0.9605 |
Support Vector Regression | 0.8154 |
Decision Tree | 0.8036 |
Linear Regression | 0.7534 |
Source: Keerthana, B. & Rao, K.V.(2019). Sales Prediction on Video Games Using Machine Learning. Journal of Emerging Technologies and Innovative Research, 6(6).
- Accuracy of predictive algorithms for game success (number of players)
ML algorithm | Accuracy |
Random Forest | 0.9750 |
Support Vector Regression | 0.9640 |
Source: Trněný, M. (2017). Machine Learning for Predicting Success of Video Games. Masaryk University Faculty of Informatics.
4. Interpretation, Comparison, Discussions
- Correlation between price and number of players in a game
From the result of the linear regression, it is clear that a correlation exists between the price and the number of players of a game. This supports our assumption that price is an important factor in the hit rate of a game. There are some reasons for this correlation. Higher-priced games are typically made by large studios that aim at the market for high-quality products, and their budgets allow them to invest in advertising so that a larger audience notices the game. As a consequence, while most games cost $10 or less, the more successful ones are generally found above $40 (Trněný, 2017) (see Figure 1).
On the other hand, the game that we used for testing (CS: GO) also shows a different phenomenon. After the game became free on 6 December 2018, the number of new players increased much faster than before. This shows that people are more willing to give a game a try when it is free. In conclusion, price has a significant effect on the hit rate of games, whether the price is high or low.
Figure 1
Source: Trněný, M. (2017). Machine Learning for Predicting Success of Video Games. Masaryk University Faculty of Informatics. https://is.muni.cz/th/k2c5b/diploma_thesis_trneny.pdf
- Correlation between internet popularity and video game sales
Countries with higher internet popularity tend to have higher video game sales. This can be explained by market size: higher internet popularity means that more people have suitable devices for playing video games and for receiving information about them. Also, the variety of opponents that players can face online gives them a fresh and unique challenge. These are the reasons for the positive correlation.
Optimal machine learning algorithm for predicting sale or success of games
- Sale prediction
Linear regression, support vector regression, random forest and decision tree models were constructed to predict game sales (Keerthana & Rao, 2019). Linear regression was the baseline model selected by the research team. Among the four models tested, the accuracy rates were 75%, 82%, 96% and 80% for linear regression, support vector regression, random forest and decision tree respectively. The optimal ML algorithm for predicting game sales is therefore the random forest model, with an accuracy of 96%.
- Success prediction
Support vector regression and random forest provide effective prediction models, with accuracies of 96% and 97% respectively (Trněný, 2017). One of the objectives of that research is to predict the average number of players in a game. Both algorithms worked for the model that predicts whether a game has over 100 players on average, a group that covered 33% of the dataset. Hence, the researcher believes that the models can give game producers and developers useful insight into their products.
5. Conclusions
The primary aim of this paper is to investigate whether the selected factors affect the video game hit rate. Based on the results, we draw the following conclusions:
- There is a strong relationship between game prices and number of players.
- The relationship between internet popularity and video game sales is positive.
The first finding could be explained by the mentality of the players. As Alha et al. (2014) state, people are more willing to try free-to-play games. For those players, high-priced and low-priced games might not differ much, since they could gain similar amusement even from a low-priced game. In this situation, low-priced games attract one part of the player base; however, other players are willing to pay for high-priced games because they believe that such games provide a high-quality gaming experience, for example a rich story. As such, there is a market for both high-priced and low-priced games.
The paper also aims to identify the optimal machine learning algorithm for predicting video game sales and game success. After investigation, the random forest model seems to be the most effective machine learning algorithm for both tasks, with an accuracy of 96% in predicting game sales and approximately 97% in predicting game success.
In conclusion, the price of video games and internet popularity are factors that affect the hit rate of games. A suitable machine learning algorithm is also essential for gaming companies to predict future game sales or success. It is hoped that this paper can serve as a reference for future prediction.
References
Alha, K., Koskinen, E., Paavilainen, J., Hamari, J., & Kinnunen, J. (2014). Free-to-Play Games: Professionals’ Perspective. Proceedings of DiGRA Nordic 2014.
Keerthana, B. & Rao, K.V. (2019). Sales Prediction on Video Games Using Machine Learning. Journal of Emerging Technologies and Innovative Research, 6(6). http://www.jetir.org/papers/JETIR1907H50.pdf
King, D. L., Delfabbro, P. H., Billieux, J., & Potenza, M. N. (2020). Problematic online gaming and the COVID-19 pandemic. Journal of Behavioral Addictions, 9(2), 184-186.
Şener, M., Yalcin, T., & Gulseven, O. (2021). The Impact of Covid-19 on the Video Game Industry.
Trněný, M. (2017). Machine Learning for Predicting Success of Video Games. Masaryk University Faculty of Informatics. https://is.muni.cz/th/k2c5b/diploma_thesis_trneny.pdf
Appendix
Price and number of players in CS: GO dataset
Month | Avg. Players | Gain | % Gain | Price (HKD) |
March 2021 | 740927.82 | -85.42 | -0.01% | 0 |
February 2021 | 741013.24 | -2196.42 | -0.30% | 0 |
January 2021 | 743209.66 | 25405.91 | 3.54% | 0 |
December 2020 | 717803.75 | 49049.17 | 7.33% | 0 |
November 2020 | 668754.58 | 55087.89 | 8.98% | 0 |
October 2020 | 613666.69 | 6816.37 | 1.12% | 0 |
September 2020 | 606850.32 | -33107.34 | -5.17% | 0 |
August 2020 | 639957.66 | 14056.85 | 2.25% | 0 |
July 2020 | 625900.81 | -45746.65 | -6.81% | 0 |
June 2020 | 671647.46 | -97147.79 | -12.64% | 0 |
May 2020 | 768795.25 | -88808.97 | -10.36% | 0 |
April 2020 | 857604.22 | 186570.94 | 27.80% | 0 |
March 2020 | 671033.29 | 127054.13 | 23.36% | 0 |
February 2020 | 543979.15 | 42783.15 | 8.54% | 0 |
January 2020 | 501196 | 44494.44 | 9.74% | 0 |
December 2019 | 456701.56 | 30620.76 | 7.19% | 0 |
November 2019 | 426080.81 | 17085.5 | 4.18% | 0 |
October 2019 | 408995.31 | -1930.29 | -0.47% | 0 |
September 2019 | 410925.6 | -4171.7 | -1.00% | 0 |
August 2019 | 415097.3 | 21314.48 | 5.41% | 0 |
July 2019 | 393782.83 | 4406.1 | 1.13% | 0 |
June 2019 | 389376.72 | 24959.42 | 6.85% | 0 |
May 2019 | 364417.31 | 12427.39 | 3.53% | 0 |
April 2019 | 351989.92 | -38250.24 | -9.80% | 0 |
March 2019 | 390240.16 | 18881.2 | 5.08% | 0 |
February 2019 | 371358.96 | -30007.91 | -7.48% | 0 |
January 2019 | 401366.87 | 5857.61 | 1.48% | 0 |
December 2018 | 395509.26 | 85423.83 | 27.55% | 0 |
November 2018 | 310085.43 | -15822.39 | -4.85% | 99 |
October 2018 | 325907.82 | -7256.17 | -2.18% | 76 |
September 2018 | 333163.99 | 49632.68 | 17.51% | 32 |
August 2018 | 283531.31 | 10224.05 | 3.74% | 48 |
July 2018 | 273307.26 | 6445.02 | 2.42% | 76 |
June 2018 | 266862.24 | 4691.36 | 1.79% | 76 |
May 2018 | 262170.88 | -26905.82 | -9.31% | 120 |
April 2018 | 289076.7 | -65193.64 | -18.40% | 120 |
March 2018 | 354270.33 | -28186.77 | -7.37% | 110 |
February 2018 | 382457.1 | 426.57 | 0.11% | 99 |
January 2018 | 382030.53 | 41153.65 | 12.07% | 49.5 |
December 2017 | 340876.88 | 19745.48 | 6.15% | 60 |
November 2017 | 321131.4 | -20729.86 | -6.06% | 120 |
October 2017 | 341861.26 | -12540.83 | -3.54% | 99 |
September 2017 | 354402.09 | -20023.6 | -5.35% | 99 |
August 2017 | 374425.69 | -3163.35 | -0.84% | 99 |
July 2017 | 377589.04 | 3201 | 0.85% | 76 |
June 2017 | 374388.04 | 2558.7 | 0.69% | 76 |
May 2017 | 371829.34 | -20369.85 | -5.19% | 99 |
April 2017 | 392199.19 | 5290.47 | 1.37% | 76 |
March 2017 | 386908.72 | -15476.99 | -3.85% | 99 |
February 2017 | 402385.71 | 9276.18 | 2.36% | 48 |
January 2017 | 393109.53 | 50913.83 | 14.88% | 38 |
December 2016 | 342195.7 | 13150.44 | 4.00% | 48 |
November 2016 | 329045.26 | -4031.2 | -1.21% | 99 |
October 2016 | 333076.46 | 10550.57 | 3.27% | 48 |
September 2016 | 322525.89 | -24703.36 | -7.11% | 120 |
August 2016 | 347229.25 | -6548.31 | -1.85% | 70 |
July 2016 | 353777.56 | 19466.5 | 5.82% | 49.5 |
June 2016 | 334311.06 | -4427.34 | -1.31% | 70 |
May 2016 | 338738.39 | -37057.47 | -9.86% | 120 |
April 2016 | 375795.87 | -3631.08 | -0.96% | 70 |
March 2016 | 379426.95 | 3141.92 | 0.83% | 66.33 |
February 2016 | 376285.02 | 10913.93 | 2.99% | 44.5 |
January 2016 | 365371.09 | -12076.02 | -3.20% | 99 |
December 2015 | 377447.11 | 16521.23 | 4.58% | 49.5 |
November 2015 | 360925.88 | -1840.21 | -0.51% | 62 |
October 2015 | 362766.09 | 6860.76 | 1.93% | 48 |
September 2015 | 355905.33 | -1629.91 | -0.46% | 62 |
August 2015 | 357535.24 | 28002.87 | 8.50% | 48 |
July 2015 | 329532.38 | -14623.63 | -4.25% | 99 |
June 2015 | 344156.01 | 26869.72 | 8.47% | 32 |
May 2015 | 317286.29 | 25537.55 | 8.75% | 32 |
April 2015 | 291748.74 | 23752.43 | 8.86% | 32 |
March 2015 | 267996.31 | 28061.68 | 11.70% | 32 |
February 2015 | 239934.64 | 5863.96 | 2.51% | 48 |
January 2015 | 234070.68 | 50481.18 | 27.50% | 25 |
December 2014 | 183589.5 | 36260.43 | 24.61% | 25 |
November 2014 | 147329.07 | 13791.37 | 10.33% | 48 |
October 2014 | 133537.7 | 2503.02 | 1.91% | 99 |
September 2014 | 131034.68 | -2151.11 | -1.62% | 78 |
August 2014 | 133185.79 | 27047.79 | 25.48% | 32 |
July 2014 | 106138 | 21974.38 | 26.11% | 32 |
June 2014 | 84163.62 | -761.4 | -0.90% | 78 |
May 2014 | 84925.02 | 6044.49 | 7.66% | 68 |
April 2014 | 78880.53 | 8737.2 | 12.46% | 68 |
March 2014 | 70143.33 | 10351.96 | 17.31% | 68 |
February 2014 | 59791.37 | 4164.01 | 7.49% | 68 |
January 2014 | 55627.35 | 8839.08 | 18.89% | 68 |
December 2013 | 46788.27 | 16897.75 | 56.53% | 68 |
November 2013 | 29890.52 | 1988.38 | 7.13% | 68 |
October 2013 | 27902.14 | 223.23 | 0.81% | 68 |
September 2013 | 27678.91 | 1717.06 | 6.61% | 68 |
August 2013 | 25961.85 | 5469.33 | 26.69% | 68 |
July 2013 | 20492.52 | 2372.68 | 13.09% | 68 |
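The Gain and % Gain columns appear to be derived from consecutive Avg. Players values: Gain is the difference from the previous month, and % Gain is that difference divided by the previous month's average. The sketch below reproduces the two most recent rows under that assumption; the function name is hypothetical.

```python
def add_gain(rows):
    """Compute (month, gain, percent gain) for rows ordered newest first.

    Each row is (month, average players). The oldest row has no previous
    month to compare with, so it is not included in the output.
    """
    result = []
    for (month, avg), (_, previous_avg) in zip(rows, rows[1:]):
        gain = avg - previous_avg
        result.append((month, round(gain, 2), round(100 * gain / previous_avg, 2)))
    return result


# The three most recent rows of the table above.
recent = [
    ("March 2021", 740927.82),
    ("February 2021", 741013.24),
    ("January 2021", 743209.66),
]
gains = add_gain(recent)
```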
Video game sales and internet popularity dataset
Country | Internet Popularity (million) | Sales (million) |
China | 907.5 | 4085.4 |
United States | 283.9 | 3692.1 |
Japan | 101.5 | 1868.3 |
South Korea | 48.2 | 656.4 |
Germany | 75.5 | 596.5 |
United Kingdom | 61.8 | 551.1 |
France | 58.2 | 398.7 |
Canada | 33.7 | 3051. |
Italy | 52.7 | 266.1 |
Spain | 40.8 | 265.6 |
Russia | 65.2 | 200 |
Sweden | 42 | 1800 |
Finland | 38 | 1734 |
Australia | 29 | 143.5 |
Mexico | 27.65 | 126.5 |
Singapore | 3.145 | 159.3 |
Vietnam | 7.815 | 364.2 |
Poland | 23.3 | 123 |
Netherlands | 35 | 213 |
Slovakia | 33 | 165.6 |