This is the group project for SDSC2004 – Data Visualization, completed in Semester B of 2020/21, during my first year.
Presentation Slides:
Course Instructor: Dr. Yu Yang
1. Introduction
Hong Kong's annual mean temperature has risen markedly (Figure 4), and its power consumption has risen in a similar way. Although the two trends look alike, that alone does not establish a correlation between temperature and electricity consumption in Hong Kong. We assume that a relationship between the two factors exists and test this assumption with linear regression. This report mainly uses two datasets: the average monthly temperature in Hong Kong from 1970 to 2020, from DATA.GOV.HK, and electricity consumption (domestic, commercial and industrial) in Hong Kong from 1970 to 2020, from the Census and Statistics Department. Both are provided through the open API of the Hong Kong Government. The report describes how the data were obtained and processed, the challenges that arose during the analysis, and how they were solved. Most importantly, it presents the results of the linear regression of electricity consumption on temperature in Hong Kong.
2. Data pre-processing
2.1 Data cleaning
The temperature dataset from the government (Appendix) contains unknown values. As can be seen in Figure 1, the temperature value and temperature unit fields contain unexplained entries, such as the star (***) and hashtag (###) symbols. These entries cause an error when the program calculates the average temperature.
Data cleaning was therefore carried out, consisting of the following actions:
(1) The unknown star symbol (*) was removed.
(2) Inconsistent data were corrected. For example, the hashtag symbol (#) was changed to the degree Celsius unit (C).
(3) Missing temperature values were filled in after the unknown data were removed.
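The three cleaning steps can be sketched in pandas as follows. This is a minimal illustration, not the project's actual cleaning code: the sample rows are made up, the column names `temp` and `unit` follow the appendix, `clean_temperature` is a hypothetical helper, and linear interpolation is only one assumed way of filling the gaps, since the report does not say which method was used.

```python
import pandas as pd

# Made-up sample rows that mimic the problems described above.
raw = pd.DataFrame({
    "year": [1970, 1970, 1970, 1970],
    "month": [1, 1, 1, 1],
    "day": [1, 2, 3, 4],
    "temp": ["15.0", "***", "17.0", "18.0"],
    "unit": ["C", "***", "#", "C"],
})

def clean_temperature(df):
    """Return a cleaned copy of the raw temperature table."""
    out = df.copy()
    # (1) Non-numeric placeholders such as "***" become missing values (NaN).
    out["temp"] = pd.to_numeric(out["temp"], errors="coerce")
    # (2) Make the unit column consistent (assumption: every reading is in Celsius).
    out["unit"] = out["unit"].replace({"#": "C", "***": "C"})
    # (3) Fill missing readings (assumption: linear interpolation between neighbours).
    out["temp"] = out["temp"].interpolate(limit_direction="both")
    return out

cleaned = clean_temperature(raw)
print(cleaned)
```

Coercing to numeric first means any other unexpected text in the temperature column is also treated as missing instead of raising an error later.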
2.2 Data Integration
Temperature datasets from different weather stations from 1970 to 2020 (Appendix) are required. However, they cannot be downloaded in a single run because the API of the HKO site enforces a call limit. The data therefore have to be requested in separate batches to stay within the limit, and the two resulting dataset files are then merged into one with code (Figure 2).
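Merging the separately downloaded parts amounts to stacking their rows. The sketch below assumes both parts share the same columns; the small in-memory tables stand in for the two CSV files, `merge_parts` is a hypothetical helper, and dropping duplicate rows is an extra precaution that the report does not mention.

```python
import pandas as pd

# Made-up stand-ins for the two downloaded parts; one row overlaps.
part1 = pd.DataFrame({
    "year": [1970, 1970],
    "month": [1, 2],
    "day": [1, 1],
    "temp": [15.2, 16.1],
    "station": ["SHA", "SHA"],
})
part2 = pd.DataFrame({
    "year": [1970, 1970],
    "month": [2, 3],
    "day": [1, 1],
    "temp": [16.1, 18.4],
    "station": ["SHA", "SHA"],
})

def merge_parts(parts):
    """Stack the parts row-wise and drop rows that appear more than once."""
    merged = pd.concat(parts, ignore_index=True)
    return merged.drop_duplicates().reset_index(drop=True)

full = merge_parts([part1, part2])
print(full)
```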
2.3 Data Transformation
2.3.1 Normalization
The datasets contain different types of data with different units, and the ranges of the values also differ. This makes it difficult to calculate with the data directly and to plot clear graphs. To solve the problem, we use normalization: we use Python to convert the data into a similar format so that they are easy to calculate with and to plot.
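The report does not state which normalization formula was applied (the `normalize_data` function in the appendix averages the readings for each month). As one common possibility, the sketch below shows min-max scaling, which maps every column onto the range 0 to 1 so that temperature and consumption can share one axis. The numbers are made up and `min_max_normalize` is a hypothetical helper.

```python
import pandas as pd

def min_max_normalize(series):
    """Rescale a numeric series linearly to the range [0, 1]."""
    span = series.max() - series.min()
    if span == 0:
        # A constant column carries no spread; map it to zeros.
        return series * 0.0
    return (series - series.min()) / span

# Made-up monthly values in very different units and ranges.
monthly = pd.DataFrame({
    "average temp": [16.0, 22.0, 28.0],
    "Domestic": [1000.0, 1500.0, 3000.0],
})

normalized = monthly.apply(min_max_normalize)
print(normalized)
```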
3. Data visualization & analytics
3.1 Average Temperature in Hong Kong Between 1885-2020
The graph "Hong Kong Temperature Between 1885-2015" shows an increasing trend in Hong Kong's average temperature. On average, the rate of increase between 1885 and 2020 is 0.13℃ per decade. The slope is steeper between 1991 and 2020, at 0.24℃ per decade, which is about 85% higher than the 1885-2020 rate.
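A warming rate in ℃ per decade is the slope of a straight line fitted to annual mean temperature against year, multiplied by ten. The sketch below illustrates the calculation on synthetic data built with a known slope; it does not use the Observatory's data or reproduce its method, and `trend_per_decade` is a hypothetical helper. The last lines compute the relative difference between the two rates quoted above.

```python
import numpy as np

def trend_per_decade(years, temps):
    """Least-squares linear trend, expressed in degrees per decade."""
    slope, _intercept = np.polyfit(years, temps, 1)
    return slope * 10.0

# Synthetic series: exactly 0.024 degrees per year, i.e. 0.24 per decade.
years = np.arange(1991, 2021)
temps = 23.0 + 0.024 * (years - 1991)

rate = trend_per_decade(years, temps)
print(round(rate, 2))

# Relative difference between the 1991-2020 rate and the 1885-2020 rate.
relative = (0.24 - 0.13) / 0.13
print(f"{relative:.0%}")
```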
3.2 Data Visualization & analytics in Hong Kong Electricity Consumption and Temperature
To verify the correlation between Hong Kong electricity consumption and temperature, the p-value, the coefficient of correlation (R) and the coefficient of determination (R^2) are calculated.
- P= 6.673E-11(One-Tailed Test)
- R= 0.829927
- R^2= 0.688778825
The result for 1970 is significant at p < .05. There is a strong positive relationship between electricity consumption and temperature in 1970, and 68.9% of the variation in electricity consumption can be attributed to temperature.
- P= 2.79013E-09 (One-Tailed Test)
- R= 0.958913
- R^2= 0.919514142
The result for 2020 is significant at p < .05. There is a strong positive relationship between electricity consumption and temperature in 2020, and 92.0% of the variation in electricity consumption can be attributed to temperature.
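For readers who want to reproduce this kind of figure, the sketch below computes R and R^2 for one year of monthly data with NumPy, together with the t statistic from which a p-value is obtained using a t distribution with n - 2 degrees of freedom. The monthly values are made up and `correlation_summary` is a hypothetical helper; the report does not say which tool produced its numbers.

```python
import numpy as np

def correlation_summary(x, y):
    """Pearson R, R squared and the matching t statistic (n - 2 degrees of freedom)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    n = len(x)
    # Undefined when r is exactly +1 or -1 (division by zero).
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return {"r": r, "r_squared": r ** 2, "t": t_stat}

# Made-up monthly averages for one year (not the project's data).
temperature = [16, 17, 19, 23, 26, 28, 29, 29, 28, 25, 21, 18]
consumption = [610, 590, 640, 720, 900, 1080, 1190, 1210, 1100, 930, 700, 640]
summary = correlation_summary(temperature, consumption)
print(summary)

# Small hand-checkable case: R is exactly 0.8, so R squared is 0.64.
check = correlation_summary([1, 2, 3, 4], [1, 3, 2, 4])
```

A larger t statistic for the same number of months corresponds to a smaller p-value.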
- Mean R-value: 0.851730294
The graph reveals a strong positive relationship between electricity consumption and temperature from 1970 to 2020.
- Range: 0.745304 to 0.958913
The R-values vary little from 1970 to 2020, which indicates that electricity consumption and temperature follow a similar increasing trend in every year.
- Mean R^2 value: 0.727050259
The result indicates that around 73% of the variation in electricity usage can be attributed to temperature.
- Range: 0.555478052 to 0.919514142
The R^2 values vary little from 1970 to 2020, which indicates that electricity consumption and temperature follow a similar linear relationship in every year.
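Summary figures of this kind come from computing one correlation per year and then taking the mean and range across years. The sketch below shows the idea on two made-up years; the column names follow the appendix, while `yearly_r` and the data are hypothetical.

```python
import pandas as pd

# Made-up monthly records for two years (four months each, for brevity).
records = pd.DataFrame({
    "Year": [2019] * 4 + [2020] * 4,
    "average temp": [16.0, 20.0, 24.0, 28.0, 17.0, 21.0, 25.0, 29.0],
    "Domestic": [10.0, 30.0, 20.0, 40.0, 10.0, 20.0, 30.0, 40.0],
})

def yearly_r(df):
    """Pearson R between temperature and consumption, one value per year."""
    values = {
        year: group["average temp"].corr(group["Domestic"])
        for year, group in df.groupby("Year")
    }
    return pd.Series(values, name="r")

r_by_year = yearly_r(records)
r2_by_year = r_by_year ** 2

print("mean R:", r_by_year.mean(), "range:", r_by_year.min(), "to", r_by_year.max())
print("mean R^2:", r2_by_year.mean(), "range:", r2_by_year.min(), "to", r2_by_year.max())
```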
Based on the p-values, coefficients of correlation and coefficients of determination, the datasets show that temperature in Hong Kong is significantly associated with total electricity consumption (domestic, commercial and industrial). Furthermore, the correlation is strong, positive and linear. In short, total electricity consumption rises as temperature increases.
4. Conclusion
In a nutshell, the linear regression of electricity consumption on temperature gives a mean R-value of about 0.85 for 1970 to 2020, which indicates a strong positive relationship between temperature and electricity consumption. The mean R^2 value is about 0.73, which means that about 73% of the variation in electricity consumption is explained by the linear regression model. These results support our assumption that high temperature is associated with high electricity consumption in Hong Kong.
In this project, our group also ran into many problems while processing the data, such as noisy data that were incomplete or inconsistent. Solving them required the techniques taught in the lessons, and we found that methods such as normalization were especially useful in this case. Apart from that, cleaning and processing the data required us to design a proper workflow first, which took more time than we expected. Without this project, our group would not have gained this much experience of both the fun and the difficulty of processing data.
5. References
Hong Kong Observatory. (2021, April 20). Climate Change in Hong Kong – Temperature. https://www.hko.gov.hk/en/climate_change/obs_hk_temp.htm
6. Appendix
Full code:
import pandas as pd
import requests
import os
import time
import matplotlib.pyplot as plt
def fetch_data():
    """
    Fetches temperature data for various stations from the Hong Kong Observatory API and saves it to a CSV file.
    """
    # Project folders on Google Drive (kept from the original notebook; not used below).
    ROOTDIR = "/content/drive/MyDrive/SDSC2004_Project"
    DATADIR = os.path.join(ROOTDIR, "dataset")
    DATATYPE = "CLMTEMP"
    STATIONS = ["SHA", "SKG", "SKW", "SSH", "SSP", "STY", "TC", "TKL", "TMS",
                "TPO", "TU1", "TW", "TWN", "TY1", "TYW", "VP1", "WGL", "WLP", "WTS", "YLP"]
    START_YEAR = 1970
    END_YEAR = 2021
    RFORMAT = "json"
    url = "https://data.weather.gov.hk/weatherAPI/opendata/opendata.php"
    errors = []
    data = []
    for station in STATIONS:
        for year in range(START_YEAR, END_YEAR + 1):
            time.sleep(5)  # pause between requests because of the API call limit
            payload = {
                'dataType': DATATYPE,
                'rformat': RFORMAT,
                'station': station,
                'year': year
            }
            print(f'Getting data for station {station} in {year}')
            res = requests.get(url, params=payload)
            try:
                # A response body that is not valid JSON raises ValueError.
                temps = res.json().get('data') or []
            except ValueError:
                error = f"Missing: dataType = {payload.get('dataType')}, station = {payload.get('station')}, year = {payload.get('year')}"
                errors.append(error)
                continue
            for temp in temps:
                temp.append(station)  # tag each row with its station code
                data.append(temp)
    if errors:
        print("\n".join(errors))  # report the requests that returned no usable data
    df_temp = pd.DataFrame(data, columns=['year', 'month', 'day', 'temp', 'unit', 'station'])
    df_temp.to_csv('./dataset/part2.csv', index=False)
def merge_csv(csv1, csv2):
    """
    Merges two CSV files into one.
    """
    part1 = pd.read_csv(csv1)
    part2 = pd.read_csv(csv2)
    full = pd.concat([part1, part2])
    full.to_csv('./dataset/temp.csv', index=False)
def normalize_data():
    """
    Normalizes temperature data by averaging temperatures for each month.
    """
    temp = pd.read_csv('./dataset/temp.csv')
    data = []
    for year in range(1970, 2020 + 1):
        for month in range(1, 12 + 1):
            log = temp.loc[(temp['year'] == year) & (temp['month'] == month)].dropna()
            # errors='coerce' turns leftover non-numeric entries into NaN, which mean() skips.
            data.append([year, month, pd.to_numeric(log['temp'], errors='coerce').mean()])
    df_temp = pd.DataFrame(data, columns=['year', 'month', 'average temp'])
    df_temp.to_csv('./dataset/temp_clean.csv', index=False)
def plot_energy_vs_temp_by_year():
    """
    Plots energy consumption against average temperature for each year.
    """
    temp = pd.read_csv('./dataset/temp_clean.csv')
    energy = pd.read_csv('./dataset/electric.csv')
    energy_domestic = energy[['Year', 'Month', 'Domestic']]
    # The key columns are named differently in the two files, so name them explicitly.
    df_combined = pd.merge(temp, energy_domestic,
                           left_on=['year', 'month'], right_on=['Year', 'Month'])
    os.makedirs('./graph', exist_ok=True)  # make sure the output folder exists
    for year in range(1970, 2021):
        plt.clf()
        df_filtered = df_combined[df_combined['Year'] == year]
        plt.scatter(df_filtered['average temp'], df_filtered['Domestic'], marker='x', c='red', linewidths=1)
        plt.title(str(year))
        plt.xlabel("Average Temperature (°C)")
        plt.ylabel("Domestic Energy Consumption (kWh)")
        plt.savefig(f'./graph/{year}.png')
# Example usage
fetch_data() # Fetch and save temperature data
merge_csv('./dataset/part1.csv', './dataset/part2.csv') # Merge two parts of the dataset
normalize_data() # Normalize temperature data
plot_energy_vs_temp_by_year() # Plot energy consumption against temperature for each year