Feature Engineering Post Bias(1)
(Please scroll down for the English version)

由淺入深。
檔位對馬匹取勝的機率當然有影響。最簡單的看法是排檔越外,轉彎就越容易留在外疊,腳程就會越蝕。
見馬會的數據:https://racing.hkjc.com/racing/information/Chinese/racing/Draw.aspx

那我們如何在Python的基礎上高效地活用數據,仔細研究檔位/其他features對馬匹的影響呢?
今日的目標
- 用Python建立Database
- 簡單的資料處理 (Data Massage)
- 了解檔位對疊數、勝率的影響
流程
Step 1 : 下載我先準備好的2018賽事數據(之後有機會再談如何在網上拿取數據)

Step 2 : 下載libraries (pandas / numpy 超有用) - pandas用來做Database,numpy用來計算matrix
pip install pandas
pip install numpy
Step 3 : 建立Database
Step 4 : 簡單的資料處理 (Data Massage)
有些馬可能會因為一些因素(e.g. 騎師低能)而飛到6789疊。我想將佢地group做4疊。
Step 5 : 了解檔位對疊數、勝率的影響
groupby有點像excel pivottable的應用。黃色的部份以後我們會常常用到,可以先記下來。

外檔只有20%的馬匹能以第一疊入直路。

總結
- 其實python/pandas做到的東西很多,在數據很多(大數據如以後會說的天氣對馬匹影響)的情況下,microsoft office是不能應付的。
- 進階者可以想想疊數如何反映在你的model上,又可以想想如何可以拿到這數據。
下次再說。
Let’s start with something easy.
The draw has an observable impact on a horse's probability of winning. The simplest explanation is that the wider the draw, the more likely the horse is to be caught wide (away from the rail) around each bend, and the more ground it has to cover.
Please see the data from the Hong Kong Jockey Club: https://racing.hkjc.com/racing/information/English/racing/Draw.aspx
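The geometry behind this can be put into numbers. The sketch below assumes an idealised circular bend and a hypothetical spacing of 1 metre between adjacent running lines; neither figure comes from the Jockey Club data, so treat the output as an illustration of the principle only.

```python
import math

def extra_distance(wide, bend_angle_rad=math.pi, lane_width_m=1.0):
    """Extra metres covered on one bend by a horse racing `wide` positions out.

    Arc length is radius * angle, so racing (wide - 1) lanes off the rail
    adds (wide - 1) * lane_width_m to the radius for the whole bend.
    lane_width_m = 1.0 is an assumed, illustrative value.
    """
    return (wide - 1) * lane_width_m * bend_angle_rad

for w in (1, 2, 3, 4):
    print(f"{w}W: {extra_distance(w):.2f} m extra on a 180-degree bend")
```

Under these assumptions the cost grows linearly with each extra position off the rail, and it is paid again on every bend.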

So how can we use Python to work with the data efficiently and investigate the impact of Draw and other features on a horse's performance?
Today’s Objective
- Use Python to build a database
- Simple data massage
- Understand how Draw influences Wide* and the probability of a win
*"Wide" represents how wide the horse traveled as it went around each bend, listed in sequence from left to right, i.e. 1W = 1 wide (on the rail); 2W = 2 wide; 3W = 3 wide; 4W = 4 wide and so on.
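To work with Wide numerically, the per-bend sequence has to be split into integers. Here is a minimal sketch, assuming the value is stored as a string such as "2W3W1W"; the storage format and the helper name `parse_wide` are my assumptions, not something taken from the dataset.

```python
import re

def parse_wide(wide_str):
    """Turn a string such as '2W3W1W' into [2, 3, 1], one entry per bend."""
    return [int(n) for n in re.findall(r"(\d+)W", wide_str)]

bends = parse_wide("2W3W1W")
print(bends)       # per-bend positions, first bend on the left
print(bends[-1])   # position at the final bend, i.e. turning into the straight
```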
STEPS:
Step 1: Download the 2018 racing data that I have prepared (I will cover how to extract data online in a later chapter)

Step 2: Install the libraries (pandas and numpy are very useful): pandas to build the database, numpy for matrix calculations
pip install pandas
pip install numpy
Step 3: Build the Database
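A minimal sketch of this step, assuming the data arrives as a CSV file. The column names and the three rows are made up so that the snippet runs on its own; with the real download, pass the file's path to pd.read_csv instead of the in-memory text.

```python
import io
import pandas as pd

# Stand-in for the downloaded file: three made-up rows with hypothetical
# column names. With the real data, pass its path to pd.read_csv instead.
csv_text = """race_id,horse,draw,wide,finish_pos
1,Horse A,1,1W1W,1
1,Horse B,7,3W2W,4
1,Horse C,12,4W6W,9
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.dtypes)   # check that draw and finish_pos were read as integers
print(df.head())
```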
Step 4: Simple Data Massage
Some horses end up 6, 7, 8 or 9 wide for one reason or another (e.g. stupid jockeys). I would like to group them into 4W.
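One way to do this grouping is to cap the value at 4. The rows and the column name `final_wide` below are hypothetical. Note that clip also folds 5W into 4W, which is my assumption, since only 6 to 9 are named above.

```python
import pandas as pd

# Made-up rows; `final_wide` is a hypothetical column holding the number of
# horses-wide at the final bend.
df = pd.DataFrame({
    "horse": ["A", "B", "C", "D", "E"],
    "final_wide": [1, 3, 4, 6, 9],
})

# Cap every value above 4 at 4, leaving 1W to 4W untouched.
df["wide_grouped"] = df["final_wide"].clip(upper=4)

print(df)
```

Keeping the original column alongside the grouped one makes it easy to check how many runners were affected.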
Step 5: Understand how Draw influences Wide* and the probability of a win
groupby works much like a pivot table in Excel. We will use the part highlighted in yellow often in later chapters, so it is worth noting down now.
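To make the pivot-table comparison concrete, here is a small sketch of a groupby by draw. The rows are invented and the column names are hypothetical, so the numbers it prints say nothing about real races.

```python
import pandas as pd

# Made-up rows for illustration only; column names are hypothetical.
df = pd.DataFrame({
    "draw":       [1, 1, 2, 2, 12, 12],
    "final_wide": [1, 1, 1, 2, 3, 4],
    "finish_pos": [1, 3, 2, 1, 5, 8],
})

# 1 if the horse won, 0 otherwise, so that the mean is the win rate.
df["win"] = (df["finish_pos"] == 1).astype(int)

# One row per draw, like a pivot table with Draw as the row label.
summary = df.groupby("draw").agg(
    runners=("win", "count"),
    win_rate=("win", "mean"),
    avg_wide=("final_wide", "mean"),
)

print(summary)
```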

Only 20% of the horses from the outer draws turn into the home straight 1W.
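A percentage like this can be read off a row-normalised cross-tabulation of draw against the position at the final bend. The sketch below uses invented rows and hypothetical column names, so it does not reproduce the 20% figure; it only shows the shape of the calculation.

```python
import pandas as pd

# Made-up rows; `final_wide` is a hypothetical column for the final bend.
df = pd.DataFrame({
    "draw":       [1, 1, 1, 1, 12, 12, 12, 12],
    "final_wide": [1, 1, 1, 2, 1, 3, 4, 4],
})

# Each row sums to 1: the share of runners from that draw at each width.
share = pd.crosstab(df["draw"], df["final_wide"], normalize="index")

print(share)
print(share[1])   # share of runners turning for home 1W, by draw
```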

Conclusion
- In fact, python/pandas can do a great deal. Microsoft Office cannot cope with a massive amount of data (e.g. the big data on how weather affects horses, which we will cover later).
- For advanced readers: think about how you could reflect the impact of Wide in your model, and how you could obtain this piece of data.
Stay tuned!