Feature Engineering: Post Bias (1)

PPG HorseRacing
7 min read · Nov 23, 2020

(Please scroll down for the English version)

由淺入深。

檔位對馬匹取勝的機率當然有影響。最簡單的看法是排檔越外,轉彎就越容易留在外疊,腳程就會越蝕。

見馬會的數據:https://racing.hkjc.com/racing/information/Chinese/racing/Draw.aspx

例子 : 福穎 21/11/2018 三疊入直路

那我們如何在Python的基礎上高效地活用數據,仔細研究檔位/其他features對馬匹的影響呢?

今日的目標

  1. 用Python建立Database
  2. 簡單的資料處理 (Data Massage)
  3. 了解檔位對疊數、勝率的影響

流程

Step 1 : 下載我先準備好的2018賽事數據(之後有機會再談如何在網上拿取數據)

數據樣本
你地估下係邊隻馬跑9疊入彎?

Step 2 : 下載libraries (pandas / numpy 超有用) - pandas用來做Database,numpy用來計算matrix

pip install pandas

pip install numpy

Step 3 : 建立Database

Step 4 : 簡單的資料處理 (Data Massage)

有些馬可能會因為一些因素(e.g. 騎師低能)而飛到6789疊。我想將佢地group做4疊。

Step 5 : 了解檔位對疊數、勝率的影響

groupby有點像excel pivottable的應用。黃色的部份以後我們會常常用到,可以先記下來。

結果當然是內檔勝率較高

外檔只有20%的馬匹能以第一疊入直路。

總結

  1. 其實python/pandas做到的東西很多,在數據很多(大數據如以後會說的天氣對馬匹影響)的情況下,microsoft office是不能應付的。
  2. 進階者可以想想疊數如何反映在你的model上,又可以想想如何可以拿到這數據。

下次再說。

Let’s start with something easy.

The draw has an observable impact on a horse's win probability. The simplest explanation is that a horse drawn wide is more likely to be stuck away from the rail around each bend, so it has to cover more ground.
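To get a feel for how much ground a wide run can cost, here is a rough back-of-the-envelope sketch. It assumes the bend is a circular arc and that each extra path off the rail adds a fixed width to the radius; the 1 m path width and the 180-degree turn are my own illustrative assumptions, not measurements of any real track.

```python
import math


def extra_distance_m(wide, path_width_m=1.0, turn_degrees=180.0):
    """Extra metres covered around one bend compared with the rail (1W).

    Assumes a circular bend: arc length = radius * angle, so each extra
    path off the rail adds path_width_m * angle (in radians).
    path_width_m and turn_degrees are illustrative assumptions.
    """
    paths_off_rail = wide - 1
    return paths_off_rail * path_width_m * math.radians(turn_degrees)


for wide in (1, 2, 3, 4):
    print(f"{wide}W: {extra_distance_m(wide):.1f} m extra")
```

Under these assumptions every extra path costs about 3 m per half-circle bend, so the penalty grows linearly with how wide the horse runs.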

Please see the data from the Hong Kong Jockey Club: https://racing.hkjc.com/racing/information/English/racing/Draw.aspx

Example: BURST AWAY, 21/11/2018, turning into the home straight 3 wide (3W)

So how can we use Python to work with the data efficiently and study the impact of the draw, and of other features, on a horse?

Today’s Objective

  1. Use Python to build a database
  2. Simple data massaging
  3. Understand how the draw influences Wide* and the probability of a win

*"Wide" represents how wide the horse travelled as it went around each bend, listed in sequence from left to right, i.e. 1W = 1 wide (on the rail); 2W = 2 wide; 3W = 3 wide; 4W = 4 wide and so on.

STEPS:

Step 1: Download the 2018 racing data that I have prepared (I will cover how to extract data from the web in a later chapter)

Data Sample
Can you guess which horse was 9W going around the bend?

Step 2: Install the libraries (pandas and numpy are both very useful): pandas to build the database, numpy for matrix calculations

pip install pandas

pip install numpy

Step 3: Build the Database
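As a minimal sketch of this step, the snippet below loads race records into a pandas DataFrame. The column names (Date, RaceNo, Horse, Draw, Wide, Place) and the three rows are made up so the example runs on its own, and Wide is simplified to a single number per horse; the real 2018 file may be laid out differently.

```python
import io

import pandas as pd

# Toy stand-in for the downloaded 2018 file.
# The rows and column names are hypothetical, for illustration only.
SAMPLE_CSV = """Date,RaceNo,Horse,Draw,Wide,Place
2018-01-01,1,HORSE A,1,1,1
2018-01-01,1,HORSE B,7,3,4
2018-01-01,1,HORSE C,12,9,10
"""

# With the real download you would pass the file path to pd.read_csv instead.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["Date"])

print(df.shape)   # (rows, columns)
print(df.dtypes)  # check that numbers and dates were parsed as expected
print(df.head())  # first few records
```

Checking `dtypes` straight after loading is a cheap way to catch columns that were read as text when you expected numbers.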

Step 4: Simple data massaging

Some horses end up 6, 7, 8 or even 9 wide for some reason (e.g. stupid jockeys). I would like to group them all into 4W.
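One possible way to do this grouping is to cap the value at 4 with pandas' `clip`. The sketch below assumes Wide is already stored as an integer column and that anything wider than 4 (including 5W) should be treated as 4W; the horse names and values are made up.

```python
import pandas as pd

# Hypothetical records; "Wide" is assumed to be a plain integer column.
df = pd.DataFrame({
    "Horse": ["HORSE A", "HORSE B", "HORSE C", "HORSE D", "HORSE E"],
    "Wide": [1, 3, 4, 6, 9],
})

# Cap every value above 4 at 4, leaving 1W to 4W untouched.
df["WideGrouped"] = df["Wide"].clip(upper=4)

print(df)
```

Writing the result to a new column keeps the raw value around in case you want a different cut-off later.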

Step 5: Understand how the draw influences Wide* and the probability of a win

groupby works a bit like an Excel pivot table. We will use the part highlighted in yellow often in later chapters, so it is worth noting down now.
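Here is a small sketch of the pivot-table idea: group the runs by draw and work out the win rate for each group. The Draw and Place columns and all the values are invented for illustration, so the numbers mean nothing; only the pattern matters.

```python
import pandas as pd

# Made-up runs: one row per horse per race (hypothetical column names).
df = pd.DataFrame({
    "Draw":  [1, 1, 2, 2, 12, 12],
    "Place": [1, 3, 1, 5, 2, 8],
})

# Flag winners, then summarise per draw, like rows in a pivot table.
df["Win"] = (df["Place"] == 1).astype(int)
summary = df.groupby("Draw")["Win"].agg(runs="count", wins="sum", win_rate="mean")

print(summary)
```

Because Win is 0 or 1, its mean within each group is exactly the win rate for that draw.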

As expected, the inner draws have a higher win probability.

Only 20% of horses from the outer draws turn into the home straight 1 wide (1W).
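A share like this can be read off a cross-tabulation of draw against Wide. The sketch below uses `pd.crosstab` with `normalize="index"` so that each row (draw) sums to 1; the data is a toy example with hypothetical column names and will not reproduce the real figure.

```python
import pandas as pd

# Toy data: draw and how wide each horse turned for home (made-up values).
df = pd.DataFrame({
    "Draw": [1, 1, 1, 1, 12, 12, 12, 12],
    "Wide": [1, 1, 1, 2, 1, 3, 4, 4],
})

# Rows are draws, columns are Wide values, cells are the share of runs
# from that draw that ended up that wide.
share = pd.crosstab(df["Draw"], df["Wide"], normalize="index")

print(share.round(2))
```

Reading down the first column (Wide = 1) then shows, for each draw, the proportion of horses that managed to turn in on the rail.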

Conclusion

  1. Python and pandas can do a great deal. Microsoft Office cannot cope once the data gets massive (e.g. the big data on how weather affects horses, which we will cover later).
  2. For advanced readers: think about how the impact of Wide could be reflected in your model, and how you could obtain this data.

Stay tuned!
