PPG_Easy Hong Kong Horse Racing Programming [Web Scraping: Get the Data Yourself] (1)

PPG HorseRacing

Nov 25, 2020

Image source : https://towardsdatascience.com/web-scraping-using-selenium-python-8a60f4cf40ab

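How can you obtain historical horse racing data? Plenty of people sell racing data, but besides the high cost, the quality is questionable. I hope this series gives you a first idea of how to collect and analyse useful data automatically, on your own, without relying on anyone else.

Web scraping difficulty at a glance

1. Simple, common web scraping (e.g. the race card)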

Race Card - Racing Information - The Hong Kong Jockey Club

racing.hkjc.com

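2. Websites that are unfriendly to web scrapers (e.g. racingandsports does not want bots to take up too much of its server resources, so it looks at how frequently a user requests data to decide whether that user is a bot)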

A Shin Danshaku Thoroughbred Horse Profile - Next Race, Form, Stats, News, Breeding

A Shin Danshaku thoroughbred Horse profile, next race, formguide, stats, breeding, news, Jockey and trainer information…

www.racingandsports.com

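3. Files in other formats (e.g. pdf / mp4)

Valuable data that can be neither fetched directly nor purchased, such as the race records of privately purchased horses before they came to Hong Kong, or winning margins over the final 200 metres…

I think this is the most interesting part. It touches on many popular topics such as machine learning, image recognition, moving object detection and pattern matching. More on this when there is a chance.

4. Websites with randomly generated HTML tags (e.g. LinkedIn, which has nothing to do with racing)

Going from easy to hard, let's start with "simple, common web scraping". The example below uses a Japanese race, the 3歳上500万下. I hope you can generalise from it and practise collecting Hong Kong Jockey Club data yourself.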

３歳上５００万下結果・払戻 | 2016年12月18日中山8R レース情報(JRA) - netkeiba.com

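Results and payouts for the 3歳上500万下, Nakayama Race 8, 18 December 2016. The site carries race cards, the latest odds, live results and payout information for JRA races, along with predictions, data analysis and other information useful for picking winners.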

race.netkeiba.com

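Workflow

Step 1: Beginners can first read this article to see how to use Selenium in Python.

Step 2: Use the browser's developer tools to inspect the tag id of the data table. Because a tag id is unique, there is no need to note down the part highlighted in yellow.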

table#All_Result_Table
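
The selector `table#All_Result_Table` reads as "the `table` element whose id is `All_Result_Table`". As a small sanity check of the "ids are unique" assumption, the sketch below counts how often each id appears in a piece of HTML, using only the standard library. The `SAMPLE` markup and the names `IdCounter` and `count_ids` are made up for illustration; they are not the real page source.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCounter(HTMLParser):
    """Count how many times each id attribute appears in a page."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def count_ids(html):
    """Return a Counter mapping each id in the HTML to its frequency."""
    parser = IdCounter()
    parser.feed(html)
    parser.close()
    return parser.ids


# Made-up fragment for illustration only; not the real page markup.
SAMPLE = """
<div id="page">
  <table id="All_Result_Table" class="results"><tr><td>1</td></tr></table>
  <table class="results"><tr><td>2</td></tr></table>
</div>
"""

ids = count_ids(SAMPLE)
print(ids["All_Result_Table"])  # 1
```

If the count for an id is exactly 1, the id on its own is enough to pin down the table, and the class names around it can be ignored.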

Step 3: Start the Chrome driver and locate the target table.

Use WebDriverWait / sleep.
Add a while loop to cope with network errors, and try/except so that a wait does not run for too long.
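
The original listing for this step is not reproduced in this text, so here is a minimal sketch of the same idea, not the exact code. It assumes Selenium 4 with Chrome available; the function names `retry`, `open_table` and `scrape`, the attempt count and the timeouts are all my own choices for illustration.

```python
import time


def retry(action, max_attempts=3, pause=2.0):
    """Call action() until it succeeds or max_attempts is reached."""
    attempt = 0
    while attempt < max_attempts:
        attempt += 1
        try:
            return action()
        except Exception:
            # Broad catch for brevity; narrow it to the errors you expect.
            if attempt == max_attempts:
                raise  # give up and surface the last error
            time.sleep(pause)  # short pause before the next attempt


def open_table(driver, url, selector="table#All_Result_Table", timeout=10):
    """Load url and wait until the target table is present in the DOM."""
    # Imported here so that retry() above can be used without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Raises TimeoutException if the table has not appeared in `timeout` s.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )


def scrape(url):
    """Start Chrome, locate the table, and always close the browser."""
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        table = retry(lambda: open_table(driver, url))
        return table.text
    finally:
        driver.quit()
```

Calling `scrape(url)` with the address of the result page returns the table's visible text. The explicit wait puts an upper bound on how long one attempt can take, while the `while` loop in `retry` gives a flaky connection a few more chances before the error is raised.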

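Step 4: Copy the table into a Python list. (A list is roughly equivalent to an array in VBA.)

Lines 8 to 13 work for the vast majority of tables on the web. They are very useful and worth remembering.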
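
The listing that those line numbers refer to is not reproduced in this text. As a stand-in, here is a minimal sketch of the usual row-by-row, cell-by-cell pattern, assuming `table` is the Selenium element located in Step 3. The name `table_to_rows` is hypothetical, and the two locator strings are written out so the sketch has no import-time dependency.

```python
# Locator strategy strings; these are the values of Selenium's
# By.TAG_NAME and By.CSS_SELECTOR constants.
TAG_NAME = "tag name"
CSS_SELECTOR = "css selector"


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_elements(TAG_NAME, "tr"):
        # Header cells (th) and data cells (td) are both kept.
        cells = tr.find_elements(CSS_SELECTOR, "th, td")
        rows.append([cell.text.strip() for cell in cells])
    # Drop rows that contain no cells at all (e.g. spacer rows).
    return [row for row in rows if row]
```

The result is a list of lists, one inner list per table row, which is the shape the CSV step expects. Because the function only relies on `find_elements` and `.text`, the same loop should carry over to most plain HTML tables.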

Step 5: Save the list to a CSV file.

“a” is append mode: new data is added to the end of the existing file.
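
A minimal sketch of the append step using the standard `csv` module. The function names `append_rows` and `read_rows` and the UTF-8 encoding are my assumptions, not necessarily what the original code uses.

```python
import csv


def append_rows(path, rows):
    """Append rows (a list of lists) to the CSV file at path."""
    # "a" creates the file if it is missing and never overwrites old data.
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def read_rows(path):
    """Read the whole CSV file back as a list of lists."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```

Calling `append_rows` once per race lets one file accumulate many races. Note that running the same scrape twice appends the same rows twice, so write the header row only once and avoid re-scraping pages you already have.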

Results

The data you obtain can then be used for data processing (Data Massage) and analysis (Feature Engineering).

If you like my articles, please follow me =]

More next time.

Source Code

patllc/PPG_horseracing

AutoBet using Python Selenium. Contribute to patllc/PPG_horseracing development by creating an account on GitHub.

github.com

Written by PPG HorseRacing
