How to train a model for team dominance detection?

What is team dominance in sports and how to measure it using match data

Team dominance in a match is more than the score on the scoreboard. A team may lead by one goal yet barely leave its own half, or trail while constantly creating chances and pressing the opponent. For analytics, betting, and automated recommendations, dominance needs to be formalized through quantitative indicators that can be extracted from match data: possession, shots, dangerous attacks, duels, fouls, and other metrics.

In practice, it is convenient to treat dominance as the target variable of a machine learning model. It can be defined as a binary class (dominates / does not dominate), a multiclass label (home team dominates, even game, away team dominates), or a continuous index from 0 to 1. The index is computed from a combination of statistical indicators over a chosen time interval: the entire match, a half, a five-to-ten-minute window, or a specific game segment after a goal or a red card.
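
The three target formats are closely related: a continuous index can be collapsed into classes with a threshold. The sketch below assumes the index measures home-team dominance on a 0 to 1 scale; the function name and the `margin` parameter are illustrative and not part of any API.

```python
def dominance_label(index: float, margin: float = 0.1) -> str:
    """Map a home-dominance index in [0, 1] to one of three classes.

    `margin` is a hypothetical dead zone around 0.5 that is treated
    as an even game; tune it on your own data.
    """
    if not 0.0 <= index <= 1.0:
        raise ValueError("index must be in [0, 1]")
    if index > 0.5 + margin:
        return "home"
    if index < 0.5 - margin:
        return "away"
    return "even"
```

For a binary target, the same idea applies with a single threshold, for example treating any index above 0.5 + margin as "dominates".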

Measuring dominance objectively requires a rich, detailed set of statistics. A specialized sports API returns a data structure for each match: statistics for the whole match and for each half, live events, lineups, tournament context, and bookmaker odds. From this set, the model can assess which team controls the game and update its predictions as new events occur.

import requests
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.api-sport.ru/v2/football"
MATCH_ID = 14570728  # example match ID
headers = {"Authorization": API_KEY}
url = f"{BASE_URL}/matches/{MATCH_ID}"
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on HTTP errors
match = response.json()
# Extract the summary statistics for the whole match (period == "ALL")
stats_all = next(
    (p for p in match.get("matchStatistics", []) if p.get("period") == "ALL"),
    None,
)
if stats_all:
    overview_group = next(
        (g for g in stats_all["groups"] if g["groupName"] == "Match overview"),
        None,
    )
    if overview_group:
        for item in overview_group["statisticsItems"]:
            if item["key"] in ("ballPossession", "totalShotsOnGoal", "shotsOnGoal"):
                print(item["name"], ":", item["home"], "vs", item["away"])

What data from the sports API is needed for the team dominance detection model

A good dominance detection model needs access to the most detailed match data available. The Sports Events API provides several key layers of information: basic match parameters (tournament, teams, status, current minute), advanced per-period statistics (matchStatistics), live events (liveEvents), and bookmaker odds (oddsBase). Together they make it possible to assess both the pattern of play and market expectations.

Match statistics include dozens of metrics: possession, shots, shots on target, passes, entries into the final third, corners, fouls, tackles, duels, goalkeeper saves, and more. Each sport has its own set of indicators: in football and hockey, shots and dangerous moments matter most; in basketball, offensive efficiency and rebounds; in tennis, serves and converted break points. Live events add context to the statistics: goals, red cards, penalties, and long pauses let the model account for shifts in the balance of power during the match.

Bookmaker odds (oddsBase) show how the market rates the teams' strength and how dominance shifts during live play. Combining the history of odds changes with current statistics can yield more robust models that are less sensitive to noise and less prone to overfitting. The resulting training dataset includes numerical features from statistics, categorical features of the match context, and time series of odds changes.
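
One common way to turn odds into a model feature is to convert them to implied probabilities. The sketch below assumes decimal odds for a three-way market passed in as a plain dictionary; the actual layout of oddsBase is not shown in this article, so extracting the numbers from the API response is left out.

```python
from typing import Dict


def implied_probabilities(odds: Dict[str, float]) -> Dict[str, float]:
    """Convert decimal odds to probabilities that sum to 1.

    The bookmaker margin is removed by proportional normalization,
    which is one simple convention among several.
    """
    if any(o <= 1.0 for o in odds.values()):
        raise ValueError("decimal odds must be greater than 1.0")
    raw = {outcome: 1.0 / o for outcome, o in odds.items()}
    total = sum(raw.values())  # exceeds 1.0 by the bookmaker margin
    return {outcome: p / total for outcome, p in raw.items()}


# Illustrative numbers, not real market data
probs = implied_probabilities({"home": 1.80, "draw": 3.60, "away": 4.50})
favorite_level = probs["home"] - probs["away"]  # > 0 means the home team is favored
```

Applying the same function to pre-match and live odds gives a pair of features: the starting favorite level and its current value during the match.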

import requests
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.api-sport.ru/v2/football"
headers = {"Authorization": API_KEY}
params = {
    "date": "2025-09-03",   # match date
    "status": "finished",   # finished matches only
}
resp = requests.get(f"{BASE_URL}/matches", headers=headers, params=params, timeout=10)
resp.raise_for_status()  # fail early on HTTP errors
data = resp.json()
for match in data.get("matches", []):
    print("Match:", match["homeTeam"]["name"], "-", match["awayTeam"]["name"])
    print("Tournament:", match["tournament"]["name"])
    print("Current minute:", match.get("currentMatchMinute"))
    print("Has statistics:", bool(match.get("matchStatistics")))
    print("Has live events:", bool(match.get("liveEvents")))
    print("Has odds:", bool(match.get("oddsBase")))
    print("---")

Choosing metrics and features for the team dominance model based on match statistics

The next step is to transform raw metrics from the sports API into informative features for the model. Dominance in football is often assessed with four blocks of metrics: control (possession, accurate passes, entries into the final third), chance creation (shots, shots on target, big chances), pressure (corners, free kicks near the goal, entries into the penalty area), and defensive actions (tackles, interceptions, clearances, saves). In the API response, these parameters appear as groups and items within the matchStatistics array.

It is important not only to take absolute values but also to normalize them by time and context. For example, shots and shots on target are divided by the number of minutes played so that segments of different lengths can be compared. Ball possession is more convenient as a fraction from 0 to 1 than as a string such as "54%". Some metrics are best reduced to the difference between the home and away teams or to relative superiority: (home − away) / (home + away). This makes the features less sensitive to the overall pace of the game and the level of the teams.
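
The three normalizations described above can be sketched as small helper functions. The function names are illustrative; the handling of empty intervals and zero totals is an assumption, chosen so that a segment with no events counts as balanced.

```python
def parse_percent(value: str) -> float:
    """Turn a string such as '54%' into the fraction 0.54."""
    return float(value.strip().rstrip("%")) / 100.0


def per_minute(count: float, minutes: float) -> float:
    """Event rate per minute played; 0.0 for an empty interval."""
    return count / minutes if minutes > 0 else 0.0


def relative_superiority(home: float, away: float) -> float:
    """(home - away) / (home + away), in [-1, 1]; 0.0 when both are zero."""
    total = home + away
    return (home - away) / total if total else 0.0
```

For example, 12 shots against 4 gives a relative superiority of 0.5 regardless of whether the counts come from a full match or a ten-minute window.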

Additionally, it makes sense to include features based on live events and odds: the presence of red cards, recent goals, the number of substitutions, how strongly the starting odds favor one team, and the current market imbalance in live play. From this data, a composite dominance index is formed, which then serves as the target variable or as an auxiliary target for more complex AI models, for example, one that predicts the next goal.

from typing import Dict, Any

def build_features_from_match(match: Dict[str, Any]) -> Dict[str, float]:
    """Example of extracting simple dominance features from the match structure."""
    features: Dict[str, float] = {}
    stats_all = next(
        (p for p in match.get("matchStatistics", []) if p.get("period") == "ALL"),
        None,
    )
    if not stats_all:
        return features
    def get_item(key: str):
        for g in stats_all["groups"]:
            for it in g["statisticsItems"]:
                if it["key"] == key:
                    return it
        return None
    # Example features: possession and shots
    poss = get_item("ballPossession")
    shots_total = get_item("totalShotsOnGoal")
    shots_on = get_item("shotsOnGoal")
    if poss:
        features["possession_home"] = poss["homeValue"] / 100.0
        features["possession_diff"] = (poss["homeValue"] - poss["awayValue"]) / 100.0
    if shots_total:
        features["shots_total_diff"] = (
            shots_total["homeValue"] - shots_total["awayValue"]
        )
    if shots_on:
        features["shots_on_diff"] = shots_on["homeValue"] - shots_on["awayValue"]
    # Example binary feature: whether the home/away team has received a red card
    home_red = any(
        ev["type"] == "card" and ev["team"] == "home" and "red" in ev.get("class", "")
        for ev in match.get("liveEvents", [])
    )
    away_red = any(
        ev["type"] == "card" and ev["team"] == "away" and "red" in ev.get("class", "")
        for ev in match.get("liveEvents", [])
    )
    features["home_red_card"] = float(home_red)
    features["away_red_card"] = float(away_red)
    return features
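
A composite dominance index can be formed from such features in many ways. The sketch below is one simple option, not a prescribed formula: it assumes each input is a relative-superiority value in [-1, 1], and the weights are hypothetical placeholders that would need to be tuned or learned.

```python
from typing import Dict

# Hypothetical weights for illustration only; tune or learn them on real data.
WEIGHTS: Dict[str, float] = {
    "possession": 0.2,
    "shots": 0.3,
    "shots_on_target": 0.5,
}


def dominance_index(rel: Dict[str, float]) -> float:
    """Weighted mean of relative-superiority features, rescaled to [0, 1].

    Each value in `rel` is expected in [-1, 1] (positive favors the home team).
    Missing features are treated as 0.0, i.e. balanced.
    """
    total_weight = sum(WEIGHTS.values())
    score = sum(w * rel.get(name, 0.0) for name, w in WEIGHTS.items()) / total_weight
    return (score + 1.0) / 2.0
```

A value of 0.5 means a balanced game, values near 1 indicate home dominance, and values near 0 indicate away dominance.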

How to prepare data from the sports API for training a machine learning model

Preparing the dataset is a critical step before training any model, dominance detection included. You will need to collect match history over several seasons for the selected tournaments and sports, exporting it via the /v2/{sportSlug}/matches method. For each match, save the statistics for the entire match and by period, the events, the odds, and the final result. The target variable is then derived from this data: for example, the class "home team dominated" if the home team created more dangerous chances and outperformed the opponent on an xG proxy (shots, shots on target, big chances).

Next, handle missing values and bring the data to a unified format. Statistics by period (first half, second half, full match) can be aggregated or used as time slices. Text percentages are converted to numbers, and compound strings like "70/135 (52%)" are parsed into counts and percentages. Matches with incomplete statistics or erroneous values should be filtered out. The result is a "match – features" table in which each row corresponds to either a whole match or a fixed point in the match (for example, the 60th minute).
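
Parsing the compound strings can be sketched with a regular expression. The pattern below assumes the "made/attempted (percent%)" layout shown in the example; other layouts return None so that such rows can be filtered out or inspected separately.

```python
import re
from typing import Optional, Tuple

# Matches "70/135 (52%)" and also the bare form "70/135"
_RATIO_RE = re.compile(
    r"^\s*(\d+)\s*/\s*(\d+)\s*(?:\(\s*(\d+(?:\.\d+)?)\s*%\s*\))?\s*$"
)


def parse_ratio(value: str) -> Optional[Tuple[int, int, float]]:
    """Parse '70/135 (52%)' into (70, 135, 0.52); None if the format is unexpected."""
    m = _RATIO_RE.match(value)
    if not m:
        return None
    made, attempted = int(m.group(1)), int(m.group(2))
    if m.group(3) is not None:
        share = float(m.group(3)) / 100.0
    else:
        # No explicit percentage: derive it from the counts
        share = made / attempted if attempted else 0.0
    return made, attempted, share
```

Keeping both the counts and the share is deliberate: the counts feed per-minute rates, while the share is already pace-independent.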

Finally, split the dataset into training, validation, and test samples by time to avoid leaking future information. Splitting by season or along the time axis is most common. At this stage, it is useful to save the prepared set in a convenient format (CSV/Parquet) and to document the feature construction code so it can later be applied to live data. An API key for automated export can be obtained in the personal account, after which the process can be integrated into a regular ETL pipeline.
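
A chronological split can be sketched in a few lines. The example assumes each row carries a `date` field in ISO format (YYYY-MM-DD), which sorts correctly as a string; the cut-off dates are arbitrary placeholders, and the real format of the API's date field should be checked before relying on string comparison.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_by_date(
    rows: List[Row], train_end: str, valid_end: str
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split rows into train / validation / test by ISO date string.

    Rows dated up to `train_end` go to train, rows up to `valid_end`
    go to validation, and everything later goes to test.
    """
    train = [r for r in rows if r["date"] <= train_end]
    valid = [r for r in rows if train_end < r["date"] <= valid_end]
    test = [r for r in rows if r["date"] > valid_end]
    return train, valid, test
```

Because every test row is later than every training row, the evaluation mimics how the model would be used on future matches.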

import requests
import datetime as dt
import pandas as pd
API_KEY = "YOUR_API_KEY"
SPORT = "football"
BASE_URL = f"https://api.api-sport.ru/v2/{SPORT}"
headers = {"Authorization": API_KEY}
rows = []
start_date = dt.date(2025, 8, 1)
end_date = dt.date(2025, 8, 31)
cur = start_date
while cur <= end_date:
    params = {"date": cur.isoformat(), "status": "finished"}
    resp = requests.get(f"{BASE_URL}/matches", headers=headers, params=params, timeout=10)
    resp.raise_for_status()  # fail early on HTTP errors
    data = resp.json()
    for match in data.get("matches", []):
        # Here you can call build_features_from_match(match)
        # and add the target variable dominance_label
        row = {"match_id": match["id"], "date": match["dateEvent"]}
        # ... fill in the features ...
        rows.append(row)
    cur += dt.timedelta(days=1)
_df = pd.DataFrame(rows)
_df.to_csv("matches_features.csv", index=False)

How to train and test the team dominance detection model step by step

Once the dataset is prepared, you can proceed to model training. At the first stage, relatively simple and well-understood algorithms are sufficient: logistic regression, decision trees, or gradient boosting. The target variable indicates dominance (for example, 1 if the home team dominated, 0 otherwise), while the input features are built from statistics and events. It is important to split the data by time: early seasons for training, later ones for validation and final testing. This gives a realistic estimate of quality under conditions close to the model's operation in production.

During training, monitor class balance and choose quality metrics that reflect the task: ROC-AUC, precision/recall, F1-score, and the Brier score for probabilistic forecasts. Cross-validation by seasons or time blocks helps confirm that the model does not overfit to a specific tournament or year. Feature importance will indicate which statistical parameters are most strongly associated with dominance: perhaps not only shots and possession but also the structure of passes, the number of entries into the final third, or pressure from corners.
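
To make the metrics concrete, here is a dependency-free sketch of the Brier score and of precision, recall, and F1 at a fixed threshold. In a real project the equivalent functions from a metrics library would normally be used; the 0.5 threshold is an assumption.

```python
from typing import Dict, Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    if not y_true:
        raise ValueError("empty input")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def precision_recall_f1(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Precision, recall, and F1 for the positive class at the given threshold."""
    pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 1)
    fp = sum(1 for y, q in zip(y_true, pred) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

The Brier score rewards well-calibrated probabilities, which matters when the output is shown as a dominance index rather than a hard class.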

After selecting the final model, test it on a holdout sample and on several real matches, replaying them minute by minute. This shows how consistently the model reacts to goals, red cards, and sharp changes in statistics. Then you can export the trained weights or serialize the model for integration with the production service, which will call the sports API and update the dominance assessment in real time.
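
A minute-by-minute replay can be sketched as follows. The snapshot layout (a `minute` and a `features` dictionary) and the `score_fn` callable are assumptions standing in for stored match states and a trained model's scoring function.

```python
from typing import Callable, Dict, List, Tuple

Snapshot = Dict[str, object]  # assumed keys: "minute" (int), "features" (dict)


def replay_match(
    snapshots: List[Snapshot], score_fn: Callable[[dict], float]
) -> List[Tuple[int, float]]:
    """Score stored snapshots in chronological order; returns (minute, score) pairs."""
    ordered = sorted(snapshots, key=lambda s: s["minute"])
    return [(int(s["minute"]), float(score_fn(s["features"]))) for s in ordered]


def sharp_changes(timeline: List[Tuple[int, float]], threshold: float = 0.2) -> List[int]:
    """Minutes where the score moved by more than `threshold` since the previous snapshot."""
    return [
        m2
        for (_, s1), (m2, s2) in zip(timeline, timeline[1:])
        if abs(s2 - s1) > threshold
    ]
```

Comparing the flagged minutes with the match's goals and red cards is a quick sanity check: jumps should line up with real events rather than appear at random.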

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import GradientBoostingClassifier
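# Load the prepared data; sort chronologically because TimeSeriesSplit assumes time-ordered rows
_df = pd.read_csv("matches_features_labeled.csv").sort_values("date").reset_index(drop=True)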
features = [c for c in _df.columns if c not in ("dominance_label", "match_id", "date")]
X = _df[features].values
y = _df["dominance_label"].values
# Time-based cross-validation
cv = TimeSeriesSplit(n_splits=5)
aucs = []
for train_idx, test_idx in cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    model = GradientBoostingClassifier(random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred)
    aucs.append(auc)
print("Mean ROC-AUC:", sum(aucs) / len(aucs))
# Final refit on the full history
final_model = GradientBoostingClassifier(random_state=42)
final_model.fit(X, y)

Integrating the team dominance detection model with a real-time sports events API

After training and testing the model, the key remaining stage is integration with the production infrastructure and the sports API. In real time, your service regularly requests match data through the /v2/{sportSlug}/matches/{matchId} method, extracts current statistics and events, runs them through the same feature building pipeline, and passes them to the model. The result is the current probability of dominance for one of the teams or an index from -1 to 1, which can be visualized in the interface, used for alerts, or fed into internal trading algorithms.
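
Converting a model probability into the signed index is a simple linear rescaling. The sketch assumes the model outputs the probability that the home team dominates; the alert threshold is a hypothetical value.

```python
def to_signed_index(p_home: float) -> float:
    """Map P(home dominates) in [0, 1] to an index in [-1, 1].

    +1 means full home dominance, -1 full away dominance, 0 a balanced game.
    """
    if not 0.0 <= p_home <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return 2.0 * p_home - 1.0


def should_alert(index: float, threshold: float = 0.6) -> bool:
    """True when either team's dominance reaches the (hypothetical) alert threshold."""
    return abs(index) >= threshold
```

For instance, a probability of 0.9 maps to an index of 0.8 and triggers an alert, while 0.6 maps to roughly 0.2 and does not.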

Currently, integration is built on periodic polling of REST endpoints, but the architecture can already be designed with streaming in mind. As the api-sport.ru platform develops, WebSocket connections and AI services are planned, which will allow receiving match event updates with minimal delay and calling prepared models directly. This is especially important for live analytics, where the dominance assessment should be updated after each dangerous moment, red card, or change in bookmaker odds.

The final part of integration is quality monitoring and logging. For each match, it makes sense to save the sequence of dominance ratings and compare them with the final result and key segments of the game. Based on these logs, the model can be retrained, features can be refined, and the algorithm can be adapted to new tournaments and sports. By combining the flexibility of the sports API with the extensibility of your own model, you build a robust analytics system that scales from a single championship to global coverage of sports and esports.
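
Logging the rating sequence can be as simple as appending JSON lines to a file. This is a minimal sketch; the record fields and the file layout are assumptions, and a production service would more likely write to a database or a log pipeline.

```python
import json
from typing import Dict, List


def append_rating(path: str, match_id: int, minute: int, score: float) -> None:
    """Append one dominance rating as a JSON line."""
    record = {"match_id": match_id, "minute": minute, "score": score}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_ratings(path: str) -> List[Dict[str, object]]:
    """Read all logged ratings back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Calling `append_rating` once per polling cycle produces a per-match timeline that can later be joined with the final result for retraining.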

import time
import requests
API_KEY = "YOUR_API_KEY"
SPORT = "football"
MATCH_ID = 14570728
BASE_URL = f"https://api.api-sport.ru/v2/{SPORT}"
headers = {"Authorization": API_KEY}
POLL_INTERVAL = 60  # seconds
while True:
    resp = requests.get(f"{BASE_URL}/matches/{MATCH_ID}", headers=headers, timeout=10)
    resp.raise_for_status()  # fail early on HTTP errors
    match = resp.json()
    if match["status"] == "finished":
        print("Match finished")
        break
    # build_features_from_match is the function from the feature-building example
    features = build_features_from_match(match)
    # dominance_score = final_model.predict_proba([list(features.values())])[0, 1]
    # In real code, load the saved model and keep the feature order identical to training
    dominance_score = 0.5  # placeholder
    print(
        f"Minute {match.get('currentMatchMinute')}: home dominance index = {dominance_score:.3f}"
    )
    time.sleep(POLL_INTERVAL)