Training models on historical data: what you need to know

Contents

What is the API for sports events and what data does it provide
Where to get historical data on sports events via API
Requirements for historical data for training forecasting models
How to prepare historical data from the API for machine learning
How to train a forecasting model for sports events on historical data
Best practices for working with sports statistics API when training models
Legal restrictions and risks of using sports events API in Russia

What is the API for sports events and what data does it provide

The sports events API is a software interface that provides developers with standardized access to match schedules, results, advanced statistics, team lineups, and bookmaker odds. Instead of parsing websites, you make an HTTP request to the server and receive structured JSON, ready for use in analytics, machine learning models, and betting services.

Services such as the Sports statistics API provide a unified data format for several sports at once: football, hockey, basketball, tennis, table tennis, esports, and other disciplines. Through a single Sport Events API, you get:

a list of sports and their basic paths: endpoint /v2/sport;
categories and tournaments: /v2/{sportSlug}/categories, /v2/{sportSlug}/tournament/{tournamentId};
matches and match details: /v2/{sportSlug}/matches, /v2/{sportSlug}/matches/{matchId};
events during the game: /v2/{sportSlug}/matches/{matchId}/events;
teams and players with detailed information: /v2/{sportSlug}/teams, /v2/{sportSlug}/players;
field oddsBase in the match object — betting markets and bookmaker odds for multiple outcomes.

Thanks to a rich set of fields (score by halves, live events, the advanced statistics block matchStatistics, odds with their movement over time), a single connection to the API covers the needs of both classic sports analytics and betting models. HTTP endpoints are already available, and upcoming releases are expected to add WebSocket subscriptions and AI services for building intelligent hints and auto-generating features.

Example of a simple API request for football matches on a specific date in JavaScript:

const API_KEY = 'YOUR_API_KEY';
fetch('https://api.api-sport.ru/v2/football/matches?date=2025-09-03', {
  headers: {
    Authorization: API_KEY,
  },
})
  .then((response) => response.json())
  .then((data) => {
    console.log('Total matches:', data.totalMatches);
    console.log('First match:', data.matches[0]);
  })
  .catch((error) => {
    console.error('Sport Events API request failed', error);
  });

Such a response can be easily converted into a set of features for prediction models: you can take the current score, ball possession, number of shots, bookmaker odds, and use them for training or online inference.

Where to get historical data on sports events via API

Historical data is the foundation of any sports event prediction model. In the Sport Events API from api-sport.ru, match history is available through the same endpoints as current games, but with filters by date, tournament, season, and teams. This simplifies the move from real-time data to the archive: you only change the request parameters.

The basic way to get history is to use the date filter in the /v2/{sportSlug}/matches method. You pass a past date in the format YYYY-MM-DD and receive all matches for that date. To filter by league, use tournament_id (a comma-separated list is supported). To filter by season, combine tournament_id with season_id, which can be requested through /v2/{sportSlug}/tournament/{tournamentId}/seasons.

After registering in the developer's personal account, you receive an API key and can programmatically export historical data to your storage, whether it's a database, a data lake, or just a set of files. Below is an example request in Python that retrieves all football matches for a past date and outputs the number of games found:

import requests

API_KEY = 'YOUR_API_KEY'
headers = {
    'Authorization': API_KEY,
}
params = {
    'date': '2023-09-01',  # a historical date in YYYY-MM-DD format
}
# Request every football match played on the given date
response = requests.get(
    'https://api.api-sport.ru/v2/football/matches',
    params=params,
    headers=headers,
    timeout=10,
)
response.raise_for_status()  # fail loudly on HTTP errors
data = response.json()
print('Matches in the sample:', data.get('totalMatches'))
# Show the first three matches as "tournament - home vs away"
for match in data.get('matches', [])[:3]:
    print(match['tournament']['name'], '-', match['homeTeam']['name'], 'vs', match['awayTeam']['name'])

The same approach can be scaled: the script iterates through days, seasons, or tournaments, accumulates an archive, and forms a unified dataset. An important advantage of the Sport Events API is the unified response schema across different sports. This allows for building cross-sport models and using common data preparation pipelines.
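The day-by-day iteration described above can be sketched as follows. This is a minimal illustration, not the provider's recommended client: the helper names iter_dates and collect_archive are hypothetical, and fetch_day stands for any function that performs the request shown earlier for one date and returns the parsed JSON. A stub replaces the network call here so that the sketch runs offline.

```python
from datetime import date, timedelta


def iter_dates(start, end):
    """Yield every calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def collect_archive(start, end, fetch_day):
    """Call fetch_day('YYYY-MM-DD') for each day and merge the 'matches' lists."""
    archive = []
    for day in iter_dates(start, end):
        payload = fetch_day(day.isoformat())
        archive.extend(payload.get('matches', []))
    return archive


# Stand-in for a real HTTP call (hypothetical data, one match per day).
def fake_fetch_day(day_str):
    return {'matches': [{'id': day_str, 'date': day_str}]}


archive = collect_archive(date(2023, 9, 1), date(2023, 9, 3), fake_fetch_day)
print(len(archive), 'matches collected')
```

In a real export, fetch_day would wrap requests.get with the date parameter, and the loop would also write each response to storage before moving on.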

Requirements for historical data for training forecasting models

The quality of historical data directly determines the accuracy of any sports event prediction model. For practical tasks (evaluating match outcomes, totals, handicaps, individual player statistics), the data must meet several key requirements. The Sport Events API takes these requirements into account at the level of structure and content of the responses.

First of all, completeness is important. The archive should show not only the final score but also the context of the game: events during the match, statistics by periods, team lineups, and basic metadata of tournaments and seasons. In the Sport Events API, this is covered by the fields homeScore and awayScore (with a breakdown by halves), the liveEvents array (goals, cards, substitutions), and the matchStatistics block with advanced statistics (shots, possession, tackles, etc.).

Secondly, correct price signals are necessary for betting models. The field oddsBase in the match object contains betting markets, outcome groups, and bookmaker odds with both current and opening values. Such data allows building models based on the closing line, analyzing odds movement, and assessing the bookmaker margin.
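To make the margin assessment concrete, here is a small sketch. It uses the standard relation between decimal odds and implied probability (1 / odds) and removes the margin by proportional scaling, which is one common convention among several. The function name and the sample odds are illustrative and do not come from the API.

```python
def implied_probabilities(decimal_odds):
    """Return (fair_probabilities, margin) for one market.

    The raw implied probabilities 1/odds sum to more than 1; the excess
    is the bookmaker margin (overround). Proportional scaling is used
    here to normalise them, which is a simplifying assumption.
    """
    raw = [1.0 / odds for odds in decimal_odds]
    overround = sum(raw)
    margin = overround - 1.0
    fair = [p / overround for p in raw]
    return fair, margin


# Hypothetical 1X2 odds: home win, draw, away win
fair, margin = implied_probabilities([2.10, 3.40, 3.60])
print('Margin: {:.2%}'.format(margin))
print('Fair probabilities:', [round(p, 3) for p in fair])
```

The normalised probabilities can serve as features or as a baseline that a model has to beat.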

Thirdly, consistency and unambiguity of identifiers are important. Teams, players, tournaments, and seasons in the API have stable IDs, while each match has a unique identifier and a start timestamp, startTimestamp. This eliminates duplicates and simplifies merging data from different periods. When training models, this is critical: each row of the dataset must uniquely correspond to one match or one segment of a match.
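A deduplication step built on these identifiers might look like the sketch below. The key name 'id' is an assumption about the match object (check the actual response), while startTimestamp is the field named above. When the same match appears more than once, for example after re-downloading a date, the later copy wins.

```python
def deduplicate_matches(matches, key='id'):
    """Keep one record per match identifier, ordered by start time.

    Assumes each match dict has a unique identifier under `key`
    (hypothetical name) and a numeric 'startTimestamp'.
    """
    unique = {}
    for match in matches:
        unique[match[key]] = match  # a later copy overwrites an earlier one
    return sorted(unique.values(), key=lambda m: m['startTimestamp'])


raw_matches = [
    {'id': 2, 'startTimestamp': 200, 'status': 'notstarted'},
    {'id': 1, 'startTimestamp': 100, 'status': 'finished'},
    {'id': 2, 'startTimestamp': 200, 'status': 'finished'},  # refreshed copy
]
clean_matches = deduplicate_matches(raw_matches)
print([m['id'] for m in clean_matches])
```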

Finally, temporal depth and regularity are important. For most popular tournaments, the Sport Events API provides match history spanning several seasons, allowing models to see long-term trends. At the same time, a response format that stays the same from season to season simplifies dataset updates: you can regularly load new data with the same requests without breaking the existing schema.

How to prepare historical data from the API for machine learning

Raw responses from the Sport Events API are already well-structured, but for training models, they need to be transformed into a tabular format and carefully cleaned. Usually, the preparation process includes several steps: merging responses for different dates or tournaments, selecting necessary fields, normalizing values, encoding categories, and generating features.

The basic scheme is as follows. First, you export the match archive through /v2/{sportSlug}/matches with filters by dates and tournaments. Then, for the matches you need, you request additional details: events via /v2/{sportSlug}/matches/{matchId}/events, team lineups, and statistics. After that, you form one dataset row per match or per time slice of a match (for example, the state at the 60th minute). The row includes numerical features (shots, possession, odds from oddsBase), categorical features (league, country, home/away team), and the target variable.

Below is a simplified example of Python code that transforms a list of matches from the API into a tabular dataset with basic features and a target label for the outcome (home win, draw, away win):

import pandas as pd

# matches_json is the parsed response of /v2/football/matches
def build_dataset(matches_json):
    rows = []
    for match in matches_json.get('matches', []):
        home = match['homeTeam']['name']
        away = match['awayTeam']['name']
        home_score = match['homeScore'].get('current')
        away_score = match['awayScore'].get('current')
        # Skip matches without a score (for example, not yet played);
        # comparing None values below would raise a TypeError.
        if home_score is None or away_score is None:
            continue
        if home_score > away_score:
            outcome = 1  # home win
        elif home_score == away_score:
            outcome = 0  # draw
        else:
            outcome = -1  # away win
        odds_market = None
        home_odds = away_odds = draw_odds = None
        for market in match.get('oddsBase', []):
            if market.get('group') == '1X2':
                odds_market = market
                break
        if odds_market:
            choices = odds_market.get('choices', [])
            # the order is usually 1, X, 2
            if len(choices) >= 3:
                home_odds = choices[0]['decimal']
                draw_odds = choices[1]['decimal']
                away_odds = choices[2]['decimal']
        rows.append({
            'tournament': match['tournament']['name'],
            'home_team': home,
            'away_team': away,
            'home_score': home_score,
            'away_score': away_score,
            'home_odds': home_odds,
            'draw_odds': draw_odds,
            'away_odds': away_odds,
            'outcome': outcome,
        })
    return pd.DataFrame(rows)

In a real project, these basic fields are usually supplemented with aggregated metrics from previous matches of the team, player form, opponent strength, and other domain-specific features. All the necessary raw information is already available through the sports events API, and the developer’s task is to organize and prepare it correctly for training.
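One such aggregated metric, recent home form, can be sketched with pandas as shown below. The column start_ts (match start time) is an assumption: it is not part of the dataset built above and would need to be added from startTimestamp. For brevity, the sketch counts only a team's previous home matches; a fuller version would also include away games. The shift(1) call keeps the current match out of its own feature, which avoids leaking the target.

```python
import pandas as pd


def add_home_form(df, window=3, time_col='start_ts'):
    """Add 'home_form': mean points in the home team's previous home matches."""
    out = df.sort_values(time_col).copy()
    # 3 points for a home win, 1 for a draw, 0 for a loss
    out['home_points'] = out['outcome'].map({1: 3, 0: 1, -1: 0})
    out['home_form'] = out.groupby('home_team')['home_points'].transform(
        # shift(1) excludes the current match; rolling averages what came before
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return out


sample = pd.DataFrame({
    'start_ts': [1, 2, 3, 4],
    'home_team': ['A', 'B', 'A', 'A'],
    'outcome': [1, 0, 0, -1],
})
featured = add_home_form(sample)
print(featured[['start_ts', 'home_team', 'home_form']])
```

The first match of each team has no history, so its home_form is NaN and has to be imputed or dropped before training.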

How to train a forecasting model for sports events on historical data

When historical data from the Sport Events API is prepared in tabular form, you can proceed to model training. The specific algorithm depends on the task: logistic regression and gradient boosting suit match outcome classification, regression models suit score and total prediction, and Bayesian and temporal models suit online probability estimation during live play.

The overall scheme looks like this. First, you divide the dataset into training, validation, and test samples by time, so the model does not "peek" into the future. Then you select the target variable: for example, the 1X2 outcome, the probability of the total being above a certain threshold, or the score spread. After that, you scale the numerical features, encode the categorical ones (league, team, country), and train the base model. At the final stage, you evaluate the quality using appropriate metrics: accuracy, log loss, and ROC-AUC, and for betting, expected return and profit stability.
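The time-based split from the first step can be sketched as follows. The column name start_ts and the 70/15/15 proportions are assumptions chosen for illustration; the only essential property is that every training match starts before every validation match, which in turn starts before every test match.

```python
import pandas as pd


def chronological_split(df, time_col='start_ts', train=0.7, val=0.15):
    """Split into train/validation/test by time, oldest matches first."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    n = len(ordered)
    train_end = round(n * train)
    val_end = round(n * (train + val))
    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]


# Hypothetical dataset given in reverse chronological order
matches = pd.DataFrame({
    'start_ts': list(range(20, 0, -1)),
    'outcome': [1, 0] * 10,
})
train_df, val_df, test_df = chronological_split(matches)
print(len(train_df), len(val_df), len(test_df))
```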

Below is a brief example of training a simple 1X2 match outcome model in Python using scikit-learn. It is assumed that you already have features X and labels y, formed from Sport Events API data and sorted by match start time:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# X, y are the result of preparing data from the Sport Events API,
# with rows sorted by match start time (oldest first)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False  # no shuffling: the test set is the most recent 20%
)
# With the default lbfgs solver, a multiclass target is fit as a multinomial model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
# Pass the class labels explicitly in case the test set lacks one of the outcomes
print('LogLoss:', log_loss(y_test, y_proba, labels=model.classes_))

Based on such a basic model, more complex solutions can be built: ensembles, models that account for match time, and models that integrate live parameters and odds movements. Historical data from the sports events API is suitable for both offline training on a large archive and periodic retraining on fresh matches when you want to adapt the model to new seasons and changes in leagues.

Best practices for working with sports statistics API when training models

To get the most out of historical data and not overload the infrastructure, it is important to structure your interaction with the sports statistics API properly. The platform api-sport.ru supports working with large volumes of information, but efficiency depends on how you design your data collection and training pipelines.

First, avoid direct API calls in training loops. Historical matches should be exported in batches once and stored in your own storage, and models should be trained on local data. For this, use the filters date, tournament_id, and season_id, as well as the identifier list parameter ids, which lets you retrieve up to 100 matches or teams in a single request.
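Splitting a long list of identifiers into batches of at most 100 can be sketched as shown below. The helper names are hypothetical, and the comma-separated encoding of ids is an assumption modeled on how tournament_id is described; check the API documentation for the exact format.

```python
def chunked(ids, size=100):
    """Yield consecutive slices of ids with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def ids_param(chunk):
    """Join identifiers into one query value (assumed comma-separated)."""
    return ','.join(str(i) for i in chunk)


match_ids = list(range(1, 251))  # 250 hypothetical match identifiers
batches = list(chunked(match_ids))
print([len(b) for b in batches])
print(ids_param(batches[2][:3]))
```

Each batch then becomes one request, so 250 matches cost three calls instead of 250.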

Second, implement caching and reloading. Store the raw JSON responses from the Sport Events API along with the version of the schema used. This will allow you to reproduce training samples and correctly retrain models when new fields appear, such as additional metrics in matchStatistics or new markets in oddsBase. When the API changes, you will be able to update only the data preparation stage without touching the rest of the infrastructure.
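A file-based cache along these lines is sketched below. The directory layout and helper names are illustrative, and using the URL prefix v2 as the schema version is an assumption. A temporary directory stands in for real storage so the example leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path

SCHEMA_VERSION = 'v2'  # assumption: taken from the /v2/ prefix of the endpoints


def cache_path(cache_dir, sport, day):
    """Build <cache_dir>/<schema version>/<sport>/<day>.json."""
    return Path(cache_dir) / SCHEMA_VERSION / sport / f'{day}.json'


def save_raw(payload, cache_dir, sport, day):
    """Write the untouched API response to disk and return its path."""
    path = cache_path(cache_dir, sport, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, ensure_ascii=False), encoding='utf-8')
    return path


def load_raw(cache_dir, sport, day):
    """Return the cached response, or None if that day was never downloaded."""
    path = cache_path(cache_dir, sport, day)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding='utf-8'))


with tempfile.TemporaryDirectory() as tmp:
    save_raw({'totalMatches': 1, 'matches': [{'id': 1}]}, tmp, 'football', '2023-09-01')
    cached = load_raw(tmp, 'football', '2023-09-01')
    missing = load_raw(tmp, 'football', '2023-09-02')

print(cached, missing)
```

Checking load_raw before each request turns a repeated export into a no-op for days that are already on disk.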

Third, monitor reliability and speed. Handle network errors, build in retries, and limit request frequency. Soon, the Sport Events API ecosystem will feature WebSocket streams for live data and AI services that will help build hybrid solutions: the model is trained on historical archives, while in real-time it receives updates on events and odds, without wasting time on new HTTP requests.
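Retries with exponential backoff, one way to handle transient network errors, can be sketched like this. The function name and default values are illustrative. The sleep function is passed in as a parameter so the example runs instantly; in production the default time.sleep would be used, and the caught exception types would match your HTTP client (for example, requests.ConnectionError).

```python
import time


def call_with_retries(func, attempts=4, base_delay=0.5,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a listed error wait base_delay * 2**attempt and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Simulated endpoint that fails twice before succeeding
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('temporary failure')
    return 'ok'


delays = []
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

The growing pauses between attempts also act as a simple brake on request frequency during an outage.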

An example of a request for detailed match information with bookmaker odds that can be used for enriching the training sample:

const API_KEY = 'YOUR_API_KEY';
const matchId = 14570728;
fetch(`https://api.api-sport.ru/v2/football/matches/${matchId}`, {
  headers: {
    Authorization: API_KEY,
  },
})
  .then((response) => response.json())
  .then((match) => {
    console.log('Match status:', match.status);
    console.log('Advanced statistics:', match.matchStatistics);
    console.log('Bookmaker odds (oddsBase):', match.oddsBase);
  })
  .catch((error) => {
    console.error('Failed to fetch match data', error);
  });

By following these practices, you reduce the load on the API, speed up training, and make your models more resilient to changes in data and infrastructure.

Legal restrictions and risks of using sports events API in Russia

When working with the sports events API in Russia, it is important to consider not only technical but also legal aspects. Access to match statistics, results, and bookmaker odds through the Sport Events API does not violate legislation by itself. However, the ways of using this data may be subject to regulation, especially when it comes to betting and interaction with end users.

As of 2024, the organization of gambling and the acceptance of bets in Russia are strictly regulated, including by Federal Law No. 244-FZ. If you plan to launch a product based on data from the sports events API that involves accepting bets from Russian users, you need to consider the requirements for licensing, client identification, and payment processing. The development of analytical services, forecasting models, recommendation systems, and internal risk models usually does not require a separate license, but it is always better to consult with a lawyer.

Special attention should be paid to the terms of use of the data from the API provider itself. The user agreement and documentation usually specify which scenarios are permitted: internal analytical accounting, public display of odds, use in mobile applications, integration with third-party products. Violating these terms may lead to the blocking of the key or other restrictions, even if you are not formally violating state legislation.

Finally, it is important to adhere to general principles of information security and data protection. Do not publicly disclose your API key, restrict access to historical data that is commercially valuable, and ensure that your models and interfaces do not mislead users regarding the risks of betting and the probability of outcomes. All of this will allow for the safe and legal use of the sports events API to build advanced analytical and betting solutions.
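One simple way to keep the API key out of source code and version control is to read it from an environment variable, as in the sketch below. The variable name SPORT_API_KEY and both helper names are hypothetical. The header format follows the request examples earlier in the article.

```python
import os


def load_api_key(env_var='SPORT_API_KEY'):
    """Read the key from the environment; fail early if it is not set."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f'Set the {env_var} environment variable first')
    return key


def auth_headers(api_key):
    """Build the Authorization header used in the request examples."""
    return {'Authorization': api_key}
```

A typical call is auth_headers(load_api_key()), with the variable set in the shell or in the deployment's secret storage rather than in the repository.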