How to parse sports data for ML processing?

Contents

What are sports APIs and what data do they provide for machine learning
Overview of popular sports APIs for data parsing: free and paid solutions
How to choose a sports API for ML tasks: criteria, limits, and data quality
How to parse sports data through an API in Python: examples of requests and code
Preparation and cleaning of sports data from APIs for machine learning models
How to use parsed sports data for predicting match outcomes

What are sports APIs and what data do they provide for machine learning

A sports API is a standardized programming interface that provides structured data about matches, teams, players, and odds in JSON format. Unlike parsing HTML pages, where any layout change breaks data collection, an API provides stable endpoints, a unified response format, and a clear field schema. This is fundamentally important for machine learning tasks: models are sensitive to the quality and completeness of input features.

Through the Sport Events API based on the platform api-sport.ru you can obtain data on various sports: football, hockey, basketball, tennis, table tennis, esports, and other disciplines. The main entities are sports types (/v2/sport), categories and tournaments (/v2/{sportSlug}/categories, /v2/{sportSlug}/tournament/{tournamentId}), matches and events (/v2/{sportSlug}/matches, /v2/{sportSlug}/matches/{matchId}/events), teams and players. For models, extended fields are especially valuable: currentMatchMinute for live tasks, arrays liveEvents, detailed matchStatistics and odds market oddsBase.

Thanks to this level of detail, a wide range of ML scenarios can be built: predicting outcomes and totals, assessing team strength, live models, analyzing the impact of lineup and tactics. You are working not with chaotic HTML, but with a logical hierarchy of objects: match, teams, statistics by periods, events by minutes, bookmaker odds, and even links to video reviews through the field highlights. This reduces the time for preparing datasets and allows you to focus on selecting model architectures and experiments.

Example of obtaining a list of matches for an ML project

curl -X GET \
  'https://api.api-sport.ru/v2/football/matches?date=2025-09-03&status=finished' \
  -H 'Authorization: YOUR_API_KEY'

The response contains an object with a matches field. For each match it provides tournament and season identifiers, per-half statistics, the score (homeScore, awayScore), events, and the basic odds markets. This data can be transformed directly into a feature table for training models.

Overview of popular sports APIs for data parsing: free and paid solutions

The sports API market can be roughly divided into three groups: official APIs of leagues and federations, global aggregators, and specialized commercial services. Official interfaces are often limited to particular sports, require complex registration, and may not provide bookmaker odds. Large international aggregators cover many tournaments but are expensive and often excessive for a narrowly targeted ML project. Specialized solutions such as api-sport.ru focus on applied tasks (match predictions, analytics, and betting) and aim to balance data depth against cost.

Free or freemium APIs usually provide a limited set of sports, a short historical period, and strict limits on requests per minute and per day. This is enough for prototyping models, educational projects, and testing hypotheses. Paid plans typically add extended match history, live data, higher request-processing priority, and additional entities such as detailed statistical groups, team lineups, and bookmaker odds markets. For production ML systems, paid access often becomes necessary.

The platform api-sport.ru combines REST API for mass parsing of historical data and live updates, and actively develops new capabilities: the roadmap includes support for WebSocket for streaming event reception and integration of AI analysis tools. This opens the way to building online models that update the probability of match outcomes in real-time and use bookmaker odds from the block oddsBase as one of the key features.

How to choose a sports API for ML tasks: criteria, limits, and data quality

When choosing a sports API for machine learning, look not only at the price but also at the structure and completeness of the data. Evaluate coverage by sports and tournaments, depth of history (number of seasons and years), availability of advanced match and player statistics, minute-by-minute events, and bookmaker odds. In the Sport Events API, each match carries a matchStatistics field broken down by periods and groups of indicators (shots, possession, duels), a liveEvents array with goals and cards, and oddsBase betting markets for main outcomes and totals. This is sufficient for building complex predictive models.

The second key group of criteria is technical: clear documentation, endpoint stability, response speed, and the rate-limit system. It is important to know how many requests are allowed per minute and per day, whether there are restrictions on the sample size (for example, by date or number of matches), and how erroneous requests are handled. The presence of request filters (date, tournament_id, team_id, status, season_id) directly affects how convenient it is to build training samples and reduces the load on your infrastructure.
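Handling request limits can be sketched as a small retry wrapper with exponential backoff. This is a minimal sketch, not the provider's documented behaviour: the function name get_with_backoff, the injected fetch callable, and the assumption that rate limiting is signalled with HTTP 429 are all hypothetical and should be checked against the API documentation.

```python
import time
from typing import Callable, Tuple


def get_with_backoff(
    fetch: Callable[[], Tuple[int, dict]],
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it succeeds or the retries run out.

    fetch is any zero-argument callable returning (status_code, json_body).
    ASSUMPTION: the API answers 429 when a limit is hit and 5xx on
    transient failures; the real codes may differ.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status == 200:
            return body
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt == max_retries:
            raise RuntimeError(f"request failed with HTTP {status}")
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
```

With the requests library, fetch could be a small function that performs requests.get(...) and returns (response.status_code, response.json()). Injecting sleep keeps the wrapper easy to test without real waiting.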

Finally, do not forget the legal and business aspects: the right to use data in commercial services, access to historical odds, and transparent conditions for scaling limits. Built-in bookmaker data in oddsBase lets the ML team use not only statistics but also the "collective opinion of the market". Combined with the planned WebSocket support and AI modules on the provider's side, this makes a mature service like api-sport.ru a choice that should not need to be revisited in six months.

Example of filtering matches by date and tournaments

import requests

API_KEY = 'YOUR_API_KEY'
url = 'https://api.api-sport.ru/v2/football/matches'
headers = {'Authorization': API_KEY}
params = {
    'date': '2025-09-03',
    'status': 'finished',
    'tournament_id': '7,17'  # several tournaments, comma-separated
}
response = requests.get(url, headers=headers, params=params, timeout=10)
response.raise_for_status()  # fail early on HTTP errors
data = response.json()
print('Total matches:', data.get('totalMatches'))

How to parse sports data through an API in Python: examples of requests and code

To integrate the Sport Events API into an ML project in Python, two basic libraries are sufficient: requests and pandas. First, obtain an API key from your personal account at api-sport.ru, then pass it in the Authorization header with each request. The basic endpoint for fetching matches by sport is /v2/{sportSlug}/matches, where sportSlug is, for example, football, basketball, tennis, or esports.

import requests
import pandas as pd
API_KEY = 'YOUR_API_KEY'
BASE_URL = 'https://api.api-sport.ru/v2/football/matches'
headers = {'Authorization': API_KEY}
params = {
    'date': '2025-09-03',
    'status': 'finished'
}
resp = requests.get(BASE_URL, headers=headers, params=params, timeout=10)
resp.raise_for_status()
raw = resp.json()
matches = raw.get('matches', [])
rows = []
for m in matches:
    rows.append({
        'match_id': m['id'],
        'tournament': m['tournament']['name'],
        'home_team': m['homeTeam']['name'],
        'away_team': m['awayTeam']['name'],
        'home_goals': m['homeScore']['current'],
        'away_goals': m['awayScore']['current'],
        'start_ts': m['startTimestamp'],
        'current_minute': m.get('currentMatchMinute'),
        'has_odds': bool(m.get('oddsBase'))
    })
df = pd.DataFrame(rows)
print(df.head())

In this code snippet, we request all completed football matches for the selected date, extract key fields from the response, and build a tabular representation of the data. Similarly, match events can be parsed through /v2/football/matches/{matchId}/events, statistics through /v2/football/matches/{matchId}, and player and team lists through the endpoints /v2/{sportSlug}/players and /v2/{sportSlug}/teams. Once you have a DataFrame, you can save it to a database or file storage and use it as a basis for feature engineering.
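Once events for a match have been downloaded, they usually need to be collapsed into per-match counts before they can sit next to the other features. The sketch below shows one way to do that. The event keys 'team' and 'type' are hypothetical placeholders, since the actual response schema is not shown here, so adapt them to the real field names.

```python
from collections import Counter


def summarize_events(events):
    """Count events per (team, type) pair and return a flat feature dict.

    ASSUMPTION: each event is a dict with hypothetical keys 'team'
    ('home' or 'away') and 'type' (e.g. 'goal', 'card'); the real
    schema of the events endpoint may differ.
    """
    counts = Counter(
        (e.get("team", "unknown"), e.get("type", "unknown")) for e in events
    )
    return {f"{team}_{etype}_count": n for (team, etype), n in counts.items()}


# Illustrative input, not real API output
sample_events = [
    {"team": "home", "type": "goal", "minute": 12},
    {"team": "away", "type": "card", "minute": 40},
    {"team": "home", "type": "goal", "minute": 77},
]
event_features = summarize_events(sample_events)
```

The resulting dictionary can be merged into the row built for the same match_id.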

Preparation and cleaning of sports data from APIs for machine learning models

Parsing sports data through the API is just the first step. For machine learning models to work reliably, the sample needs systematic cleaning and transformation. First, field types should be converted to numerical or categorical form: convert startTimestamp values to the date and time of the match, turn strings like "54%" from matchStatistics into fractions from 0 to 1, and split composite strings (for example, "4/7 (57%)" for accurate passes) into several numerical features. Second, it is important to standardize the identifiers of tournaments, seasons, and teams so that data from different endpoints can be merged without errors.
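The timestamp conversion mentioned above can be sketched with the standard library alone. This assumes startTimestamp is a Unix timestamp in seconds (UTC), which should be verified against real responses. The helper name timestamp_features and the chosen output keys are illustrative.

```python
from datetime import datetime, timezone


def timestamp_features(start_ts):
    """Turn a Unix timestamp into simple calendar features.

    ASSUMPTION: start_ts is in seconds since the epoch, in UTC.
    """
    dt = datetime.fromtimestamp(start_ts, tz=timezone.utc)
    return {
        "kickoff_utc": dt.isoformat(),
        "weekday": dt.weekday(),  # Monday = 0, Sunday = 6
        "hour": dt.hour,
        "month": dt.month,
    }
```

Calendar features such as weekday and kickoff hour are cheap to compute and are known before the match, so they are safe to use in pre-match models.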

The next step is dealing with missing values and outliers. Not all matches have the same set of statistics: for lesser-known tournaments or old seasons, some fields may be missing. Typical strategies include discarding rare tournaments where too few statistical groups are filled, filling in missing values with the league median, or using special marker values. When working with bookmaker odds from the oddsBase block, it is useful to store the original values (field initialDecimal) and current values (decimal), as well as the direction of change (change). This allows models to capture market dynamics.

import pandas as pd

# Assume stats_df is an existing DataFrame with ball possession columns formatted like '54%'
def clean_percent(series):
    # '54%' -> 0.54; values that cannot be parsed become NaN
    return pd.to_numeric(series.str.replace('%', '', regex=False), errors='coerce') / 100.0

stats_df['ball_possession_home'] = clean_percent(stats_df['ball_possession_home'])
stats_df['ball_possession_away'] = clean_percent(stats_df['ball_possession_away'])

# Example of handling a composite string such as '70/135 (52%)'
def split_ratio_with_percent(series):
    # Extract successes and attempts, then return their ratio
    nums = series.str.extract(r'(\d+)/(\d+)').astype(float)
    return nums[0] / nums[1]

stats_df['final_third_eff_home'] = split_ratio_with_percent(stats_df['final_third_phase_home'])
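The league-median imputation and the odds-dynamics idea described above can be sketched as follows. All column names (tournament_id, shots, odds_home_initial, odds_home) are hypothetical and stand for whatever you extracted from matchStatistics and from the initialDecimal and decimal fields of oddsBase.

```python
import pandas as pd


def impute_with_league_median(df, column, league_col="tournament_id"):
    """Fill gaps in `column` with the median of the same league.

    A *_was_missing flag is kept so the model can tell imputed values
    from observed ones. Column names are illustrative assumptions.
    """
    out = df.copy()
    out[f"{column}_was_missing"] = out[column].isna().astype(int)
    league_median = out.groupby(league_col)[column].transform("median")
    out[column] = out[column].fillna(league_median)
    return out


def add_odds_movement(df, initial_col="odds_home_initial",
                      current_col="odds_home", out_col="odds_home_move"):
    """Relative change from the opening to the current decimal odds."""
    out = df.copy()
    out[out_col] = out[current_col] / out[initial_col] - 1.0
    return out
```

In a real pipeline the medians would normally be computed on the training period only and then applied to later data, otherwise information from future matches leaks into the features.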

It is important to maintain temporal causality when forming the sample: for a match, one cannot use statistics that appeared after its completion, or final odds if you are modeling pre-match forecasts. Practically, this means that the dataset must clearly separate features into pre-match (team history, lineups, pre-match odds) and post-match (actual goals, shots, cards), which are used only as a target variable and for subsequent model quality analysis.
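One common way to respect this constraint when building "recent form" features is to shift each team's history by one match before aggregating, so that a row never sees its own result. The sketch below assumes a long-format table with one row per team per match and hypothetical columns team, start_ts, and goals.

```python
import pandas as pd


def add_prematch_form(df, window=5):
    """Average goals scored by each team over its previous `window` matches.

    shift(1) removes the current match from its own window, so the
    feature only uses information available before kickoff.
    ASSUMPTION: columns 'team', 'start_ts' and 'goals' exist.
    """
    out = df.sort_values(["team", "start_ts"]).copy()
    out["goals_form"] = out.groupby("team")["goals"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return out
```

A team's first match in the dataset gets NaN, because no earlier history exists. That gap can then be handled with the same missing-value strategy as the rest of the features.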

How to use parsed sports data for predicting match outcomes

After cleaning and aggregating data from the sports API, you can proceed to model building. The most common case is predicting the match outcome (home win, draw, away win) or total goals. Features include historical results, aggregates of statistics from recent matches, opponent strength, league position, and bookmaker odds from oddsBase. For classification tasks, logistic regression, gradient boosting, and neural networks are suitable; for total-goals regression, use the regression counterparts of these methods with an appropriate metric (MAE, RMSE).
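Decimal odds are often more useful to a model after conversion to probabilities. The inverse of each decimal price is an implied probability, and the three inverses sum to more than 1 because of the bookmaker margin. The sketch below uses simple proportional normalization, which is one of several possible ways to remove the margin. The function name is illustrative.

```python
def implied_probabilities(odds_home, odds_draw, odds_away):
    """Convert decimal 1X2 odds to probabilities that sum to 1.

    Returns (probabilities, margin), where margin is the amount by which
    the raw inverse odds exceed 1 (the bookmaker overround).
    """
    raw = [1.0 / odds_home, 1.0 / odds_draw, 1.0 / odds_away]
    total = sum(raw)
    return [p / total for p in raw], total - 1.0


probs, margin = implied_probabilities(2.0, 3.2, 4.0)
```

For odds of 2.0, 3.2, and 4.0 the raw inverses are 0.5, 0.3125, and 0.25, so the margin is 0.0625 and the normalized home-win probability is 0.5 / 1.0625.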

A key point is splitting the sample into training and test sets with time in mind. Matches cannot be shuffled randomly: future games must not enter the training sample of a model that is evaluated on the past. Most often, data is sorted by date, with the first 70-80% of matches used for training and the rest for validation and testing. Additionally, cross-validation can be set up with temporal "rolling windows" to check how stable the model's quality is across different periods.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
# df is a prepared DataFrame with features and the target variable target_home_win (0/1)
features = [
    'home_goals_last5', 'away_goals_last5',
    'shots_on_goal_diff', 'ball_possession_diff',
    'odds_home', 'odds_draw', 'odds_away'
]
X = df[features]
y = df['target_home_win']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False  # chronological split: df must be sorted by date
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
print('ROC-AUC:', roc_auc_score(y_test, probs))
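The time-aware cross-validation mentioned earlier can be sketched with scikit-learn's TimeSeriesSplit. By default it produces expanding windows (each fold trains on all earlier rows), and passing max_train_size turns it into a fixed-length rolling window. The helper name rolling_window_scores and the choice of log loss as the metric are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import TimeSeriesSplit


def rolling_window_scores(X, y, n_splits=4):
    """Evaluate a model on successive time-ordered folds.

    Each fold trains on earlier rows and tests on the block that follows.
    ASSUMPTION: X and y are NumPy arrays already sorted by match start
    time, and every training fold contains both classes.
    """
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[test_idx])[:, 1]
        scores.append(log_loss(y[test_idx], probs, labels=[0, 1]))
    return scores
```

With the DataFrame from the example above, the call could look like rolling_window_scores(df[features].to_numpy(), df['target_home_win'].to_numpy()). A large spread between folds suggests the model's quality is unstable over time.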

Based on such models, it is possible to build betting recommendation systems, analytics services for fans and professional clubs, as well as automated scripts that track discrepancies between the model probabilities and the bookmakers’ line. Using a reliable data source, such as the Sport Events API from api-sport.ru, combined with future support for WebSocket and AI tools, allows for the construction of a comprehensive ML pipeline: from data collection and streaming processing to online updating of predictions during the match.