- How to obtain sports data via API for machine learning
- Cleaning sports data of missing values, outliers, and duplicates before ML
- Normalization and scaling of sports metrics for machine learning models
- How to combine sports data from different APIs and bring them to a unified format
- Standardization of time series and events in sports data for ML processing
- Tools and libraries for cleaning and normalizing sports data through the API
How to obtain sports data via API for machine learning
A reliable data pipeline for machine learning models starts with the right source. The platform api-sport.ru provides a unified API for multiple sports (football, hockey, basketball, tennis, table tennis, esports, and others), as well as for bookmaker information: lines, odds, and betting markets. This makes it possible to build training datasets for predictive models, recommendation systems, and bet-pricing systems without complex integration with dozens of disparate sources.
The architecture of the Sport Events API is built around the concept of a sportSlug. First, you request the list of available disciplines through the sports endpoint /v2/sport, then for a specific sport (e.g., football or esports) you fetch matches, tournaments, seasons, teams, players, and detailed statistics. The base URL is the same for all requests: https://api.api-sport.ru. Authorization is performed via an API key passed in the Authorization header, which you can generate and manage in your personal account on api-sport.ru.
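As a rough sketch of this first step, the request for the list of disciplines could look like the snippet below; the exact path of the sports list endpoint and the shape of its response are assumptions to verify against the API documentation.

import requests

API_KEY = 'YOUR_API_KEY'  # generated in your personal account on api-sport.ru

# Request the list of available disciplines; the response shape here
# (a list of objects with a "slug" field) is assumed for illustration.
response = requests.get(
    'https://api.api-sport.ru/v2/sport',
    headers={'Authorization': API_KEY},
)
response.raise_for_status()
sports = response.json()
print([s.get('slug') for s in sports])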
An example of obtaining matches and odds for building a training sample
For machine learning tasks, the match endpoints are used most often: /v2/{sportSlug}/matches for sampling by date, tournament, or team, and /v2/{sportSlug}/matches/{matchId} for complete information on a specific game. In the response, you receive basic fields (status, score, lineups) as well as extended match statistics, live events, and the oddsBase block with betting odds, which can be used in pricing models or anomaly detection.
import requests

API_KEY = 'YOUR_API_KEY'
BASE_URL = 'https://api.api-sport.ru/v2/football/matches'

headers = {
    'Authorization': API_KEY,
}
params = {
    'date': '2025-09-03',    # matches for a specific day
    'status': 'finished',    # only finished games for training
}

response = requests.get(BASE_URL, headers=headers, params=params)
data = response.json()
matches = data.get('matches', [])
print('Matches loaded:', len(matches))
At this stage, it is important to decide right away which entities the model will need: only the final score and xG, complete match statistics, minute-by-minute events, player data, or the oddsBase block with odds movement. Thanks to flexible filtering and the rich structure of API responses, you can minimize the amount of unnecessary information and pass only the fields needed for further cleaning and normalization into the pipeline.
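As an illustration, a projection onto the minimally required fields might look like the sketch below; the keys used here are the ones shown in the example above, and any additional fields would need to be checked against the actual response.

# Keep only the fields the model will actually use; field names beyond
# those already shown above are assumptions to verify against the API.
def project_match(m):
    return {
        'match_id': m.get('id'),
        'status': m.get('status'),
        'home_score': (m.get('homeScore') or {}).get('current'),
        'away_score': (m.get('awayScore') or {}).get('current'),
        'odds': m.get('oddsBase'),  # raw odds block for pricing or anomaly features
    }

slim_matches = [project_match(m) for m in matches]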
Cleaning sports data of missing values, outliers, and duplicates before ML
Even when using a quality API, data preparation for machine learning inevitably includes a thorough cleaning stage. Sports statistics contain missing values (for example, part of the statistics is absent for a minor tournament), outliers (an anomalous score, for instance caused by a forfeit), and duplicates (the same match collected by different filters). If these artifacts are not handled, the final model will overfit, produce biased estimates, and generalize poorly to new data.
When working with responses from the Sport Events API, it makes sense to separate field types. Categorical identifiers (tournament ID, team ID, player ID) usually do not contain missing values, while numerical indicators in the match statistics block and derived metrics may be only partially filled. Missing values in key features (for example, totalShots or ballPossession) are better either imputed using domain rules or, if their share is small, such observations discarded. Anomalous values are most conveniently identified using statistical criteria (z-score, interquartile range) together with domain constraints; for example, the total number of shots on goal cannot be negative.
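As a self-contained sketch of the interquartile-range rule combined with a domain constraint (the values below are artificial and only illustrate the idea):

import pandas as pd

# Toy series of per-match shot totals; real values would come from the
# match statistics block of the API response. 55 and -3 are artificial anomalies.
shots = pd.Series([8, 11, 9, 14, 10, 55, -3, 12])

q1, q3 = shots.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = (shots < q1 - 1.5 * iqr) | (shots > q3 + 1.5 * iqr)
domain_violations = shots < 0  # shot counts cannot be negative

clean_shots = shots[~(iqr_outliers | domain_violations)]
print('Dropped as anomalous:', int((iqr_outliers | domain_violations).sum()))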
Practical example of basic data cleaning for matches
After receiving the list of matches via the API, a typical step is to convert the data into a table with pandas and filter it further. At the code level, you remove duplicates by match ID, discard games without a final score, and normalize the basic fields. This creates a solid foundation for the subsequent feature engineering steps.
import pandas as pd

matches = data.get('matches', [])  # result of the request to /v2/{sportSlug}/matches

rows = []
for m in matches:
    rows.append({
        'match_id': m['id'],
        'start_ts': m['startTimestamp'],
        'status': m['status'],
        'home_score': m['homeScore']['current'] if m.get('homeScore') else None,
        'away_score': m['awayScore']['current'] if m.get('awayScore') else None,
    })

df = pd.DataFrame(rows)

# remove duplicates and matches without a final score
df = df.drop_duplicates(subset=['match_id'])
mask_finished = df['status'] == 'finished'
mask_has_score = df['home_score'].notna() & df['away_score'].notna()
df_clean = df[mask_finished & mask_has_score].copy()
print('Matches left after cleaning:', len(df_clean))
For advanced scenarios, you can add automatic logging of problematic records (nearly empty statistics blocks, impossible timestamp values, suspicious odds in oddsBase) and move the cleaning rules into a separate module. This makes dataset preparation reproducible when models are retrained and lets you adapt the procedures to new sports and new data providers connected via the API.
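One possible shape of such a rules module is sketched below; the rule names, thresholds, and the statistics field referenced in the last rule are illustrative assumptions rather than part of the API contract.

import logging

logger = logging.getLogger('dataset_cleaning')

# Each rule takes a raw match dict and returns a problem description or None.
CLEANING_RULES = [
    lambda m: 'missing score' if not m.get('homeScore') or not m.get('awayScore') else None,
    lambda m: 'non-positive timestamp' if (m.get('startTimestamp') or 0) <= 0 else None,
    lambda m: 'empty statistics block' if not m.get('statistics') else None,  # field name is an assumption
]

def audit_matches(matches):
    clean = []
    for m in matches:
        problems = [p for rule in CLEANING_RULES if (p := rule(m))]
        if problems:
            logger.warning('match %s skipped: %s', m.get('id'), ', '.join(problems))
        else:
            clean.append(m)
    return clean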
Normalization and scaling of sports metrics for machine learning models
After removing noise and errors, the next task is to bring heterogeneous numerical features to comparable scales. A single sports dataset can contain the number of shots (counts), ball possession (percentages), time on court (minutes), xG, and totals derived from betting odds. Without normalization, such features affect scale-sensitive algorithms unevenly, for example linear models, kNN, or neural networks.
Data obtained from the match statistics fields, the homeScore/awayScore values, and the oddsBase block are conveniently transformed with standard approaches: standardization (z-score), min-max scaling, or a log transform for heavily skewed distributions. It is important to separate features by meaning: counts of events are usually bounded by small values and are handled well by min-max normalization, while metrics like players' market value or odds-based totals often require a log transformation.
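A small sketch of mixing the two approaches might look like this, assuming a per-match dataframe df_stats with a bounded count column and a heavily skewed monetary column (both column names are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Counts with a small, known range are mapped to [0, 1]...
minmax = MinMaxScaler()
df_stats['shots_total_scaled'] = minmax.fit_transform(df_stats[['shots_total']])[:, 0]

# ...while heavily skewed monetary values are log-transformed first.
df_stats['market_value_log'] = np.log1p(df_stats['market_value'])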
Example of scaling match statistics before training
In practice, you can first project the complex structure of the API response onto a compact set of numeric features and then apply scaling from scikit-learn. Below is an example of normalizing scores and basic statistics as input to a match outcome prediction model. The same approach extends to time series by aggregating values per match or per time segment.
from sklearn.preprocessing import StandardScaler
import numpy as np

# assume we already have a per-match dataframe df_stats with numeric features
feature_cols = ['home_score', 'away_score', 'shots_total', 'ball_possession_home']
X = df_stats[feature_cols].fillna(0.0).values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print('Feature means after standardization:', np.mean(X_scaled, axis=0))
If you work with several sports available through the Sport Events API, it makes sense to build a separate normalization pipeline for each discipline. For example, the average goal total differs between football and hockey, while the distribution of points in basketball is considerably wider. Extract the scaling parameters (means and standard deviations, min-max boundaries) into configuration and save them together with the model version. This ensures reproducibility of results and correct processing of new data coming through the API in production.
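One possible way to persist these parameters next to a model version is sketched below; the file layout is just a convention, not something dictated by the API.

import json

# Persist per-sport scaling parameters so that inference applies exactly the
# same transformation as training; uses the scaler fitted in the example above.
scaler_config = {
    'sport': 'football',
    'model_version': '2025-09-03',
    'features': feature_cols,
    'mean': scaler.mean_.tolist(),
    'scale': scaler.scale_.tolist(),
}
with open('scaler_football.json', 'w') as f:
    json.dump(scaler_config, f, indent=2)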
How to combine sports data from different APIs and bring them to a unified format
Real ML systems rarely rely on a single source. In addition to the match statistics and odds provided by the api-sport.ru Sport Events API, you can use additional data: proprietary labels, analyst forecasts, user behavior data. For the model to make sense of such a mix, all sources must be brought to a unified format and a unified system of identifiers.
The Sport Events API simplifies the task with stable entity IDs: matches, tournaments, seasons, teams, and players. You can use the match identifier and its associated context (tournament, season, category) to join external tables. For bookmakers and product teams, this may include betting history, click logs, and internal risk assessments. It is important to define a canonical data layer in advance, where each match is represented exactly once and enriched with all necessary features.
Example of combining match statistics and external features
Below is a simplified integration scheme: first, you download matches via the /v2/{sportSlug}/matches endpoint, then you attach external data by the match_id key. This strategy lets you gradually expand the feature set without breaking existing pipelines and without changing the API contract.
import pandas as pd

# df_api: data from the Sport Events API
# df_ext: external per-match features (for example, internal quality labels)

# make sure the join keys have the same type
df_api['match_id'] = df_api['match_id'].astype(int)
df_ext['match_id'] = df_ext['match_id'].astype(int)

# inner join keeps only the matches present in both sources
merged = pd.merge(df_api, df_ext, on='match_id', how='inner')
print('Merged dataset shape:', merged.shape)
When working with several sports, it is also useful to standardize feature naming. For example, create abstract fields like team_score_home, team_score_away, shots_on_goal, which are filled from the match statistics block for each specific sportSlug. Such an abstraction layer lets you reuse the same ML code for football, hockey, and basketball, and makes it much easier to scale the system to new disciplines that are gradually added to the API.
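A sketch of such a mapping layer is shown below; the sport-specific source field names on the right-hand side are assumptions that would need to be verified against the actual statistics block for each sportSlug.

# Map sport-specific statistic names onto one abstract feature schema.
# The source field names are placeholders for illustration only.
FEATURE_MAP = {
    'football': {'team_score_home': 'homeScore', 'shots_on_goal': 'shotsOnGoal'},
    'hockey': {'team_score_home': 'homeScore', 'shots_on_goal': 'shotsOnGoal'},
}

def to_unified_features(sport_slug, raw_stats):
    mapping = FEATURE_MAP.get(sport_slug, {})
    return {unified: raw_stats.get(source) for unified, source in mapping.items()}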
Standardization of time series and events in sports data for ML processing
Many advanced models in sports work not only with aggregated match statistics but also with time series: odds dynamics, sequences of events, ball possession by minute. This format is especially important for live models and early risk-warning systems. However, the raw events that you receive via the /v2/{sportSlug}/matches/{matchId}/events endpoint and the live events field in match details have different frequencies and are distributed unevenly over time.
To use sequences in models (LSTM, transformers, TCN), they need to be standardized: brought to a fixed time step (for example, one row per minute of the match), with a normalized timestamp format and careful handling of gaps (minutes without events). For betting odds from the oddsBase block the task is similar: alignment by timestamps, aggregation of bursts of updates, and removal of purely technical updates.
Example of bringing match events to a minute grid
Below is an example of how to transform an array of match events into a regular time series suitable for training sequential models. We create one record per minute of the game and incrementally update the score so that the model sees the course of the match, not just the final result.
import pandas as pd

# events: response from /v2/{sportSlug}/matches/{matchId}/events

minute_rows = []
home_goals = 0
away_goals = 0

for t in range(0, 91):  # minutes of the match
    # select the events of the current minute
    ev_minute = [e for e in events if e['time'] == t]
    for e in ev_minute:
        if e['type'] == 'goal':
            if e['team'] == 'home':
                home_goals += 1
            else:
                away_goals += 1
    minute_rows.append({
        'minute': t,
        'home_goals': home_goals,
        'away_goals': away_goals,
    })

series_df = pd.DataFrame(minute_rows)
print(series_df.head())
A similar approach works for live odds: you define a target frequency (for example, once every 30 seconds), interpolate or aggregate the changes from oddsBase, and obtain a single time series that can be fed into the model. The more standardized your sequences are, the easier it is to scale the solution to different tournaments and sports without rewriting the data preparation logic after every change in the API structure.
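A minimal pandas sketch of such resampling, assuming the odds updates have already been extracted into a dataframe with a Unix timestamp column and a numeric quote column (both column names are illustrative):

import pandas as pd

# odds_df: one row per raw update from the odds block,
# with columns ['ts', 'home_win_odds'] assumed for this sketch
odds_df['ts'] = pd.to_datetime(odds_df['ts'], unit='s')
odds_series = (
    odds_df.set_index('ts')['home_win_odds']
    .resample('30s')  # fixed 30-second grid
    .last()           # keep the latest quote inside each interval
    .ffill()          # carry the quote forward through quiet intervals
)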
Tools and libraries for cleaning and normalizing sports data through the API
To build an industrial pipeline for processing sports data, it is important to choose the right tool stack. On the data retrieval side, lightweight HTTP clients (requests in Python, axios in JavaScript) are sufficient for working with the Sport Events API, while for storage a columnar DBMS or a data warehouse works well. The main volume of work falls on the transformation layer, where libraries for cleaning, normalization, and validation are applied. In sports analytics, pandas, NumPy, and scikit-learn are most often used for classical ML, along with specialized frameworks for time series and deep learning.
Special attention should be paid to orchestration and monitoring. Tools like Airflow or Prefect help run the tasks of extracting data from the API, cleaning, normalization, and loading into a feature store on a regular schedule. On the api-sport.ru side, new capabilities are actively being developed: WebSocket support for receiving live streams in real time is planned, as well as AI tools that will automate some preprocessing and data enrichment steps directly at the API level.
Example of a simple ETL pipeline in Python using the Sport Events API
Below is a minimal example of an ETL task that retrieves matches by date, cleans the basic fields, and saves the result to CSV. Such a script can easily be wrapped in an orchestrator and run daily, providing a steady stream of normalized data for the models.
import requests
import pandas as pd

API_KEY = 'YOUR_API_KEY'
BASE_URL = 'https://api.api-sport.ru/v2/football/matches'

def load_matches(date):
    headers = {'Authorization': API_KEY}
    params = {'date': date, 'status': 'finished'}
    r = requests.get(BASE_URL, headers=headers, params=params)
    r.raise_for_status()
    return r.json().get('matches', [])

def transform(matches):
    rows = []
    for m in matches:
        rows.append({
            'match_id': m['id'],
            'date': m['dateEvent'],
            'home_team': m['homeTeam']['name'],
            'away_team': m['awayTeam']['name'],
            'home_score': m['homeScore']['current'],
            'away_score': m['awayScore']['current'],
        })
    df = pd.DataFrame(rows).drop_duplicates(subset=['match_id'])
    return df

if __name__ == '__main__':
    matches = load_matches('2025-09-03')
    df = transform(matches)
    df.to_csv('matches_clean.csv', index=False)
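As an illustration of the orchestration mentioned above, the same script could be turned into a Prefect flow roughly as follows; this is only a sketch, and load_matches and transform refer to the functions defined in the example above.

from prefect import flow, task

@task(retries=2)
def extract_matches(date: str):
    # reuse load_matches from the ETL script above
    return load_matches(date)

@task
def clean_and_save(matches):
    df = transform(matches)  # transform is also defined above
    df.to_csv('matches_clean.csv', index=False)

@flow(name='daily-sports-etl')
def daily_sports_etl(date: str):
    clean_and_save(extract_matches(date))

# daily_sports_etl('2025-09-03')  # in practice, scheduled via a deployment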
Over time, you can extend such a pipeline with additional steps: enrichment from the match statistics block, feature normalization, time series standardization, and data quality logging. With a unified API and a well-structured tool stack, you get a scalable infrastructure that supports both offline model training and online inference in products related to betting, recommendation systems, and sports analytics.




