Data cleaning and big data validation for public transport system

  • Duration: June 2025
  • Program: C-NEWTRAL Doctoral Network funded by the EU’s Horizon Europe Marie Skłodowska-Curie Actions programme (Grant Agreement No. 101119603)
  • Staff: Alireza Aboutalebi Adergani
ifeu office

The non-academic secondment at the Institute for Energy and Environmental Research (ifeu) strengthened the methodological capacity of the research, particularly in the development of data processing and validation tools for large-scale mobility datasets. The work focused on the design and implementation of a custom-built analytical environment for cleaning, validating, and comparing public transport datasets. Because transport data sources are heterogeneous, inconsistencies, missing values, and structural mismatches arise frequently and must be handled systematically before meaningful analysis is possible.

To address this, an integrated data analysis and comparison tool was developed. The tool aggregates multiple datasets, calculates key operational indicators (e.g., service distance, routes, service hours), and supports structured comparison across different data inputs. It also helps identify and explore inconsistencies and anomalies, which improves data reliability and analytical transparency. In addition, it includes interactive geospatial visualization components that allow spatial inspection of transport networks and support the validation of processed datasets against real-world spatial structures. Combining data cleaning, statistical aggregation, and spatial visualization in one environment provides a consistent workflow for handling big mobility data in applied research contexts.

It should be noted that the datasets used in this work were primarily provided by a single company and followed company-specific formats and structures. Development and testing were based on datasets from this source, and the tool has not yet been applied in a fully generalized setting in which large-scale datasets from multiple companies are processed and integrated simultaneously.
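
The cleaning, indicator calculation, and comparison steps described above can be illustrated with a minimal Python sketch. It is not the tool developed during the secondment, whose code and data formats are not published here. The record schema (`route_id`, `service_km`, `service_hours`), the function names, the sample values, and the 5 % tolerance are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TripRecord:
    # Hypothetical schema; the real company-specific format is not described in the text.
    route_id: Optional[str]
    service_km: Optional[float]
    service_hours: Optional[float]


def clean(records):
    """Split records into valid ones and rejected ones (missing or negative values)."""
    valid, rejected = [], []
    for r in records:
        ok = (
            r.route_id is not None and r.route_id.strip() != ""
            and r.service_km is not None and r.service_km >= 0
            and r.service_hours is not None and r.service_hours >= 0
        )
        (valid if ok else rejected).append(r)
    return valid, rejected


def indicators(records):
    """Aggregate the indicators named in the text: service distance, routes, service hours."""
    return {
        "service_km": sum(r.service_km for r in records),
        "routes": len({r.route_id for r in records}),
        "service_hours": sum(r.service_hours for r in records),
    }


def compare(a, b, tolerance=0.05):
    """Return the indicators whose relative difference exceeds the (assumed) tolerance."""
    flagged = {}
    for key in a:
        base = max(abs(a[key]), abs(b[key]))
        rel = 0.0 if base == 0 else abs(a[key] - b[key]) / base
        if rel > tolerance:
            flagged[key] = rel
    return flagged


# Invented sample data: two versions of the same timetable extract.
dataset_a = [
    TripRecord("10", 12.5, 0.5),
    TripRecord("20", 8.0, 0.4),
    TripRecord(None, 3.0, 0.1),  # missing route ID, rejected during cleaning
]
dataset_b = [
    TripRecord("10", 12.5, 0.5),
    TripRecord("20", 4.0, 0.4),  # shorter distance for route 20
]

valid_a, rejected_a = clean(dataset_a)
valid_b, rejected_b = clean(dataset_b)
flagged = compare(indicators(valid_a), indicators(valid_b))
print(flagged)  # only service_km differs by more than the tolerance
```

Keeping the rejected records rather than silently dropping them mirrors the emphasis on transparency: each exclusion can be inspected later. A real implementation would also need format-specific parsing and the geospatial checks mentioned in the text, which this sketch leaves out.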

The development of this analytical environment was supported by AI-assisted coding practices. Prompt-based interaction with large language models was used to accelerate code development, debug workflows, and refine data-processing logic. This approach reduced development time while keeping the analytical tool flexible and open to iterative improvement.

Figure 1: Integrated data analysis, validation, and visualization interface for public transport datasets. The tool enables multi-dataset comparison, supports anomaly inspection, and allows spatial analysis of transport networks. Note: Numerical values and certain operational indicators have been partially anonymized in accordance with company data protection policies.