EPJ Data Science

journal article

Open Access Collection

Analyzing parliamentary voting dynamics using multiple aspects trajectory clustering approach

Santos, Yuri; Portela, Tarlis; Torres, Marcus; Cardoso Silva, Jonathan; Tyska Carvalho, Jônata

2026 EPJ Data Science

doi: 10.1140/epjds/s13688-025-00609-y pmid: N/A

Multiple aspects trajectory (MAT) is a relevant concept that enables mining useful patterns and behaviors of moving objects for different applications. As a new way of looking at trajectories, MAT includes a semantic dimension, and thus presents the notion of aspects that are relevant facts of the real world that add more meaning to spatio-temporal data. Considering the possibilities of this new algorithmic paradigm, we decided to test it on political data. More specifically, we look at legislative voting behavior to understand political alignment, coalition dynamics, and governance patterns. Traditional data mining approaches do not capture the temporal motifs of parliamentary voting patterns. We address this gap by employing the MAT-Tree algorithm, a hierarchical clustering method for multiple aspects trajectories, to analyze twenty years of voting data from the Brazilian Chamber of Deputies. We aim to reveal hidden patterns, such as voting similarities and alignments, by analyzing the data from the perspective of multiple aspects, thereby enabling a multidimensional analysis of voting patterns. The experimental results demonstrate that MAT-Tree identifies cohesive voting blocks, shifts in legislative support, and outlier behaviors across different political periods. Furthermore, the analysis reveals critical patterns, including increased polarization in post-impeachment periods and evolving dynamics between government and opposition. These findings highlight the potential of MAT clustering with MAT-Tree as a robust tool for political analysis, providing a scalable framework for exploring multidimensional datasets that go beyond mobility data.

journal article

Open Access Collection

Recovering scheduling preferences in dynamic departure time models

Yang, Zhenyu; Giardina, Pietro; Gerolimnis, Nikolas; de Palma, André

2026 EPJ Data Science

doi: 10.1140/epjds/s13688-025-00608-z pmid: 41969366

We aim to infer commuters’ scheduling preferences from their observed arrival times, given an exogenous traffic congestion pattern. To do this, we employ a structural model that characterizes how users balance congestion costs against the penalties for arriving early or late relative to an ideal time. In this framework, each commuter selects an arrival time that minimizes her overall trip cost by considering the within-day congestion pattern along with her individual scheduling preference. By incorporating the distribution of these preferences and desired arrival times across the population, we can estimate the likelihood of observing arrivals at specific times. Using synthetic data, we then apply the maximum likelihood estimation (MLE) method to recover the parameters of the joint distribution of scheduling preferences and desired arrival times. Our numerical results demonstrate the effectiveness of the proposed method.
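
The trade-off described in this abstract can be illustrated with a minimal sketch. It assumes a piecewise-linear schedule-delay penalty (separate per-minute costs for early and late arrival), which the abstract does not specify; the names `trip_cost`, `best_arrival`, `travel_time`, and all parameter values are hypothetical. It covers only the commuter's choice step, not the maximum likelihood estimation.

```python
# Times are minutes after midnight; costs are in arbitrary units.

def travel_time(arrival):
    # Hypothetical exogenous congestion pattern: 10 minutes free-flow,
    # rising linearly to 30 minutes for arrivals at 08:00 (minute 480).
    return 10.0 + 20.0 * max(0.0, 1.0 - abs(arrival - 480) / 60.0)

def trip_cost(arrival, travel_time, desired, alpha=1.0, beta=0.5, gamma=2.0):
    # alpha: cost per minute of travel; beta: cost per minute early;
    # gamma: cost per minute late (all assumed, not taken from the paper).
    early = max(0.0, desired - arrival)
    late = max(0.0, arrival - desired)
    return alpha * travel_time(arrival) + beta * early + gamma * late

def best_arrival(grid, travel_time, desired, **prefs):
    # The commuter picks the arrival time on the grid with the lowest cost.
    return min(grid, key=lambda t: trip_cost(t, travel_time, desired, **prefs))

grid = range(360, 541)  # candidate arrivals from 06:00 to 09:00
patient = best_arrival(grid, travel_time, desired=480, beta=0.2)
punctual = best_arrival(grid, travel_time, desired=480, beta=0.5)
print(patient, punctual)
```

In this toy setting, a commuter with a low early-arrival penalty avoids the peak entirely, while one with a higher penalty arrives at the desired time. Observed arrival times therefore carry information about the underlying preferences, which is what an estimation procedure would exploit.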

journal article

Open Access Collection

Connective action and digital repression during China’s COVID-19 protests: a computational analysis of multilingual coordinated activity on Twitter

Kulichkina, Aytalina; Balluff, Paul; Righetti, Nicola; Waldherr, Annie

2026 EPJ Data Science

doi: 10.1140/epjds/s13688-026-00637-2 pmid: N/A

In authoritarian contexts, social media serve as critical platforms for coordinating both protest and repression. This study centers on the unprecedented COVID-19 protests in the People’s Republic of China, which were extensively tweeted and suppressed through contesting narratives. We explore prominent themes, temporal dynamics, and linguistic patterns of coordinated communication during these events. Using a coordination detection algorithm, we identified 13,557 Twitter accounts involved in 739,819 instances of coordinated sharing during the protests. We then applied topic modeling to categorize the coordinated tweets into topics supporting either the protests or repression. Drawing on the theory of authoritarian publics, we classified protest-supporting topics into three categories: leadership-critical, policy-critical, and descriptive. Similarly, building on the digital repression typology, we categorized repression-supporting topics into government propaganda, distracting information, and demoralizing content. Within protest-supporting content, policy-critical tweets were the most widely shared across three analyzed languages. Leadership-critical tweets were more prominent in traditional Chinese, while descriptive tweets were more common in simplified Chinese. Repression-supporting content was most prevalent in English, followed by simplified Chinese, with demoralizing and distracting information dominating discourse. Government propaganda was the least frequent and appeared primarily in simplified Chinese. Community detection revealed that 85.4% of coordinated tweets were amplified by ten major communities, each organized around a single language and goal—either supporting protests or promoting repression. 
By combining multiple computational approaches, this study offers a comprehensive framework for content-centered analysis of online protest-repression dynamics and contributes to our understanding of connective action and digital repression in authoritarian contexts.
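
As a rough illustration of what coordinated sharing detection can look like, the sketch below flags pairs of accounts that repeatedly share the same object within a short time window. It is not the algorithm used in the study; the function name `coordinated_pairs`, the 60-second window, and the repetition threshold are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def coordinated_pairs(shares, window=60, min_repetitions=2):
    """shares: iterable of (account, object_id, unix_timestamp) tuples."""
    by_object = defaultdict(list)
    for account, obj, ts in shares:
        by_object[obj].append((ts, account))

    counts = defaultdict(int)
    for events in by_object.values():
        # Collect each account pair at most once per shared object.
        pairs_here = set()
        for (t1, a1), (t2, a2) in combinations(events, 2):
            if a1 != a2 and abs(t2 - t1) <= window:
                pairs_here.add(tuple(sorted((a1, a2))))
        for pair in pairs_here:
            counts[pair] += 1

    # Keep only pairs that co-shared on enough distinct objects.
    return {pair: n for pair, n in counts.items() if n >= min_repetitions}

shares = [
    ("a", "url1", 0), ("b", "url1", 30), ("c", "url1", 500),
    ("a", "url2", 100), ("c", "url2", 110), ("b", "url2", 120),
]
print(coordinated_pairs(shares))
```

Requiring repeated co-sharing, rather than a single coincidence, is a common way to reduce false positives from accounts that happen to post the same popular content at the same moment.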

journal article

Open Access Collection

Stigmergic influence of simple bots on human cooperation in digital environments

Bassanetti, Thomas; Cezera, Stéphane; Delacroix, Maxime; Escobedo, Ramón; Blanchet, Adrien; Sire, Clément; Theraulaz, Guy

2026 EPJ Data Science

doi: 10.1140/epjds/s13688-026-00653-2 pmid: 42164078

In the digital era, human cooperation is increasingly mediated by indirect social cues such as ratings, reviews, and other digital traces left in online environments. These traces often guide collective behavior via stigmergy, a coordination mechanism whereby individuals interact through modifications of a shared environment. In this study, we explore how simple model-driven bots can influence human cooperation or defection in a competitive rating game inspired by online marketplaces. Participants, unaware of the bots’ presence, interacted with either four human partners or four bots exhibiting predefined behaviors—cooperative, neutral, deceptive, or optimized for group performance. We show that the presence and behavior of bots significantly affect human strategies and performance. Higher levels of cooperation among bots improve human outcomes but also increase the frequency of deceptive human strategies, suggesting exploitation of reliable social information. Conversely, in less cooperative environments, participants adopt more collaborative or neutral behaviors to preserve informational value. By classifying individuals into three behavioral profiles—collaborators, neutrals, and defectors—we develop a linear regression model using three cues: the average value of rated cells, the diversity of rated cells, and the player’s rank. These cues allow accurate prediction of behavioral profile distributions across experimental conditions. An adaptive agent-based model further reproduces the empirical results. Our findings demonstrate that even simple bots can strongly influence collective dynamics in human groups. These insights have implications for the design of recommendation systems, the regulation of automated agents, and the understanding of cooperation and deception in digital societies.

journal article

Open Access Collection

Citizen design science – towards generative and responsive cities

Schmitt, Gerhard

2026 EPJ Data Science

doi: 10.1140/epjds/s13688-026-00626-5 pmid: N/A

Urban development faces multifaceted challenges, including climate change, mobility transitions, resource depletion, social inequality, and demographic shifts. To address these, cities must become responsive and generative, placing citizens at the center of transformation. Traditional top-down planning often fails to leverage citizen participation and overlooks their growing familiarity with digital and AI tools. This study introduces Citizen Design Science, a collaborative methodology that integrates citizen engagement with advances in data science, AI, and design science. By combining participatory design, computational instruments, geospatial analytics, simulation, and real-time data, the approach empowers both experts and non-experts to shape resilient and livable human settlements. Case examples from education, research, culture, and urban planning demonstrate how Citizen Design Science democratizes development and fosters inclusive, scientifically grounded processes. The methodology emphasizes citizen empowerment, technological integration, and collaborative governance across scales, from villages to megacities. Key challenges remain, including time-intensive engagement, digital accessibility, shared human-AI governance, data quality, and the digital divide. Overcoming these obstacles is essential for scaling impact and ensuring resilient, livable settlements.

journal article

Open Access Collection

Understanding the spatial and temporal impact of global events through large-scale social media data

Meyer, Ann-Kathrin; Brandt, Tobias

2026 EPJ Data Science

doi: 10.1140/epjds/s13688-026-00657-y pmid: N/A

Large-scale urban social media data can provide substantial insights into the real-time development of cities around the globe, illuminating phenomena such as gentrification, urban decay, and resilience to major adverse events. This study utilizes a dataset of over 147.8 million georeferenced tweets from multiple cities to demonstrate their potential for analyzing the emotional and temporal impacts of major events, including the U.S. presidential elections and the Covid-19 pandemic. By employing a sentiment indicator and an anxiety indicator, we highlight the importance of establishing robust baselines that are not only city-specific but also long-term, population-based, and user-based. We demonstrate the value of integrating georeferenced data with long-term analysis to uncover spatial and temporal patterns in public emotional responses, offering new perspectives on the dynamics of crises, such as climate change, and societal resilience.

journal article

Open Access Collection

A row-type specific hybrid framework for credit risk analysis: loan portfolio based feature selection and unsupervised Bayesian network dependency exploration

Rath, Minati; Date, Hema

2026 EPJ Data Science

doi: 10.1140/epjds/s13688-026-00655-0 pmid: N/A

Credit risk assessment is a critical function in financial analytics, requiring models that can adapt to diverse borrower profiles while providing clear and interpretable insights. Although a range of data-driven techniques have been applied in this domain, many struggle to handle the inherent heterogeneity of financial data across different loan categories such as personal and agricultural loans. This paper introduces Credit Risk Analysis with Bayesian Networks (CRAB-Net), a row-type specific hybrid framework for credit risk modeling. The approach first segments the data by loan type and balances the distribution of risk categories to ensure fair representation. It then identifies the most outcome-relevant attributes through targeted feature selection, focusing on variables most associated with credit risk differentiation. On this refined set of features, unsupervised Bayesian network learning is applied to uncover conditional dependencies among financial variables without relying on default outcome labels. This design combines supervised relevance filtering with unsupervised dependency discovery, reducing noise and avoiding misleading patterns from analyzing all features indiscriminately. The framework revealed that in personal loans, installment-related variables such as installment frequency, overdue status, and repayment structure emerged as central nodes, indicating their dominant role in defining repayment behavior and delinquency risk. In contrast, for agricultural loans, the network structure was shaped primarily by provisioning norms, landholding details, and exposure-related attributes such as sanctioned amount and collateral type, suggesting that borrower risk in this segment is more closely linked to regulatory classification and collateral strength. 
Experiments on real-world banking data show that CRAB-Net provides interpretable dependency graphs, supports fair segment-level analysis, and enhances transparency for audit and supervisory compliance under Basel norms by offering clear, data-driven evidence of the risk factors shaping borrower outcomes.

journal article

Open Access Collection

Beyond the tax haven: a graph analysis of business attraction in Swiss municipalities

Capozzi, Arthur; Dailisan, Damian

2026 EPJ Data Science

doi: 10.1140/epjds/s13688-026-00619-4 pmid: N/A

Switzerland’s decentralized fiscal structure has long been anecdotally credited with creating intense tax competition among its municipalities, famously attracting businesses to cantons like Zug. This research proposes a data-driven analysis of the factors that influence the business landscape in 226 Swiss municipalities from 2011 to 2022. By leveraging a rich collection of spatio-temporal open datasets, we build a predictive model of business creation and use explainable AI techniques to uncover the key socioeconomic drivers of municipal attractiveness. Our core methodology uses machine learning models, particularly graph neural networks (GNNs), to learn and capture the complex interdependencies between municipalities. Here, a GNN using attention mechanisms performs the best, with a median $R^{2}=0.832$ when using business sector demographics, population, municipal expenditure, and tax rate feature sets. Combining the trained models with explainable AI, we find that the most important features come from the business statistics datasets, rather than the tax data. However, a more granular analysis of municipalities grouped by primary language shows a different set of important features, highlighting the importance of a contextual, localized approach rather than a one-size-fits-all analysis. This study provides a nuanced understanding of the interaction between tax policies, demographics, infrastructure, and other factors in shaping Switzerland’s economic geography.

journal article

Open Access Collection

Mapping violence perceptions through YouTube comments: a new approach to real-time violence monitoring

Amarasinghe, Ashani; Nanlohy, Sascha; Morgan, Thomas; Hammond, David; Dahiya, Yashdeep; Bailo, Francesco

2026 EPJ Data Science

doi: 10.1140/epjds/s13688-026-00649-y pmid: N/A

This paper introduces the Violence Perception Index (VPI), a novel methodology for quantifying violence-related discourse through geolocated YouTube comments. Utilizing the YouTube API and natural language processing techniques, the VPI measures public references to violence across 1.2 million unique geolocated videos in Mexico (2020–2024), extracting 14.8 million comments from over 500,000 videos with user engagement. This approach provides spatiotemporally granular data on violence-related discourse, which we treat as a proxy for violence perceptions, extending beyond traditional event-based datasets by capturing not only documented violence but also rumors, fears, and community discourse about violence, dimensions that influence community behavior and social stability independently of official records. Violence scores are constructed using a weighted Spanish-language dictionary developed through semantic network expansion from violence-related seed terms. The dictionary-based scoring approach demonstrates moderate-to-substantial agreement with large language model classifications across 700 stratified comments (75-81% agreement), validating the method’s capacity to systematically identify violence-related discourse at scale while maintaining computational efficiency for processing millions of comments. The VPI is benchmarked against established violence indicators including ACLED fatalities and official municipal homicide statistics through panel regression specifications incorporating comprehensive spatial and temporal fixed effects. Analysis reveals systematic geographic heterogeneity: the VPI correlates strongly with ACLED data in high-population areas but exhibits stronger correlation with official homicide records in low-population contexts. Rather than constituting a methodological limitation, this pattern demonstrates the VPI’s enhanced sensitivity in marginalized and remote regions where news-based datasets suffer from systematic reporting bias. 
The methodology is immediately scalable across languages and geographies, providing complementary intelligence for conflict monitoring, early warning systems, and policy interventions in precisely those underrepresented areas where traditional event-based monitoring systems provide incomplete coverage.
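
A weighted dictionary score of the kind described in this abstract might be sketched as follows. The lexicon entries, their weights, and the normalization by token count are all hypothetical choices for illustration; the abstract does not give the actual dictionary or scoring formula.

```python
import re

# Hypothetical weighted Spanish-language lexicon (terms and weights invented).
LEXICON = {"balacera": 1.0, "secuestro": 0.9, "robo": 0.5}

def violence_score(comment, lexicon=LEXICON):
    # Lowercase and split into word tokens; \w matches accented letters too.
    tokens = re.findall(r"\w+", comment.lower())
    if not tokens:
        return 0.0
    # Sum the weights of matched terms, normalized by comment length
    # (an assumed normalization, so long comments are not favored).
    total = sum(lexicon.get(token, 0.0) for token in tokens)
    return total / len(tokens)

print(violence_score("Hubo una balacera y un robo"))
print(violence_score("Qué bonito video"))
```

Scoring of this form needs only a dictionary lookup per token, which is consistent with the abstract's emphasis on computational efficiency when processing millions of comments.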

journal article

Open Access Collection

Impact of federated data with local differential privacy for human mobility modeling

Gibbs, Hamish; Musolesi, Mirco; Cheshire, James; Eggo, Rosalind M.

2026 EPJ Data Science

doi: 10.1140/epjds/s13688-025-00611-4 pmid: N/A

With increasing awareness of the privacy risks posed by mobile phone location data, researchers need ways to use mobility data while offering stronger privacy guarantees to the individuals included in this data. A promising approach to this challenge is the creation of privacy-preserving mobility insights from decentralized location data using Local Differential Privacy (LDP). However, mobility data generated with LDP, based on the introduction of noise by individual mobile devices, is limited by the volume of noise required to achieve individual privacy. In this paper, we provide a fully reproducible model of the accuracy of mobility networks generated with LDP compared to mobility network data generated with more traditional privacy mechanisms: Central Differential Privacy (CDP) and K-anonymity. Using a simulated mobile phone mobility dataset informed by real-world travel patterns in the USA, we explore the trade-off between privacy and data utility provided by different parameters in a federated system with LDP. We also explore the impact of spatial and temporal aggregation on data accuracy, showing that long-standing considerations regarding the appropriate units of analysis for geographic data play a key role in determining the utility of federated mobility data with LDP. Our paper facilitates an in-depth understanding of the trade-offs between privacy and data utility entailed by the future adoption of a federated approach which uses LDP to generate insights from decentralized mobility data.
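
The noise penalty of the local model can be seen in a small simulation. The sketch below uses the Laplace mechanism for both settings and a uniform toy distribution of trips over ten flows; the mechanism, the sensitivity value, and all names are assumptions for illustration, not the configuration evaluated in the paper.

```python
import random

def laplace(rng, scale):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def central_dp_counts(trips, n_flows, epsilon, rng, sensitivity=2.0):
    # A trusted aggregator sums the true reports, then adds noise once per flow.
    counts = [0.0] * n_flows
    for flow in trips:
        counts[flow] += 1.0
    return [c + laplace(rng, sensitivity / epsilon) for c in counts]

def local_dp_counts(trips, n_flows, epsilon, rng, sensitivity=2.0):
    # Each device perturbs its own one-hot report before it leaves the phone,
    # so the aggregate accumulates one noise draw per device per flow.
    totals = [0.0] * n_flows
    for flow in trips:
        for j in range(n_flows):
            true_value = 1.0 if j == flow else 0.0
            totals[j] += true_value + laplace(rng, sensitivity / epsilon)
    return totals

def mean_abs_error(estimate, truth):
    return sum(abs(e - t) for e, t in zip(estimate, truth)) / len(truth)

rng = random.Random(0)
n_flows, epsilon = 10, 1.0
trips = [rng.randrange(n_flows) for _ in range(2000)]  # one trip per device
truth = [float(trips.count(j)) for j in range(n_flows)]

cdp_error = mean_abs_error(central_dp_counts(trips, n_flows, epsilon, rng), truth)
ldp_error = mean_abs_error(local_dp_counts(trips, n_flows, epsilon, rng), truth)
print(cdp_error, ldp_error)
```

With the same privacy parameter, the central error stays on the order of a few trips, while the local error grows with the square root of the number of devices. Aggregating over larger spatial or temporal units raises the true counts relative to that noise, which is one way to read the abstract's point about units of analysis.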
