Using machine learning to overcome mosquito collections missing data for malaria modeling
Rubio-Palis, Y.; Feng, L.; Liang, K. S.; Song, C.; Wang, S.; Duchnicki, T.; Zhang, X.; Bravo de Guenni, L.
Abstract
Entomological surveillance plays a crucial role in areas where malaria remains endemic, yet gathering data on mosquito populations is often expensive and complicated, particularly in remote locations with challenging logistics and inconsistent sampling schedules. Access to extensive time series data on mosquito species at specific sites would greatly enhance insights into seasonal trends and the biting habits of the vectors of malaria parasites. Gaps in mosquito count records pose a significant challenge for researchers and public health officials seeking to establish early warning systems and effective vector control programs. In this study, we apply quantitative machine learning techniques to address missing data in estimates of mosquito abundance collected from 2009 to 2016 in Bolívar State, Venezuela. We evaluated Linear Regression, Stochastic Linear Regression, K-Nearest Neighbor, and Gradient Boosting methods for imputing missing counts of Anopheles mosquitoes, employing a leave-one-out cross-validation strategy. Additionally, we developed a predictive malaria transmission model incorporating mosquito abundance and climate variables (El Niño 3.4 index, rainfall, and mean air temperature) as covariates. Our generalized time series model forecasts the incidence of Plasmodium vivax and Plasmodium falciparum malaria based on climate dynamics and imputed mosquito data. Model performance was assessed using root mean square error, mean absolute error, and mean absolute percentage error. The results showed that machine learning imputation significantly improved the accuracy and reliability of P. vivax malaria incidence predictions but failed to predict P. falciparum incidence. The study also demonstrates that the choice of imputation method significantly influences the reconstruction of seasonal abundance patterns and the performance of malaria incidence models.
Nevertheless, the proposed models strengthen the foundation for targeted interventions and surveillance in endemic regions. Despite limitations in data continuity and coverage, the findings highlight the value of combining multiyear entomological data sets with robust imputation and sensitivity analyses to improve predictive modeling in resource-constrained, malaria-endemic settings.
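
The evaluation strategy named in the abstract (leave-one-out cross-validation scored with RMSE, MAE, and MAPE) can be illustrated with a minimal sketch. This is not the authors' implementation: the imputer below is a simplified nearest-in-time neighbor average standing in for the K-Nearest Neighbor method, and the function names `knn_time_impute` and `leave_one_out_scores` and the example counts are hypothetical.

```python
import math


def knn_time_impute(series, idx, k=2):
    """Estimate series[idx] as the mean of the k observed values nearest in time."""
    observed = [(abs(j - idx), v) for j, v in enumerate(series)
                if j != idx and v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    observed.sort(key=lambda pair: pair[0])  # closest time steps first
    nearest = [v for _, v in observed[:k]]
    return sum(nearest) / len(nearest)


def leave_one_out_scores(series, k=2):
    """Hide each observed value in turn, impute it, and score the errors."""
    actual, predicted = [], []
    for i, v in enumerate(series):
        if v is None:  # genuinely missing: nothing to validate against
            continue
        actual.append(v)
        predicted.append(knn_time_impute(series, i, k))
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined for zero counts, so those points are skipped here.
    pct = [abs(e) / a for a, e in zip(actual, errors) if a != 0]
    mape = 100 * sum(pct) / len(pct) if pct else float("nan")
    return {"rmse": rmse, "mae": mae, "mape": mape}


# Hypothetical monthly counts; None marks a month with no collection.
counts = [12, 15, None, 30, 28, None, 9, 11]
scores = leave_one_out_scores(counts, k=2)
print(scores)
```

Running the same scoring function over several candidate imputers would give a like-for-like comparison of the kind the abstract describes, under the assumption that held-out observed months are representative of the truly missing ones.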