This paper quantifies the extent to which infants can perceive audio–visual congruence for speech information and assesses whether this ability changes with native language exposure over time. A hierarchical Bayesian robust regression model of 92 separate effect sizes extracted from 24 studies indicates a moderate effect size in a positive direction (0.35, CI [0.21, 0.50]). This result suggests that infants possess a robust ability to detect audio–visual congruence for speech. Moderator analyses, moreover, suggest that infants’ audio–visual matching ability for speech emerges at an early point in the process of language acquisition and remains stable for both native and non-native speech throughout early development. A sensitivity analysis of the meta-analytic data, however, indicates that a moderate publication bias for significant results could shift the lower bound of the credible interval to include null effects. Based on these findings, we outline recommendations for new lines of enquiry and suggest ways to improve the replicability of results in future investigations.