High-throughput screening (HTS) is one of the leading techniques for hit identification in drug discovery and is typically conducted in two phases: primary and confirmatory. The resulting data is multi-fidelity, with noisy primary screening data available for a large number of compounds and higher-quality confirmatory data for a low-to-moderate number of compounds. Existing computational pipelines do not integrate the primary and confirmatory screening data of individual HTS campaigns, leaving millions of potentially useful screening data points unused in models of confirmatory bioactivity prediction. Furthermore, there is currently a lack of publicly available multi-fidelity bioactivity benchmarks to support modelling of real-world high-throughput screening data. To address these challenges, we first compiled a public collection of 23 multi-fidelity HTS datasets from PubChem for benchmarking, comprising more than 6.1 million data points. Additionally, we assembled a private collection of 19 AstraZeneca HTS datasets, spanning more than 22.8 million data points. We then designed and evaluated machine learning models to assess the improvements offered by the integration of multi-fidelity data, including classical machine learning and novel deep learning approaches, the latter based on graph neural networks. Jointly modelling primary and confirmatory data led to a 12% decrease in mean absolute error (MAE) and a 152% increase in R-squared on the public datasets, and a 17% reduction in MAE coupled with a 46% uplift in R-squared on the AstraZeneca datasets (averaged across all evaluated methods). Furthermore, supplementing the models with molecular embeddings produced by previously trained deep learning models improved performance for compounds that were not part of the primary screen, up to doubling the baseline performance.
We conclude that joint modelling of multi-fidelity HTS data improves predictive performance, and that deep learning enables unique and highly desirable strategies such as leveraging signals from multi-million-scale datasets and transfer learning.