AutoDock4 is a widely used program for docking small molecules to macromolecular targets. It describes ligand-receptor interactions using a physics-inspired scoring function that has proven useful in a variety of drug discovery projects. However, compared to more recent software, AutoDock4 has longer execution times, which limits its applicability to large-scale docking. To address this problem, we describe an OpenCL implementation of AutoDock4, called AutoDock-GPU, that leverages the highly parallel architecture of GPU hardware to improve docking throughput by up to 170-fold. Moreover, we introduce the gradient-based local search method ADADELTA, which is more efficient than the original Solis-Wets method, especially for conformationally complex ligands. We estimate a 1300-fold reduction in the number of scoring function calls for ligands with 20 rotatable bonds, and even higher reductions are likely for more complex ligands. The improvements reported here, in both docking throughput and search efficiency, considerably expand the domain of applicability of the AutoDock4 scoring function.
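
To illustrate the kind of gradient-based local search referred to above, the following is a minimal sketch of the generic ADADELTA update rule applied to a toy objective. It is not the AutoDock-GPU implementation: the quadratic `toy_score` merely stands in for a docking scoring function, and the names `adadelta_minimize`, `toy_score`, `toy_gradient`, and `TARGET`, as well as the values chosen for `rho`, `eps`, and `max_iters`, are assumptions made for illustration only.

```python
import math

# Hypothetical stand-in for a scoring function: a quadratic bowl whose
# minimum lies at TARGET. A real docking score is far more rugged.
TARGET = [1.0, -2.0, 0.5]


def toy_score(x):
    """Toy objective: half the squared distance to TARGET."""
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))


def toy_gradient(x):
    """Analytic gradient of toy_score with respect to x."""
    return [xi - ti for xi, ti in zip(x, TARGET)]


def adadelta_minimize(score, gradient, x0, rho=0.9, eps=1e-6, max_iters=500):
    """Run the generic ADADELTA update and return the best point seen.

    rho, eps, and max_iters are illustrative values, not settings taken
    from any docking program.
    """
    x = list(x0)
    avg_sq_grad = [0.0] * len(x)  # running average of squared gradients
    avg_sq_step = [0.0] * len(x)  # running average of squared updates
    best_x, best_score = list(x), score(x)

    for _ in range(max_iters):
        g = gradient(x)
        for i, g_i in enumerate(g):
            # Accumulate the squared gradient with exponential decay.
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1.0 - rho) * g_i * g_i
            # Per-coordinate step: ratio of RMS(previous steps) to RMS(gradients).
            step = -math.sqrt(avg_sq_step[i] + eps) / math.sqrt(avg_sq_grad[i] + eps) * g_i
            # Accumulate the squared step with the same decay.
            avg_sq_step[i] = rho * avg_sq_step[i] + (1.0 - rho) * step * step
            x[i] += step
        current = score(x)
        if current < best_score:  # keep the best pose, as a local search would
            best_x, best_score = list(x), current

    return best_x, best_score


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0]
    best_x, best_score = adadelta_minimize(toy_score, toy_gradient, start)
    print("start score:", toy_score(start))
    print("best score: ", best_score)
```

The sketch calls the gradient once per iteration and needs no hand-tuned learning rate, because each coordinate's step size adapts from the running averages. Tracking the best point seen, rather than returning the final iterate, is a design choice of this sketch.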