Smartphone localization is essential to a wide range of applications in shopping malls, museums, office buildings, and other public places. Existing solutions relying on radio fingerprints and/or inertial sensors suffer from large location errors and considerable deployment efforts. We observe an opportunity in the recent trend of increasing numbers of security surveillance cameras installed in indoor spaces to overcome these limitations and revisit the problem of smartphone localization with a fresh perspective. However, fusing vision-based and radio-based systems is non-trivial due to the absence of absolute location, incorrespondence of identification and looseness of sensor fusion. This study proposes iVR, an integrated vision and radio localization system that achieves sub-meter accuracy with indoor semantic maps automatically generated from only two surveillance cameras, superior to precedent systems that require manual map construction or plentiful captured images. iVR employs a particle filter to fuse raw estimates from multiple systems, including vision, radio, and inertial sensor systems. By doing so, iVR outputs enhanced accuracy with zero start-up costs, while overcoming the respective drawbacks of each individual sub-system. We implement iVR on commodity smartphones and validate its performance in five different scenarios. The results show that iVR achieves a remarkable localization accuracy of 0.7m, outperforming the state-of-the-art systems by >70%.