Think of that image on the paper as a barcode that the computer can read: when it sees it, it brings up the 3D world. The image sits on a 2D surface (the sheet of paper), so it's subject to the laws of visual perspective. Assuming the camera is fixed, rotating the paper distorts that square in a predictable way depending on how it's turned and how far it is from the camera. As this changes, an algorithm computes the distance and angle of the sheet of paper relative to the camera and feeds that into the 3D render, which is then drawn over the video feed.
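To get a feel for the "predictable distortion" part, here's a toy sketch of the two simplest cases: the marker shrinks with distance (pinhole camera model), and it foreshortens when turned. The marker size and focal length are made-up illustrative numbers, not anything from a real system; real trackers solve the full 3D pose from all four corners at once, which handles tilt in any direction.

```python
import math

# Hypothetical numbers for illustration: a square marker 10 cm wide,
# and a camera with a focal length of 800 pixels (pinhole model).
MARKER_SIZE_CM = 10.0
FOCAL_LENGTH_PX = 800.0

def marker_distance_cm(apparent_height_px):
    """Pinhole model: apparent size shrinks in proportion to distance,
    so distance = focal_length * real_size / apparent_size."""
    return FOCAL_LENGTH_PX * MARKER_SIZE_CM / apparent_height_px

def marker_yaw_deg(apparent_width_px, apparent_height_px):
    """A square turned about its vertical axis foreshortens horizontally:
    the width/height ratio is roughly cos(yaw)."""
    ratio = min(apparent_width_px / apparent_height_px, 1.0)
    return math.degrees(math.acos(ratio))

# The marker appears 100 px tall and only 50 px wide in the frame:
print(marker_distance_cm(100))   # 80.0 -> the paper is ~80 cm away
print(marker_yaw_deg(50, 100))   # ~60 -> turned about 60 degrees
```

From just those two numbers the renderer already knows how far away and how rotated to draw the 3D model so it appears glued to the paper.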
Essentially, it's a really complicated version of the green screen in a TV weather studio: the computer recognizes the color (or, in the GE example, the shape) and replaces it with the rendered content.