Traditional crop-based methods (a) focus on learning crop templates for better composition. However, when scenes contain chaotic arrangements of subjects, cropping alone rarely yields satisfactory results. Perspective transformation (b) addresses these challenges by adjusting spatial relationships between subjects (e.g., person and tree, red arrow) and scene orientation.
Traditional photography composition approaches are dominated by 2D cropping-based methods. However, these methods fall short when scenes contain poorly arranged subjects. Professional photographers often employ perspective adjustment as a form of 3D recomposition, modifying the projected 2D relationships between subjects while maintaining their actual spatial positions to achieve better compositional balance. Inspired by this artistic practice, we propose photography perspective composition (PPC), which extends beyond traditional cropping-based methods. However, implementing PPC faces significant challenges: the scarcity of perspective transformation datasets and the absence of well-defined assessment criteria for perspective quality. To address these challenges, we present three key contributions: (1) An automated framework for building PPC datasets from expert photographs. (2) A video generation approach that demonstrates the transformation process from suboptimal to optimal perspectives. (3) A perspective quality assessment (PQA) model constructed based on human performance. Our approach is concise, requiring no additional prompt instructions or camera trajectories, and helps guide ordinary users in improving their composition skills.
For single-subject scenarios, PPC enhances composition by seamlessly integrating the subject with its surroundings.
For multi-subject scenes, PPC achieves balanced spatial arrangements to elevate overall visual aesthetics.
For landscape photography, PPC particularly enhances balance and horizontal alignment.
We further discover that PPC applies to UAV photography. PPC successfully identifies optimal views from drone-like perspectives, generating camera movements that adhere to compositional principles while maintaining aesthetic appeal.
When presented with different suboptimal views of the same scene, PPC generates consistent optimal perspectives, maintaining coherence across different inputs.
Our pipeline takes a suboptimal perspective as input and generates a transformation video from the suboptimal to the optimal perspective. This process can be modeled as an image-to-video (I2V) task. We take the last frame of the video as the final optimal perspective and design a method to guide the user's movement. First, we draw a guidance box (the red bbox) on the optimal perspective. Then, using feature matching between the initial and final perspectives, we project this box back onto the original image, producing a distorted box. As the user moves, the box gradually changes shape, approaching a rectangle as the user reaches the true optimal perspective. To simplify the process and accelerate computation, we use only a traditional homography transformation, as sketched below. Additionally, we propose incorporating direct preference optimization (DPO) to align the model with human preferences. This encourages the exploration of aesthetically pleasing trajectories that may differ from the GT, avoiding the limitation that strict GT-based optimization can discourage potentially superior compositional alternatives.
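The homography-based back-projection of the guidance box can be sketched with standard tools. The snippet below is a minimal illustration using OpenCV; the function name `project_guidance_box`, the box format, and all parameter values are our own assumptions, not the authors' released code.

```python
import cv2
import numpy as np


def _gray(img):
    """Convert to grayscale uint8 if needed (ORB expects a single channel)."""
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img


def project_guidance_box(optimal_img, current_img, box_xyxy):
    """Project a rectangular guidance box drawn on the optimal perspective
    back onto the user's current (suboptimal) view.

    box_xyxy: (x1, y1, x2, y2) of the guidance box in the optimal frame.
    Returns a 4x2 array of corner coordinates in the current view,
    i.e. a distorted quadrilateral.
    """
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(_gray(optimal_img), None)
    k2, d2 = orb.detectAndCompute(_gray(current_img), None)

    # Match descriptors and keep the strongest correspondences.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:200]

    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Homography from the optimal view to the current view, robust to outliers.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Warp the four box corners into the current view.
    x1, y1, x2, y2 = box_xyxy
    corners = np.float32([[x1, y1], [x2, y1], [x2, y2], [x1, y2]]).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(corners, H).reshape(-1, 2)
```

As the user's view converges to the optimal perspective, the estimated homography approaches a near-identity mapping, so the projected quadrilateral gradually straightens back into the original rectangle.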
(1) Data Source. We select multiple professional photography datasets, including datasets used in existing composition studies such as GAIC, SACD, FLMS, and FCDB. Furthermore, to expand our data volume, we incorporate Unsplash, currently the largest open-source professional photography dataset. (2) Perspective Transformation Generation. We adopt a 3D reconstruction approach, built mainly upon ViewCrafter. The inputs are a well-composed image and a specified camera motion trajectory; note that this trajectory can be random. Following the trajectory, we generate a video sequence transitioning from the optimal to a suboptimal perspective, and by reversing this sequence we obtain the desired training data (see the sketch below). (3) Data Filtering. Given the limited performance of current reconstruction models, the generated videos must be filtered to remove artifacts such as distortion, static frames, and blur. However, manual filtering at this scale is impractical: our tests showed that a single annotator can filter only about 3K videos per day. Leveraging the rapid advancement of vision language models (VLMs) in scene understanding and automated evaluation, we develop a perspective quality assessment (PQA) model to filter the generated data.
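To make the generate-reverse-filter loop concrete, the sketch below builds one training clip from a single expert photo. The callables `render_trajectory` (standing in for the ViewCrafter inference call) and `pqa_filter` (standing in for the trained PQA model), as well as the trajectory ranges and frame count, are hypothetical placeholders; the paper does not specify these interfaces.

```python
import random
from typing import Callable, List, Optional


def build_training_clip(
    optimal_image,                        # a well-composed expert photo (e.g. np.ndarray)
    render_trajectory: Callable,          # placeholder for the ViewCrafter inference call
    pqa_filter: Callable[[List], bool],   # placeholder for the trained PQA filter
    num_frames: int = 25,
) -> Optional[List]:
    """Build one suboptimal-to-optimal training clip from an expert photo."""
    # 1) Sample a random camera motion that drifts away from the expert
    #    composition (angle/translation ranges are illustrative only).
    trajectory = [
        {
            "yaw": random.uniform(-15, 15),
            "pitch": random.uniform(-10, 10),
            "tx": random.uniform(-0.2, 0.2),
            "tz": random.uniform(-0.1, 0.1),
        }
        for _ in range(num_frames)
    ]

    # 2) Render an optimal -> suboptimal sequence along the trajectory.
    frames = render_trajectory(optimal_image, trajectory)

    # 3) Reverse the sequence so the clip ends at the expert composition.
    clip = list(reversed(frames))

    # 4) Discard clips showing distortion, static frames, or blur.
    return clip if pqa_filter(clip) else None
```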