Monthly Updates

Tracking the progress, challenges, and milestones of my Final Year Project.

March 2026

March: RL Pipeline Optimization & Control Strategy Redesign

Work Carried Out

In March, I focused on improving the stability and efficiency of the reinforcement learning (RL) pipeline. During the initial phase, I debugged and tested the custom RL environment, then optimized training performance by reducing environment reset times. Several test training runs were conducted to evaluate system behavior.

I also refined the reward function through reward shaping and reviewed relevant RL literature to guide improvements. Initial training sessions were carried out with hyperparameter tuning to assess performance and learning stability.

Problems Encountered

A key limitation was identified in the control strategy. The residual torque-based approach was ineffective, as RL signals could not match the high update rate (1 kHz) of the PX4 Autopilot rate controller.

Solutions / How Issues Were Addressed

To resolve this, the control strategy was redesigned to inject residual angular rate commands at an earlier stage in the PX4 control stack. This improved compatibility between the RL controller and the flight control system, leading to more effective control behavior.
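The injection scheme can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names and the residual bound are assumptions, and in the real system the summed setpoint would be forwarded into the PX4 control stack rather than returned.

```python
# Sketch of residual angular-rate injection: the learned residual is clamped
# and added to the baseline rate setpoint before it reaches the rate controller.
# RESIDUAL_LIMIT and the function names are illustrative assumptions.
RESIDUAL_LIMIT = 0.5  # rad/s, assumed bound on the learned correction

def inject_residual(baseline_rates, residual_rates, limit=RESIDUAL_LIMIT):
    """Add a clamped RL residual to the baseline roll/pitch/yaw rate setpoints."""
    clamped = [max(-limit, min(limit, r)) for r in residual_rates]
    return [b + r for b, r in zip(baseline_rates, clamped)]
```

Clamping the residual keeps an early, untrained policy from overpowering the baseline controller, which is what makes this kind of injection safe during training.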

February 2026

February: RL Environment Development & Training Optimization

Work Carried Out

In February, I focused on developing the reinforcement learning training environment using the Gymnasium library. A custom RL environment was implemented, including the design of the state space representation and core environment functions such as step, reset, and reward calculation. The environment was structured to ensure stable interaction between the learning agent and the PX4 Autopilot-controlled hexacopter simulation.
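The environment structure described above can be sketched as a skeleton of the Gymnasium interface. Plain Python is used here for brevity: the actual environment subclasses gymnasium.Env and reads its state from the PX4-controlled Gazebo simulation, whereas the dynamics below are stubbed and the state size is illustrative.

```python
class HexacopterEnvSketch:
    """Skeleton of the custom RL environment's reset/step/reward structure.
    The real environment subclasses gymnasium.Env; dynamics are stubbed here
    and the state vector size is a placeholder."""
    def __init__(self, max_steps=500):
        self.max_steps = max_steps
        self.steps = 0
        self.state = [0.0] * 17  # position, velocity, attitude, rates, ... (size illustrative)

    def reset(self):
        self.steps = 0
        self.state = [0.0] * 17
        return self.state, {}  # observation, info

    def _reward(self, state, action):
        # placeholder: penalize distance from origin and large actions
        return -sum(s * s for s in state[:3]) - 0.01 * sum(a * a for a in action)

    def step(self, action):
        self.steps += 1
        # in the real env: publish the command, advance the simulation, read sensors
        reward = self._reward(self.state, action)
        terminated = False                    # e.g. crash detection in the real env
        truncated = self.steps >= self.max_steps
        return self.state, reward, terminated, truncated, {}
```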

Additionally, I initiated the development of a Curriculum Manager to support curriculum learning, enabling progressive adjustment of task difficulty during training. This framework is intended to allow staged learning, where complexity increases as the agent’s performance improves, enhancing policy convergence and stability.
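One hedged sketch of such a manager, assuming a success-rate trigger for stage advancement; the stage contents, threshold, and window size are illustrative, not the project's tuned values.

```python
class CurriculumManager:
    """Sketch of a curriculum manager: advance to the next difficulty stage
    once the recent success rate clears a threshold. All parameters are
    illustrative assumptions."""
    def __init__(self, stages, threshold=0.8, window=100):
        self.stages = stages        # e.g. progressively harder task settings
        self.threshold = threshold
        self.window = window
        self.level = 0
        self.results = []

    def record(self, success):
        """Log one episode outcome; advance the stage when warranted."""
        self.results.append(bool(success))
        self.results = self.results[-self.window:]
        full = len(self.results) == self.window
        if full and sum(self.results) / self.window >= self.threshold:
            if self.level < len(self.stages) - 1:
                self.level += 1
                self.results = []   # restart the success window at the new stage

    @property
    def task(self):
        return self.stages[self.level]
```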

Problems Encountered

A key challenge arose from the PX4 baseline controller running as a separate process that must remain active during training. This created issues during environment resets, particularly after early crash scenarios.

Solutions / How Issues Were Addressed

To address this, health checks were disabled to prevent simulation failsafes, and the reset mechanism was redesigned to respawn only the hexacopter without restarting PX4. This significantly reduced reset time and improved overall training efficiency.
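The reset flow can be sketched as below, with the actual Gazebo respawn and PX4 re-arm calls stubbed out behind callbacks; the function names are hypothetical, not taken from the project's code.

```python
def fast_reset(respawn_vehicle, rearm_px4, settle_time=1.0):
    """Reset sketch: respawn only the hexacopter model and re-arm PX4,
    instead of tearing down and restarting the whole PX4 SITL process.
    `respawn_vehicle` and `rearm_px4` are hypothetical callbacks that would
    wrap the Gazebo and PX4 interfaces in the real pipeline."""
    respawn_vehicle()   # move/respawn the model at its start pose in Gazebo
    rearm_px4()         # re-arm the (still running) PX4 instance
    return settle_time  # time to let the simulation settle before training resumes
```

Keeping PX4 alive across resets is what removes the dominant cost of a full restart; only the vehicle pose and arming state need to be re-established.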

January 2026

January: Residual RL Integration & Real-World System Validation

Work Carried Out

In January, I focused on refining the residual signal injection mechanism for the reinforcement learning controller, ensuring seamless integration with the baseline PX4 Autopilot while maintaining flight stability. This step was critical to allow learned control actions to augment the existing controller without destabilizing the system.

I also established a communication bridge between ROS 2 and PX4 using the px4_msgs package and Micro XRCE-DDS middleware. This enabled reliable real-time data exchange between PX4 and the ROS 2 node, forming a crucial component of the reinforcement learning training pipeline.

Problems Encountered

A key issue arose when attempting to reduce sensor noise as suggested by the supervisor. When noise levels were reduced too aggressively, PX4 refused to arm the hexacopter. The unrealistically clean sensor data triggered PX4’s safety checks, which interpreted it as faulty or disconnected sensors.

Solutions / How Issues Were Addressed

This issue was resolved by carefully tuning the sensor noise parameters. Noise levels were reduced to the minimum values that still satisfied PX4’s arming checks. Through iterative tuning, a balance was achieved that provided cleaner data for reinforcement learning while remaining realistic enough for PX4 pre-flight validation.
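The iterative tuning can be illustrated as a search for the lowest noise level that still passes the arming check. The check itself is stubbed here; in reality the criterion lives inside PX4's pre-flight validation and each candidate level requires a simulated arming attempt.

```python
def min_armable_noise(levels, passes_arming_check):
    """Return the smallest candidate noise level that still satisfies the
    arming check, or None if no level passes. `passes_arming_check` stands
    in for a full simulated arming attempt against PX4's pre-flight checks."""
    armable = [n for n in levels if passes_arming_check(n)]
    return min(armable) if armable else None
```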

December 2025

December: Aerial Manipulator Integration & Simulation Validation

Work Carried Out

In December, I focused on validating and extending the hexacopter and aerial manipulator simulation. The hexacopter SDF model was validated through motor tests in Gazebo, where stable hover was achieved at approximately 459 rad/s per motor with the current mass and configuration, confirming that the simulation closely matches the real platform.

I then integrated the 3-DoF manipulator with the hexacopter in simulation and tested the combined aerial manipulator system. These tests revealed significant dynamic disturbances on the hexacopter caused by manipulator motion, highlighting important coupling effects that must be addressed in future control design.

In parallel, I worked on the planning and design considerations for a residual reinforcement learning controller. This included defining the action space, observation space, and reward structure, and outlining how learned residual actions would augment the baseline PX4 controller while preserving stability and supporting future sim-to-real transfer. No RL training or implementation was carried out at this stage.

I also contributed to the mid-review presentation and report documentation, summarizing simulation results, system integration progress, and planned control developments.

Problems Encountered

A key issue arose from the model spawning mechanism. The hexacopter was spawned through PX4 SITL, while the manipulator was spawned separately by Gazebo, leading to difficulties when attaching the manipulator at the world level and resulting in incorrect model linkage and dynamics.

Solutions / How Issues Were Addressed

This was resolved by integrating the manipulator directly into the hexacopter SDF model, ensuring that both the hexacopter and manipulator were spawned together under PX4 SITL. This provided correct kinematic and dynamic coupling and enabled reliable combined-system simulation and testing.

November 2025

November: Residual RL Control Setup and SDF-to-URDF Conversion

Work Carried Out

In November, I worked on the control and simulation setup for the reinforcement learning–based aerial manipulator. I reviewed research on residual reinforcement learning and decided to use a residual RL approach that combines the existing PX4 controller with a learned RL component. This keeps the stability of the classical controller while allowing RL to fine-tune the behaviour, which supports smoother transfer from simulation to the real hexacopter.

I finalized the SDF model of the hexacopter to closely match the real platform and created an equivalent URDF model for use in ROS 2 and related tools. I also updated the PX4 SITL firmware to support residual control by adding the necessary residual control topics, such as residual thrust and residual torque, for the future RL agent.

In addition, I performed an open-loop motor test and verified hover at approximately 459 rad/s, which validated rotor parameters and supported simulation tuning.
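A consistency check of this kind follows from balancing weight against total rotor thrust: with six identical rotors obeying T = k_f·ω², hover requires 6·k_f·ω² = m·g, so ω = √(m·g / (6·k_f)). The mass and thrust-constant values in the test are illustrative placeholders, chosen only so the formula lands near the reported ~459 rad/s; they are not the platform's actual parameters.

```python
import math

def hover_speed(mass_kg, kf, n_rotors=6, g=9.81):
    """Per-rotor angular speed (rad/s) at which n_rotors identical rotors
    with thrust constant kf (N per (rad/s)^2) support the vehicle weight."""
    return math.sqrt(mass_kg * g / (n_rotors * kf))
```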

In parallel, I evaluated cost-effective compute options with enough resources for Gazebo simulation and RL training. This included comparing GPU-enabled VMs and workstation options based on capability and pricing.

Problems Encountered

The main difficulty was converting the detailed hexacopter SDF into a valid URDF, as existing SDF-to-URDF conversion tools often failed or produced incorrect models.

Solutions / How Issues Were Addressed

I used GitHub Copilot together with a step-by-step divide-and-conquer approach. I manually converted links, joints, and inertia parameters from SDF to URDF in small sections, checking and fixing errors incrementally until the URDF loaded correctly in the required tools.

October 2025

October: Hexacopter Model Refinement and PX4 SITL Integration

Work Carried Out

In October, I refined the hexacopter SDF model to closely represent the real platform. Each link and joint was positioned to match the physical frame, and I calculated and documented inertia matrices for all links to improve dynamic realism. The updated model was integrated with the PX4 SITL setup and verified through multiple test runs to confirm stability and responsiveness.

I also identified and corrected an issue in rotor modeling. Previously, maximum rotor speeds were estimated without considering motor current and power limits. After incorporating these constraints, I recalculated the thrust and torque constants (kf and mc) and updated the PX4 airframe configuration to keep the simulation consistent with the physical system.

Problems Encountered

The main challenge was determining accurate inertia matrices for the hexacopter links without access to a detailed CAD model.

Solutions / How Issues Were Addressed

I derived link inertias using analytical approximations of symmetric 3D shapes. I aligned coordinate frames with the principal inertia axes, which allowed the inertia matrices to be simplified to diagonal form while maintaining physical realism. This improved simulation stability and produced more realistic dynamics.
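For example, approximating a link as a solid cylinder whose symmetry axis is a principal axis gives a diagonal inertia matrix directly from standard formulas: Ixx = Iyy = m(3r² + h²)/12 and Izz = m·r²/2 about the centroid. The dimensions used in the test are placeholders, not the hexacopter's actual link measurements.

```python
def cylinder_inertia(mass, radius, height):
    """Diagonal inertia of a solid cylinder about its centroid, with the
    symmetry (z) axis as a principal axis:
    Ixx = Iyy = m*(3*r**2 + h**2)/12,  Izz = m*r**2/2."""
    ixx = iyy = mass * (3 * radius**2 + height**2) / 12.0
    izz = mass * radius**2 / 2.0
    return ixx, iyy, izz
```

Because the frames are aligned with the principal axes, all products of inertia vanish and only these three diagonal terms need to appear in the SDF `<inertia>` block.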

September 2025

September: Hexacopter Simulation Setup and Parameter Selection

Work Carried Out

In September, I followed a structured course on ROS 2 and Gazebo simulation to support the development of our aerial manipulator platform. I reviewed an open-source hexacopter SDF repository to understand best practices for structuring multicopter models, sensor plugins, and rotor dynamics configuration. Since our platform is a custom hexacopter with different geometry and parameters, I did not reuse the provided models directly. Instead, I used the repository mainly as a learning reference to guide how to build and organize our own hexacopter SDF.

I then developed our custom hexacopter SDF for Gazebo Harmonic and added sensor plugins for GPS, barometer, and IMU. Rotor dynamics were implemented using the MulticopterMotorModel plugins. I successfully spawned the hexacopter in Gazebo Harmonic and tested its basic stability and hover behavior.

In addition, I derived motor and rotor parameters based on SunnySky X2216 KV1250 motors with 11×5.7-inch propellers. I documented the estimated values and assumptions to support integration and future testing.
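One common way to estimate a thrust constant from such data is to fit the quadratic rotor model T = k_f·ω² at a measured static-thrust operating point. The thrust and speed numbers in the test are placeholders, not the measured X2216 figures.

```python
def thrust_constant(thrust_n, omega_rad_s):
    """Estimate kf (N per (rad/s)^2) from one static-thrust measurement,
    assuming the quadratic rotor model T = kf * omega**2."""
    return thrust_n / omega_rad_s**2
```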

Problems Encountered

The reference SDF model I studied was built for Ignition Gazebo, so I encountered multiple compatibility errors while developing our custom SDF for Gazebo Harmonic. Many plugin and model components required changes to work correctly in the Harmonic setup.

Solutions / How Issues Were Addressed

I referred to official documentation and resolved issues step by step by building and testing the model in smaller parts. By carefully adjusting each plugin and component and validating behavior incrementally, I was able to assemble a working custom SDF that runs in Gazebo Harmonic.

August 2025

August: Reward Shaping and Improved Waypoint Tracking

Work Carried Out

In August, I improved the initial RL implementation and refined the policy through reward shaping to achieve smoother waypoint following. By modifying the state space and reward function, I updated the setup so the quadrotor could follow waypoints continuously without stopping at intermediate points.

The state space was expanded to a 20-dimensional vector using scaled position, velocity, attitude, and body rates, together with the relative position to the current and next waypoint and a desired final yaw target. To improve generalization, waypoint trajectories were generated from line, curve, and helix paths.

The reward design was updated to encourage tighter tracking and better final behavior. This included a stronger position-error penalty while keeping velocity and rate penalties for smooth motion. A large reward was given at each waypoint. After reaching the final waypoint, additional rewards encouraged a smooth stop with low velocity, alignment to the target yaw, and maintaining a stable, level attitude for a short hold period. Crash and out-of-bounds penalties were kept unchanged.
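The structure of the shaped reward can be sketched as below. The weights, bonus, and penalty values are illustrative stand-ins, not the tuned coefficients, and the full implementation adds the final-waypoint stopping, yaw-alignment, and hold-period terms described above.

```python
def shaped_reward(pos_err, speed, body_rate, reached_waypoint, crashed,
                  w_pos=2.0, w_vel=0.1, w_rate=0.05,
                  waypoint_bonus=10.0, crash_penalty=-100.0):
    """Sketch of the shaped reward: penalties on position error, speed, and
    body rates; a bonus at each waypoint; a large negative reward on crash.
    All coefficients here are illustrative assumptions."""
    if crashed:
        return crash_penalty
    r = -w_pos * pos_err - w_vel * speed - w_rate * body_rate
    if reached_waypoint:
        r += waypoint_bonus
    return r
```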

I trained multiple policies and analyzed reward progression to identify the best-performing model. I also plotted key dynamic parameters such as position, angular velocity, roll, pitch, yaw, linear acceleration, forces, and torques to better understand system behavior.

3D flight path while tracking a single waypoint using the best-performing RL checkpoint (dashed: executed path history).

Time-series of quadrotor states and control outputs for the same single-waypoint flight path shown above, generated using the best performing RL checkpoint.

Problems Encountered

The main challenge was selecting the best trained policy among multiple training runs.

Solutions / How Issues Were Addressed

I trained up to 15 million time steps and saved checkpoints every 100,000 steps. Using TensorBoard, I compared cumulative reward trends and selected the policy with the highest overall performance and most consistent improvement.
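Programmatically, the selection amounts to ranking checkpoints by their evaluation returns; the minimal sketch below assumes returns are already collected per checkpoint, whereas in practice the comparison was done visually in TensorBoard.

```python
def best_checkpoint(eval_returns):
    """Pick the checkpoint (keyed by training step) with the highest mean
    evaluation return. `eval_returns` maps step -> list of episode returns."""
    means = {step: sum(r) / len(r) for step, r in eval_returns.items()}
    return max(means, key=means.get)
```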

July 2025

July: RL Foundations & Initial PPO Waypoint Tracking

Work Carried Out

In July, I focused on building a strong theoretical and practical foundation in Reinforcement Learning (RL). I followed a YouTube series on RL fundamentals and completed an Udemy course to strengthen core concepts and implementation details. I also reviewed four research papers on RL-based control for aerial manipulators to understand prior work, common challenges, and evaluation practices.

In parallel, I studied quadrotor dynamics and implemented custom RL environments using the Gymnasium library. I developed an initial quadcopter environment and trained a PPO (Proximal Policy Optimization) agent to perform a simple waypoint-following task in a Python-based simulation.

As part of this implementation, I built an RL-based controller for autonomous navigation with a focus on training stability. The environment uses scaled observation states to improve convergence during PPO training. The state space was defined as a 17-dimensional vector including position, velocity, attitude (quaternion), body rates, relative position to the current waypoint, and a final-waypoint flag.
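Observation scaling of this kind can be sketched as a per-component normalization into roughly [-1, 1]. The bounds are illustrative: the actual scaling constants depend on the workspace size and the vehicle's dynamic limits.

```python
def scale_observation(obs, bounds):
    """Scale each observation component into [-1, 1] by dividing by a
    per-component bound and clipping; bounds are illustrative tuning choices."""
    return [max(-1.0, min(1.0, x / b)) for x, b in zip(obs, bounds)]
```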

The reward function was designed to encourage stable and goal-directed motion. It included a distance-to-waypoint penalty, velocity and body-rate penalties to promote smooth flight, and a time-step penalty to encourage faster completion. Progress rewards and waypoint-arrival bonuses were added to guide the agent through the path, along with a final stopping bonus based on low linear and angular speed. Strong penalties were applied for crashes to discourage unsafe behavior.

Additionally, I researched different simulation tools and their requirements, including AirSim, Isaac Sim, and PyBullet, to evaluate potential alternatives to Gazebo for future integration and testing.

Problems Encountered

The main challenge was the absence of a complete simulation environment for realistic testing. This slowed down early experimentation and iteration.

Solutions / How Issues Were Addressed

As a temporary solution, I used a Python-based quadrotor simulator from GitHub to validate and debug the PPO training pipeline. This enabled faster iteration and helped confirm that the environment design and training setup were working correctly.