
11 April 2013

Kinect Fusion with MS SDK

Recently, Microsoft updated the Kinect SDK with support for Kinect Fusion and some new interaction interfaces. In this post I will introduce the minimum code required to implement Kinect Fusion with the Kinect SDK.
Before the code, I'd like to start with an overview of the processing pipeline for Kinect Fusion. Here is an image from the MS website (link):
Figure 1 Process pipeline for Kinect Fusion

In order to make use of the Kinect Fusion API, we need to include NuiKinectFusionApi.h and link the project against KinectFusion170_32.lib. The files are located in ${Kinect_SDK DIR}/Developer Toolkit v1.7.0/inc/ and ${Kinect_SDK DIR}/Developer Toolkit v1.7.0/Lib/x86/ respectively.
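As a minimal sketch (the paths and library name are for the 32-bit v1.7.0 toolkit; adjust them to your installation, or add the library in the project's linker settings instead of using the pragma):

// Kinect SDK and Kinect Fusion headers
#include <Windows.h>
#include <NuiApi.h>
#include <NuiKinectFusionApi.h>

// Link against the Kinect Fusion library (32-bit, Developer Toolkit v1.7.0)
#pragma comment(lib, "KinectFusion170_32.lib")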

I wrote a depthSensor class to make the process easier to manage. First, the declaration of the class:
class depthSensor
{
private:
 static const int WIDTH = 640;
 static const int HEIGHT = 480;
 static const int UPDATE_WEIGHT = 10;
 int trackingErrorCount;

 INuiSensor*  mNuiSensor;
 // -- 3D volume of reconstruction
 INuiFusionReconstruction* m_pVolume;

 NUI_FUSION_IMAGE_FRAME*     m_pDepthFloatImage;
 NUI_FUSION_IMAGE_FRAME*     m_pPointCloud;
 NUI_FUSION_IMAGE_FRAME*     m_pShadedSurface;

 HANDLE   mNextDepthFrameEvent;
 HANDLE   mNextColorFrameEvent;
 HANDLE   mDepthStreamHandle;
 HANDLE   mColorStreamHandle;

 void initKinectFusion();
 void createInstance(int const kinect_index);
 
public:
 depthSensor();
 void init(int const kinect_index);
 void processDepth(cv::Mat& mat);
 void processKinectFusion(const NUI_DEPTH_IMAGE_PIXEL* depthPixel, int depthPixelSize, cv::Mat& mat);
 const Matrix4& IdentityMatrix();
 ~depthSensor();
};
The most important member of the class is:
 // -- 3D volume of reconstruction
 INuiFusionReconstruction* m_pVolume;
It holds the 3D volumetric reconstruction of the scene, which is cumulatively integrated from new depth data from the Kinect, as shown in Figure 1.
These two member functions initialise the Kinect instance and set up its basic state:
 void createInstance(int const kinect_index);
 void init(int const kinect_index);
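Their full definitions are in the downloadable source at the end of the post; a minimal sketch of createInstance(), for example, could simply open the sensor by index:

void depthSensor::createInstance(int const kinect_index)
{
 // Open the Kinect with the given index and keep the INuiSensor pointer.
 HRESULT hr = NuiCreateSensorByIndex(kinect_index, &mNuiSensor);
 if (FAILED(hr))
  throw std::runtime_error("NuiCreateSensorByIndex failed.");
}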
At the end of the init() function, we call initKinectFusion(), which initialises the parameters and state related to the Kinect Fusion algorithm. Here is the definition of initKinectFusion():
void depthSensor::initKinectFusion()
{
 HRESULT hr = S_OK;

 NUI_FUSION_RECONSTRUCTION_PARAMETERS reconstructionParams;
 reconstructionParams.voxelsPerMeter = 256; // 1000mm / 256vpm = ~3.9mm/voxel
 reconstructionParams.voxelCountX = 512;  // 512 / 256vpm = 2m wide reconstruction
 reconstructionParams.voxelCountY = 384;  // Memory = 512*384*512 * 4bytes per voxel
 reconstructionParams.voxelCountZ = 512;  // ~384 MB for the volume (CPU processor type is used below)

 hr = NuiFusionCreateReconstruction(&reconstructionParams,
          NUI_FUSION_RECONSTRUCTION_PROCESSOR_TYPE_CPU,
          -1, &IdentityMatrix(), &m_pVolume);
 if (FAILED(hr))
  throw std::runtime_error("NuiFusionCreateReconstruction failed.");

 // DepthFloatImage
 hr = NuiFusionCreateImageFrame(NUI_FUSION_IMAGE_TYPE_FLOAT, WIDTH, HEIGHT, nullptr, &m_pDepthFloatImage);
 if (FAILED(hr))
  throw std::runtime_error("NuiFusionCreateImageFrame failed (Float).");

 // PointCloud
 hr = NuiFusionCreateImageFrame(NUI_FUSION_IMAGE_TYPE_POINT_CLOUD, WIDTH, HEIGHT, nullptr, &m_pPointCloud);
 if (FAILED(hr))
  throw std::runtime_error("NuiFusionCreateImageFrame failed (PointCloud).");

 // ShadedSurface
 hr = NuiFusionCreateImageFrame(NUI_FUSION_IMAGE_TYPE_COLOR, WIDTH, HEIGHT, nullptr, &m_pShadedSurface);
 if (FAILED(hr))
  throw std::runtime_error("NuiFusionCreateImageFrame failed (Color).");

 m_pVolume->ResetReconstruction( &IdentityMatrix(), nullptr);
}
Within the function, we need to set up the volume's parameters, such as its size and resolution. Depending on the memory available on the GPU, the voxel size and count can be chosen for higher or lower spatial resolution. Three kinds of data are produced through the Kinect Fusion process. The depth float image contains the depth data from the Kinect in floating-point format. From the depth data, we can create point cloud data. The shaded surface image shows the ray-cast, shaded surface of the volume from the current camera position. The following code shows the process from depth image to volume reconstruction and ray-cast image for visualisation.
void depthSensor::processKinectFusion( const NUI_DEPTH_IMAGE_PIXEL* depthPixel, int depthPixelSize, cv::Mat& mat ) 
{
 // Depth pixel (mm) to float depth (m)
 HRESULT hr = NuiFusionDepthToDepthFloatFrame( depthPixel, WIDTH, HEIGHT, m_pDepthFloatImage,
  NUI_FUSION_DEFAULT_MINIMUM_DEPTH, NUI_FUSION_DEFAULT_MAXIMUM_DEPTH, TRUE );
 if (FAILED(hr)) {
  throw std::runtime_error( "::NuiFusionDepthToDepthFloatFrame failed." );
 }

 // Update volume based on new depth frame
 Matrix4 worldToCameraTransform;
 m_pVolume->GetCurrentWorldToCameraTransform( &worldToCameraTransform );
 // ProcessFrame performs camera tracking (aligning the new depth frame to the
 // reconstruction) and integrates the frame into the volume in one call, so a
 // separate IntegrateFrame() call is not needed.
 hr = m_pVolume->ProcessFrame( m_pDepthFloatImage, NUI_FUSION_DEFAULT_ALIGN_ITERATION_COUNT,
  UPDATE_WEIGHT, &worldToCameraTransform );
 if (FAILED(hr)) {
  // If tracking failed, count it.
  // The reconstruction is reset once enough failures have accumulated.
  ++trackingErrorCount;
  if ( trackingErrorCount >= 100 ) {
   trackingErrorCount = 0;
   m_pVolume->ResetReconstruction( &IdentityMatrix(), nullptr );
  }

  return;
 }

 // Get PointCloud from the volume
 hr = m_pVolume->CalculatePointCloud( m_pPointCloud, &worldToCameraTransform );
 if (FAILED(hr)) {
  throw std::runtime_error( "CalculatePointCloud failed." );
 }

 // Ray-casting shaded PointCloud based on current camera pose
 hr = ::NuiFusionShadePointCloud( m_pPointCloud, &worldToCameraTransform,
  nullptr, m_pShadedSurface, nullptr );
 if (FAILED(hr)) {
  throw std::runtime_error( "::NuiFusionShadePointCloud failed." );
 }

 // Get 2D frame to be shown
 INuiFrameTexture * pShadedImageTexture = m_pShadedSurface->pFrameTexture;
 NUI_LOCKED_RECT ShadedLockedRect;
 hr = pShadedImageTexture->LockRect(0, &ShadedLockedRect, nullptr, 0);
 if (FAILED(hr)) {
  throw std::runtime_error( "LockRect failed." );
 }

 mat = cv::Mat( HEIGHT, WIDTH, CV_8UC4, ShadedLockedRect.pBits );

 // We're done with the texture so unlock it
 pShadedImageTexture->UnlockRect(0);
}
At each frame, the camera pose is tracked so that the new point cloud can be integrated into the current volume correctly. If the camera is static, we don't need to track the camera position and the transformation matrix can simply be set to identity.
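The IdentityMatrix() helper used above is not shown here; a minimal sketch (assuming it simply returns a static identity Matrix4, the Kinect SDK's 4x4 matrix type) could be:

const Matrix4& depthSensor::IdentityMatrix()
{
 // Matrix4 is the Kinect SDK's 4x4 matrix type (fields M11..M44).
 static Matrix4 identity = { 1.0f, 0.0f, 0.0f, 0.0f,
                             0.0f, 1.0f, 0.0f, 0.0f,
                             0.0f, 0.0f, 1.0f, 0.0f,
                             0.0f, 0.0f, 0.0f, 1.0f };
 return identity;
}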

Full code download: depthSensor.cpp; depthSensor.h

18 March 2013

Experiments on interference problem of two Kinects

In this post, I will show the results of two experiments that attempt to mitigate the cross-talk between two Kinects. One is a time-multiplexing method and the other polarizes the infrared light with polarizing film.

- Time multiplexing method
As mentioned in another post, the MS Kinect SDK only provides a function (NuiSetForceInfraredEmitterOff) to turn the IR emitter on/off, without any guarantee of how quickly it takes effect. Nevertheless, it is worth experimenting with the performance of this function.
The idea is to turn off the other Kinect's IR emitter while the current Kinect is capturing a depth frame. The code looks like this:
 // -- process 1st kinect
 kinect2->turnOffIR(true);
 kinect1->processDepthFrame();
 kinect2->turnOffIR(false);
 kinect1->processColorFrame();
 kinect1->processPointCloud(); 
 // -- process 2nd kinect
 kinect1->turnOffIR(true);
 kinect2->processDepthFrame();
 kinect1->turnOffIR(false);
 kinect2->processColorFrame();
 kinect2->processPointCloud();
turnOffIR is a member function of the depthSensor class that calls NuiSetForceInfraredEmitterOff.
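As a minimal sketch (assuming mNuiSensor is the INuiSensor pointer held by the class); for the delayed experiments described below, a delay (e.g. a Sleep() call) is added after calling it:

void depthSensor::turnOffIR(bool off)
{
 // TRUE forces the IR emitter off; FALSE allows it to turn back on.
 HRESULT hr = mNuiSensor->NuiSetForceInfraredEmitterOff(off ? TRUE : FALSE);
 if (FAILED(hr))
  throw std::runtime_error("NuiSetForceInfraredEmitterOff failed.");
}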
The result of the above code is unstable because there is insufficient time for the SDK to turn off the IR emitter before the depth frame is retrieved. Therefore, I added different delays after calling turnOffIR, and the results are shown below.
Figure 1 No delay.
Figure 2 100 ms delay.
Figure 3 150 ms delay.
Figure 4 200 ms delay.
As the delay increases, fewer holes are caused by the interference. The best and most stable result is achieved with about a 200 ms delay per Kinect. For smaller delays, such as 50 ms, the point cloud becomes unstable.
Figure 5 50 ms delay.
Figure 5 shows point clouds captured at different times with the same delay. The quality of the point cloud varies a lot. I guess that is because the time required to turn off the IR emitter varies, so sometimes the emitter is off in time and sometimes it is not.
With the 200 ms delay, the frame rate of the point cloud drops to roughly 2.5 fps (400 ms per full two-Kinect cycle).

- Polarization method
The idea is to distinguish the IR light from the different emitters by polarizing it in different directions. Figure 6 shows the principle of light polarization. In the upper-right diagram we can see the letter 'A' easily, as the two polarizing films are placed in the same direction. The lower-left diagram shows that the light is blocked completely when one film is rotated 90 degrees relative to the other.
Figure 6 Polarization using two polarizing films. 
Finally, two films with the same direction of polarization are attached to each Kinect (one over the emitter, one over the IR camera), as shown in Figure 7. The other Kinect has two films rotated by 90 degrees. Hence, each IR camera can only see the IR light from its own emitter.
Figure 7 Two films with the same direction of polarization are attached to the IR emitter and camera. The blue stripes are just for illustration.
Here are some point cloud results with polarization. Compared to the result without polarization, the polarization method does reduce the holes in the interference area while still maintaining a real-time frame rate. However, the point cloud is unstable: as shown in Figures 8 and 9, the appearance of the holes changes all the time. This happens less within the shared field of view and is more likely to occur on far-away scenes and steep surfaces. I guess the reason is that in such situations the IR signal received by the IR camera is too weak and is discarded.
Figure 8 Top view of point cloud with polarization at different time.
Figure 9 Side view of point cloud with polarization at different time.
To sum up, for a near field of interest the polarization method works relatively well. It is a simple, non-invasive workaround for the interference problem.
UPDATE: A more detailed experiment described in another post (link) shows that the polarization method hardly improves the overall interference between two Kinects.

16 March 2013

Point clouds generation and alignment for two Kinects

Finally, they are aligned.

In this post, I will introduce coloured point cloud generation based on the depth (disparity) and RGB images captured by the IR and RGB cameras respectively. Afterwards, the point clouds generated by the two Kinects are aligned based on the stereo calibration results from the previous post: link.

- Point cloud generation
As the colour camera and depth camera are not at the same location and their fields of view differ, the depth frame and colour frame are not aligned without a transformation. It is easy to implement the alignment between the colour and depth frames using the Kinect SDK, which provides methods to map from one frame to the other. Here is the procedure with the API I used (a code sketch follows the list):

i. Given a pixel coordinate in the depth image, use NuiImageGetColorPixelCoordinatesFromDepthPixelAtResolution to get its corresponding pixel coordinate (x, y) in the colour image.
ii. If the pixel (x, y) falls outside the boundary of the colour image, set this point to nothing (located at the origin, without colour). Otherwise, colour the depth pixel.
iii. Given a pixel coordinate in the depth image, use NuiTransformDepthImageToSkeleton to get its 3D coordinate in skeleton space. According to the SDK documentation, the skeleton system mirrors the scene, so a person facing the Kinect is considered to be looking in the -z direction in skeleton space. Here we need to invert the x value in order to get a normal, non-mirrored image.
iv. The function NuiTransformDepthImageToSkeleton returns a four-element vector represented as [x, y, z, w]. In order to get the real coordinate of the pixel (point), we divide the first three elements (x, y and z) by the fourth, that is: X = x/w, Y = y/w and Z = z/w.
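As a rough sketch of steps i-iv (the function names are the SDK calls listed above; the 640x480 resolutions and the packed depth value passed in are assumptions that depend on how the streams are opened):

// Sketch for one depth pixel: find its colour-image coordinate (steps i-ii) and
// its 3D skeleton-space position (steps iii-iv).
// packedDepth is the raw 16-bit value from the depth stream (depth in mm shifted
// left by the 3 player-index bits), which is the form both functions expect.
void depthPixelToPoint(LONG depthX, LONG depthY, USHORT packedDepth)
{
 LONG colorX = 0, colorY = 0;
 HRESULT hr = NuiImageGetColorPixelCoordinatesFromDepthPixelAtResolution(
  NUI_IMAGE_RESOLUTION_640x480,   // colour resolution
  NUI_IMAGE_RESOLUTION_640x480,   // depth resolution
  nullptr,                        // default view area
  depthX, depthY, packedDepth,
  &colorX, &colorY );

 bool hasColour = SUCCEEDED(hr) &&
  colorX >= 0 && colorX < 640 && colorY >= 0 && colorY < 480;

 // Skeleton-space position; divide by w and invert x (steps iii-iv).
 Vector4 v = NuiTransformDepthImageToSkeleton(
  depthX, depthY, packedDepth, NUI_IMAGE_RESOLUTION_640x480);
 float X = -v.x / v.w;
 float Y =  v.y / v.w;
 float Z =  v.z / v.w;

 // ... store (X, Y, Z) and, if hasColour, the RGB value at (colorX, colorY).
}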

- Point cloud alignment
It is a straightforward problem if we can correctly calibrate the extrinsic parameters and clearly define the coordinate systems. One thing I was stuck on for a long time is the different coordinate definitions for the Kinect. As shown in Figure 1, the SDK defines a right-handed coordinate system that places the Kinect at the origin, with the positive z-axis extending in the direction the Kinect is pointing, the positive y-axis extending upward, and the positive x-axis extending to the left. However, the MATLAB calibration toolbox also uses a right-handed system, but one rotated 180 degrees around the z-axis relative to the skeleton coordinate system.

Figure 1 Kinect skeleton coordinate (blue) and camera coordinate (red) in MATLAB calibration toolbox.
Therefore, if we use the calibration result from the MATLAB toolbox, we need to transform each point from skeleton coordinates to MATLAB camera coordinates, apply the transformation from one Kinect to the other, and then transform the result back to skeleton coordinates.
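As a sketch of that chain using OpenCV types (the 180-degree rotation about z simply negates x and y, and is its own inverse; R and T are assumed to be the stereo extrinsics of the right camera with respect to the left, and the units of T and of the points must be made consistent first, since the toolbox reports millimetres while skeleton space is in metres):

#include <opencv2/core/core.hpp>

// Map a point from the left Kinect's skeleton frame to the right Kinect's
// skeleton frame via the MATLAB camera frames.
cv::Point3f leftSkeletonToRightSkeleton(const cv::Point3f& pL,
                                        const cv::Matx33f& R,
                                        const cv::Vec3f& T)
{
 // Skeleton -> MATLAB camera frame: 180-degree rotation about z (negate x, y).
 cv::Vec3f xL(-pL.x, -pL.y, pL.z);

 // Left camera frame -> right camera frame: XR = R * XL + T.
 cv::Vec3f xR = R * xL + T;

 // MATLAB camera frame -> skeleton frame: apply the same rotation again.
 return cv::Point3f(-xR[0], -xR[1], xR[2]);
}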
Figure 2 An animation for the point cloud alignment result.
As shown in Figure 2, the alignment result is promising, but the interference problem still exists.

12 March 2013

Calibration of Kinect IR camera


Since the SDK function NuiImageGetColorPixelCoordinatesFromDepthPixelAtResolution gives pixel coordinates in colour space that correspond to specified pixel coordinates in depth space, I use the IR camera (depth space) as the reference.
I used Bouguet's MATLAB calibration toolbox to calibrate each IR camera individually and then find the extrinsic parameters between them. In this post I will focus on the results and discussion; a comprehensive, complete step-by-step tutorial for the toolbox can be found on its website.
Before taking IR images for the calibration, I turned off the IR emitter with the SDK function NuiSetForceInfraredEmitterOff(TRUE) so that no IR pattern would affect the quality of the IR images. The two IR cameras are referred to as the left and right cameras respectively. For each camera, I chose 12 different IR images representing different poses of the calibration pattern, as shown in Figure 1.
Figure 1 12 calibration images
Let's start with the left IR camera. After reading the images into MATLAB, we need to extract the corners of the pattern. This is done by selecting the four corners of the calibration pattern in each image; the toolbox then detects the grid corners within the selected area automatically, as shown in Figure 2. When selecting the four boundary points, I would suggest selecting inner points rather than the outer boundary of the pattern. The toolbox then detects the corners better, which makes life easier at the later re-computation stage.
Figure 2 Grid corners extraction
After all grid corners have been extracted, we can visualise the extrinsic results (Figure 3) and reprojection error.
Figure 3 Extrinsic parameters (camera-centred)
The intrinsic parameters of the left camera can then be computed, with uncertainties, as shown below. Larger uncertainties indicate larger errors. In order to minimise the error, we can recompute the parameters by choosing the images that contribute the most error in the error diagram and re-extracting the corners in a smaller, less distorted area. As shown in Figure 4, the corners are extracted within a 5x5 area instead of the original 5x7 area in the selected images.

Calibration results after optimization (with uncertainties):
Focal Length:          fc = [ 569.92842   571.44089 ] ± [ 16.23688   14.06564 ]
Principal point:       cc = [ 278.08552   258.82575 ] ± [ 25.86257   24.07492 ]
Skew:             alpha_c = [ 0.00000 ] ± [ 0.00000  ]   => angle of pixel axes = 90.00000 ± 0.00000 degrees
Distortion:            kc = [ -0.28627   0.90685   0.00121   -0.01097  0.00000 ] ± [ 0.18908   1.79441   0.01032   0.00938  0.00000 ]
Pixel error:          err = [ 0.28284   0.62366 ]
Figure 4 Grid corner extraction for re-calibration
Next, we can see that the error is smaller than in the first calibration results:

Calibration results after optimization (with uncertainties):
Focal Length:          fc = [ 581.12084   581.34563 ] ± [ 10.53285   9.09579 ]
Principal point:       cc = [ 281.39427   253.27383 ] ± [ 16.64980   15.86515 ]
Skew:             alpha_c = [ 0.00000 ] ± [ 0.00000  ]   => angle of pixel axes = 90.00000 ± 0.00000 degrees
Distortion:            kc = [ -0.26651   0.92108   0.00301   -0.01151  0.00000 ] ± [ 0.13097   1.44999   0.00626   0.00590  0.00000 ]
Pixel error:          err = [ 0.22538   0.35812 ]

Repeating this procedure until an acceptable result is achieved gives:

Calibration results after optimization (with uncertainties):
Focal Length:          fc = [ 579.39816   580.78490 ] ± [ 9.58715   8.48105 ]
Principal point:       cc = [ 279.82800   257.11149 ] ± [ 14.13074   12.45446 ]
Skew:             alpha_c = [ 0.00000 ] ± [ 0.00000  ]   => angle of pixel axes = 90.00000 ± 0.00000 degrees
Distortion:            kc = [ -0.32611   1.61254   0.00131   -0.01085  0.00000 ] ± [ 0.10937   1.22361   0.00600   0.00512  0.00000 ]
Pixel error:          err = [ 0.21153   0.29514 ]

For the right camera, the procedure is exactly the same, but note that the extracted corners in the left camera images must correspond to the extracted corners in the right camera images. Therefore, we need to be careful about the selection order when choosing the boundary of the calibration pattern in the images captured by the right camera.

Finally, the extrinsic parameters, in other words the relative position and orientation between the two IR cameras, can be computed from the individual calibration results. The extrinsic results are:


Extrinsic parameters (position of right camera wrt left camera):
Rotation vector:             om = [ 0.12700   1.83078  2.45355 ]
Translation vector:           T = [ 37.16528   -993.13499  676.55744 ]


The om vector is a rotation vector, which can be converted to a rotation matrix by Rodrigues' rotation formula. The three elements of the translation vector are the rigid translation in the x, y and z directions, in millimetres (mm). From Figure 5 we can visualise the relationship between the left camera, the right camera and the calibration pattern.
Say we have a point in the world and know its coordinate XL in the left camera reference frame; its coordinate XR in the right camera reference frame can then be calculated by:
XR = R * XL + T
Figure 5 Extrinsic parameters of stereo calibration
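As a sketch of how these values could be used in code (cv::Rodrigues converts the rotation vector to a matrix; the om and T values are the ones reported above, and the example point is an arbitrary value in millimetres):

#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/calib3d/calib3d.hpp>

int main()
{
 // Rotation and translation vectors (mm) reported by the stereo calibration.
 cv::Mat om = (cv::Mat_<double>(3, 1) << 0.12700, 1.83078, 2.45355);
 cv::Mat T  = (cv::Mat_<double>(3, 1) << 37.16528, -993.13499, 676.55744);

 // Rodrigues' formula: rotation vector -> 3x3 rotation matrix.
 cv::Mat R;
 cv::Rodrigues(om, R);

 // A point XL in the left camera frame mapped into the right camera frame.
 cv::Mat XL = (cv::Mat_<double>(3, 1) << 100.0, 200.0, 1500.0); // arbitrary point, mm
 cv::Mat XR = R * XL + T;

 std::cout << "R = " << R << "\nXR = " << XR << std::endl;
 return 0;
}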

18 February 2013

Kinect Coordinate Space

A Kinect provides colour, depth and skeleton frames at the same time. This post introduces the coordinate space of each frame; in a later post, I will use the SDK to transform one frame into another, for example aligning the depth frame with the colour frame and creating coloured point clouds, because the Kinect SDK provides an accurate mapping of depth data into 3D space (or "skeleton space", in the parlance of this particular SDK).

Colour space (2D)
A colour frame is composed of pixels. Each pixel consists of 4 channels - red, green, blue and alpha (transparency). Every pixel has a particular location (x, y) in the colour frame's coordinate system.

Depth space (2D)
Each pixel in the depth frame contains the Cartesian distance, in millimetres, from the depth camera plane to the nearest object at that particular (x, y) coordinate, as shown in the figure. The (x, y) coordinates of a depth frame do not represent physical units in the room but the location of a pixel in the depth frame.
Depth stream values
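As a small sketch of how a raw depth value is unpacked (this assumes the packed 16-bit format with a 3-bit player index; NUI_IMAGE_PLAYER_INDEX_SHIFT is the SDK constant for that shift):

// When the depth stream carries player-index information, each 16-bit value
// packs the depth (mm) in its upper 13 bits and the player index in its lower
// 3 bits; the distance is recovered by dropping the player-index bits.
inline USHORT depthInMillimetres(USHORT packedPixel)
{
 return packedPixel >> NUI_IMAGE_PLAYER_INDEX_SHIFT;   // shift of 3 bits
}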
Skeleton Space (3D)
For each frame, the depth image captured by the Kinect can be converted into skeleton data, which contains 3D position data for human skeletons. A 3D position in skeleton space is represented as (x, y, z), as shown in the figure. The origin of the coordinate system is placed at the Kinect depth camera, and it is a right-handed coordinate system. More specifically, the positive y-axis extends upward, the positive x-axis extends to the left of the Kinect, and the positive z-axis points in the direction the Kinect is looking.
Skeleton space

30 January 2013

Iterative Closest Points (ICP)

From the depth images captured by the IR camera, we can derive point clouds which show the 3D shape of the target. An example point cloud is shown in Figure 1. A registration problem arises when we have two point clouds of the same scene captured by different cameras from different points of view. The overall goal is to find a transformation which brings one point cloud as close to the other as possible. The most common method is called Iterative Closest Point (ICP), and many variants of it have been developed.

Figure 1 Point clouds produced from Kinect depth data
We will call the first set of points the model points and the other set the data points. The model point set is denoted Q = {q1, q2, q3,...,qn} and the data point set P = {p1, p2, p3,...,pn}. We can define an objective function (error metric) E to evaluate the distance between the two point sets. A popular metric is the sum of squared errors:
E(τ) = Σi || τ(pi) - qi ||²
which is the sum of squared distances from points in one cloud to their nearest neighbours qi in the other after the transformation τ.
There are various kinds of transformation we can apply, rigid or non-rigid, depending on the task. Here we use only a rotation (R) and a translation (T), since the scene (patient) doesn't deform much and this keeps the calculation realistic and simple. Therefore, the objective function can be rewritten as:
E(R, T) = Σi || R pi + T - qi ||²
Please keep in mind that the correspondences between model and data points are unknown. To this end, ICP approximates the correspondence using the nearest neighbour before estimating the transformation τ. ICP performs the following steps iteratively:
  1. Matching: for every data point the nearest neighbour in the model point set is found.
  2. Minimization: the objective function is minimised and the transformation parameters are estimated.
  3. Transform the data points using the estimated transformation.
  4. Iterate (re-match the transformed data points with the model points).
The algorithm terminates after a predefined number of iterations or when the relative change of the objective function becomes sufficiently small. A minimal sketch of one point-to-point ICP implementation is shown below.
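This is only a rough sketch, assuming Eigen is available, using brute-force nearest-neighbour matching and the closed-form SVD (Kabsch) solution for the rigid transform at each iteration; a practical implementation would use a k-d tree and one of the robust variants mentioned above:

#include <limits>
#include <vector>
#include <Eigen/Dense>
#include <Eigen/SVD>

// Rigid registration of data points P onto model points Q.
// On return, R * p + t maps a data point approximately onto its matched model point.
void icpPointToPoint(const std::vector<Eigen::Vector3d>& P,
                     const std::vector<Eigen::Vector3d>& Q,
                     Eigen::Matrix3d& R, Eigen::Vector3d& t,
                     int maxIterations = 30)
{
 R = Eigen::Matrix3d::Identity();
 t = Eigen::Vector3d::Zero();
 std::vector<Eigen::Vector3d> X = P;   // data points, updated each iteration

 for (int iter = 0; iter < maxIterations; ++iter) {
  // 1. Matching: brute-force nearest neighbour in Q for every point in X.
  std::vector<Eigen::Vector3d> matched(X.size());
  for (size_t i = 0; i < X.size(); ++i) {
   double best = std::numeric_limits<double>::max();
   for (size_t j = 0; j < Q.size(); ++j) {
    double d = (X[i] - Q[j]).squaredNorm();
    if (d < best) { best = d; matched[i] = Q[j]; }
   }
  }

  // 2. Minimisation: closed-form rigid transform via centroids and SVD (Kabsch).
  Eigen::Vector3d cx = Eigen::Vector3d::Zero(), cq = Eigen::Vector3d::Zero();
  for (size_t i = 0; i < X.size(); ++i) { cx += X[i]; cq += matched[i]; }
  cx /= X.size(); cq /= X.size();

  Eigen::Matrix3d H = Eigen::Matrix3d::Zero();
  for (size_t i = 0; i < X.size(); ++i)
   H += (X[i] - cx) * (matched[i] - cq).transpose();

  Eigen::JacobiSVD<Eigen::Matrix3d> svd(H, Eigen::ComputeFullU | Eigen::ComputeFullV);
  Eigen::Matrix3d Ri = svd.matrixV() * svd.matrixU().transpose();
  if (Ri.determinant() < 0) {           // fix a reflection if one appears
   Eigen::Matrix3d V = svd.matrixV();
   V.col(2) *= -1.0;
   Ri = V * svd.matrixU().transpose();
  }
  Eigen::Vector3d ti = cq - Ri * cx;

  // 3. Transform the data points and accumulate the overall transform.
  for (size_t i = 0; i < X.size(); ++i)
   X[i] = Ri * X[i] + ti;
  R = Ri * R;
  t = Ri * t + ti;
  // 4. Iterate; a real implementation would also monitor the change in error.
 }
}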

Reference:
Hans Kjer and Jakob Wilm, 2010. Evaluation of surface registration algorithms for PET motion correction. Bachelor thesis, Technical University of Denmark.