Tuesday, November 23, 2010

Optimization -05

Tips for Optimizing C++ Code

12. Careful when declaring C++ object variables.
  • Use initialization instead of assignment (Color c(black); is faster than Color c; c = black;).

13. Make default class constructors as lightweight as possible.
  • Particularly for simple, frequently used classes (e.g., color, vector, point, etc.) that are manipulated frequently.
  • These default constructors are often called behind your back, where you are not expecting it.
  • Use constructor initializer lists. (Use Color::Color() : r(0), g(0), b(0) {} rather than Color::Color() { r = g = b = 0; } .)

16. For most classes, use the operators +=, -=, *=, and /=, instead of the operators +, -, *, and /.
  • The simple operations need to create an unnamed, temporary intermediate object.
  • For instance: Vector v = Vector(1,0,0) + Vector(0,1,0) + Vector(0,0,1); creates five unnamed, temporary Vectors: Vector(1,0,0), Vector(0,1,0), Vector(0,0,1), Vector(1,0,0) + Vector(0,1,0), and Vector(1,0,0) + Vector(0,1,0) + Vector(0,0,1).
  • The slightly more verbose code: Vector v(1,0,0); v+= Vector(0,1,0); v+= Vector(0,0,1); only creates two temporary Vectors: Vector(0,1,0) and Vector(0,0,1). This saves 6 functions calls (3 constructors and 3 destructors).

18. Delay declaring local variables.
  • Declaring object variable always involves a function call (to the constructor).
  • If a variable is only needed sometimes (e.g., inside an if statement) only declare when necessary, so the constructor is only called if the variable will be used.

19. For objects, use the prefix operator (++obj) instead of the postfix operator (obj++).
  • This probably will not be an issue in your ray tracer.
  • A copy of the object must be made with the postfix operator (which thus involves an extra call the the constructor and destructor), whereas the prefix operator does not need a temporary copy.

20. Careful using templates.
  • Optimizations for various instantiations may need to be different!
  • The standard template library is reasonably well optimized, but I would avoid using it if you plan to implement an interactive ray tracer.
  • Why? By implementing it yourself, you'll know the algorithms it uses, so you will know the most efficient way to use the code.
  • More importantly, my experience is that debug compiles of STL libraries are slow. Normally this isn't a problem, except you will be using debug versions for profiling. You'll find STL constructors, iterators, etc. use 15+% of your run time, which can make reading the profile output more confusing.


Optimization -04

Tips for Optimizing C Code

1. Remember Ahmdal's Law:
  • Where funccost is the percentage of the program runtime used by the function func, and funcspeedup is the factor by which you speedup the function.
  • Thus, if you optimize the function func(), which is 40% of the runtime, so that it runs twice as fast, your program will run 25% faster .
  • This means infrequently used code (e.g., the scene loader) probably should be optimized little (if at all).

2. First Code for correctness, then optimize!
  • This does not mean write a fully functional ray tracer for 8 weeks, then optimize for 8 weeks!
  • Perform optimizations on your ray tracer in multiple steps.
  • Write for correctness, and then if you know the function will be called frequently, perform obvious optimizations.
  • Then profile to find bottlenecks, and remove the bottlenecks (by optimization or by improving the algorithm). Often improving the algorithm drastically changes the bottleneck – perhaps to a function you might not expect. This is a good reason to perform obvious optimizations on all functions you know will be frequently used.

3. People I know who write very efficient code say- they spend at least twice as long optimizing code as they spend writing code.

4. Jumps/branches are expensive. Minimize their use whenever possible.
  • Function calls require two jumps, in addition to stack memory manipulation.
  • Prefer iteration over recursion.
  • Use inline functions for short functions to eliminate function overhead.
  • Move loops inside function calls
    e.g., change for(i=0;i<100;i++){ ... } DoSomething();
    into DoSomething() { for(i=0;i<100;i++) { ... } } .
  • Long if...else if...else if...else if... chains require lots of jumps for cases near the end of the chain (in addition to testing each condition). If possible, convert to a switch statement, which the compiler sometimes optimizes into a table lookup with a single jump. If a switch statement is not possible, put the most common clauses at the beginning of the if chain.

5. Think about the order of array indices.
  • Two and higher dimensional arrays are still stored in one dimensional memory. This means (for C/C++ arrays) array[i][j] and array[i][j+1] are adjacent to each other, whereas array[i][j] and array[i+1][j] may be arbitrarily far apart.
  • Accessing data in a more-or-less sequential fashion, as stored in physical memory, can dramatically speed up your code (sometimes by an order of magnitude, or more)!
  • When modern CPUs load data from main memory into processor cache, they fetch more than a single value. Instead they fetch a block of memory containing the requested data and adjacent data (a cache line). This means after array[i][j] is in the CPU cache, array[i][j+1] has a good chance of already being in cache, whereas array[i+1][j] is likely to still be in main memory.

6. Think about instruction-level-parallelism.
  • Even though many applications still rely on single threaded execution, modern CPUs already have a significant amount of parallelism inside a single core. This means a single CPU might be simultaneously executing 4 floating point multiplies, waiting for 4 memory requests, and performing a comparison for an upcoming branch.
  • To make the most of this parallelism, blocks of code (i.e., between jumps) need to have enough independent instructions to allow the CPU to be fully utilized.
  • Think about unrolling loops to improve this.
  • This is also a good reason to use inline functions.

7. Avoid/reduce the number of local variables.
  • Local variables are normally stored on the stack. However if there are few enough, they can instead be stored in registers. In this case, the function not only gets the benefit of the faster memory access of data stored in registers, but the function avoids the overhead of setting up a stack frame.
  • (Do not, however, switch wholesale to global variables!)
8. Reduce the number of function parameters.
  • For the same reason as reducing local variables – they are also stored on the stack.

9. Pass structures by reference, not by value.
  • I know of no case in a ray tracer where structures should be passed by value (even simple ones like Vectors, Points, and Colors).

10. If you do not need a return value from a function, do not define one.

11. Try to avoid casting where possible.
  • Integer and floating point instructions often operate on different registers, so a cast requires a copy.
  • Shorter integer types (char and short) still require the use of a full-sized register, and they need to be padded to 32/64-bits and then converted back to the smaller size before storing back in memory. (However, this cost must be weighed against the additional memory cost of a larger data type.)

12. Careful when declaring C variables.
  • Use initialization instead of assignment (Color c = black); is faster than Color c; c = black ;).

14. Use shift operations >> and << instead of integer multiplication and division, where possible.

15. Careful using table-lookup functions.
  • Many people encourage using tables of precomputed values for complex functions (e.g., trigonometric functions). For ray tracing, this is often unnecessary. Memory lookups are exceedingly (and increasingly) expensive, and it is often as fast to recompute a trigonometric function as it is to retrieve the value from memory (especially when you consider the trig lookup pollutes the CPU cache).
  • In other instances, lookup tables may be quite useful. For GPU programming, table lookups are often preferred for complex functions.

17. For basic data types, use the operators +, -, *, and / instead of the operators +=, -=, *=, and /=.

21. Avoid dynamic memory allocation during computation.
  • Dynamic memory is great for storing the scene and other data that does not change during computation.
  • However, on many (most) systems dynamic memory allocation requires the use of locks to control an access to the allocator. For multi-threaded applications that use dynamic memory, you may actually get a slowdown by adding additional processors, due to the wait to allocate and free memory!
  • Even for single threaded applications, allocating memory on the heap is more expensive than adding it on the stack. The operating system needs to perform some computation to find a memory block of the requisite size.

22. Find and utilize information about your system's memory cache.
  • If a data structure fits in a single cache line, only a single fetch from main memory is required to process the entire class.
  • Make sure all data structures are aligned to cache line boundaries. (If both your data structure and a cache line are 128 bytes, you will still have poor performance if 1 byte of your structure is in once cache line and the other 127 bytes are in a second cache line).

23. Avoid unnecessary data initialization.
  • If you must initialize a large chunk of memory, consider using memset().

24. Try to early loop termination and early function returns.
  • Consider intersecting a ray and a triangle. The "common case" is that the ray will miss the triangle. Thus, this should be optimized for.
  • If you decide to intersect the ray with the triangle plane, you can immediately return if the t value ray-plane intersection is negative. This allows you to skip the barycentric coordinate computation in roughly half of the ray-triangle intersections. A big win! As soon as you know no intersection occurs, the intersection function should quit.
  • Similarly, some loops can be terminated early. For instance, when shooting shadow rays, the location of the nearest intersection is unnecessary. As soon as any occluding intersection is found, the intersection routine can return.

25. Simplify your equations on paper!
  • In many equations, terms cancel out... either always or in some special cases.
  • The compiler cannot find these simplifications, but you can. Eliminating a few expensive operations inside an inner loop can speed your program more than days working on other parts.

26. The difference between math on integers, fixed points, 32-bit floats, and 64-bit doubles is not as big as you might think.
  • On modern CPUs, floating-point operations have essentially the same throughput as integer operations. In compute-intensive programs like ray tracing, this leads to a negligible difference between integer and floating-point costs. This means, you should not go out of your way to use integer operations.
  • Double precision floating-point operations may not be slower than single precision floats, particularly on 64-bit machines. I have seen ray tracers run faster using all doubles than all floats on the same machine. I have also seen the reverse.

27. Consider ways of rephrasing your math to eliminate expensive operations.
  • sqrt() can often be avoided, especially in comparisons where comparing the value squared gives the same result.
  • If you repeatedly divide by x, consider computing 1/x and multiplying by the result. This used to be a big win for vector normalizations (3 divides), but I've recently found it's now a toss-up. However, it should still be beneficial if you do more than 3 divides.
  • If you perform a loop, make sure computations that do not change between iterations are pulled out of the loop.
  • Consider if you can compute values in a loop incrementally (instead of computing from scratch each iteration).


Monday, November 15, 2010

Socket Programming -01

Socket Programming - Server (Rx)

#include <windows.h>
#include <winsock2.h>
#include <ws2tcpip.h>

#include <stdlib.h>
#include <stdio.h>
#include <conio.h>

// Need to link with Ws2_32.lib, Mswsock.lib, and Advapi32.lib
#pragma comment (lib, "Ws2_32.lib")
#pragma comment (lib, "Mswsock.lib")
#pragma comment (lib, "AdvApi32.lib")

#define DEFAULT_BUFLEN 1452
#define DEFAULT_PORT "27015"

int main (void)
WSADATA wsaData;
    FILE *fp;
    SOCKET ListenSocket = INVALID_SOCKET, ClientSocket = INVALID_SOCKET;
    struct addrinfo *result = NULL, hints;
    char recvbuf[DEFAULT_BUFLEN];

    int iResult, iSendResult;
    int recvbuflen = DEFAULT_BUFLEN;

fp = fopen("receive.dat","wb");

// Initialize Winsock
iResult = WSAStartup(MAKEWORD(2,2), &wsaData);
if (iResult != 0) {
printf("WSAStartup failed with error: %d\n", iResult);

return 1;

ZeroMemory(&hints, sizeof(hints));
hints.ai_family = AF_INET;
hints.ai_socktype = SOCK_STREAM;
hints.ai_protocol = IPPROTO_TCP;
hints.ai_flags = AI_PASSIVE;

// Resolve the server address and port
iResult = getaddrinfo(NULL, DEFAULT_PORT, &hints, &result);
if ( iResult != 0 ) {
printf("getaddrinfo failed with error: %d\n", iResult);

return 1;

// Create a SOCKET for connecting to server
ListenSocket = socket(result->ai_family, result->ai_socktype, result->ai_protocol);
if (ListenSocket == INVALID_SOCKET) {
printf("socket failed with error: %ld\n", WSAGetLastError());

return 1;

// Setup the TCP listening socket
iResult = bind( ListenSocket, result->ai_addr, (int)result->ai_addrlen);
if (iResult == SOCKET_ERROR) {
printf("bind failed with error: %d\n", WSAGetLastError());

return 1;

iResult = listen(ListenSocket, SOMAXCONN);
if (iResult == SOCKET_ERROR) {
printf("listen failed with error: %d\n", WSAGetLastError());

return 1;

// Accept a client socket
ClientSocket = accept(ListenSocket, NULL, NULL);
if (ClientSocket == INVALID_SOCKET) {
printf("accept failed with error: %d\n", WSAGetLastError());

return 1;

// No longer need server socket

// Receive until the peer shuts down the connection
do {
iResult = recv(ClientSocket, recvbuf, recvbuflen, 0);

if (iResult > 0) {
printf("Bytes received: %d\n", iResult);

// Echo the buffer back to the sender
iSendResult = send( ClientSocket, recvbuf, iResult, 0 );

if (iSendResult == SOCKET_ERROR) {
printf("send failed with error: %d\n", WSAGetLastError());

return 1;
printf("Bytes sent: %d\n", iSendResult);
if (iResult == 0)
printf("Connection closing...\n");
else {
printf("recv failed with error: %d\n", WSAGetLastError());

return 1;
} while (iResult > 0);

// shutdown the connection since we're done
iResult = shutdown(ClientSocket, SD_SEND);

if (iResult == SOCKET_ERROR) {
printf("shutdown failed with error: %d\n", WSAGetLastError());

return 1;

// cleanup
printf("Sucessfully Received Data...\n");
return 0;


Tuesday, November 9, 2010

Matlab - 01

Working with Videos in MATLAB

MATLAB supports only "raw" (uncompressed) avi files on Linux and only some Indeo and Cinepack compressed versions on Windows. But mostly we get videos in MPEG. So we need to convert to raw or any supported format. For this, FFmpeg can be used.
AVI means Audio Video Interleave and is only a container format. AVI does not mean raw video. It is possible to have mpeg or many other compressed avi files. MATLAB cannot read such a compressed avi files. (See http://en.wikipedia.org/wiki/Audio_Video_Interleave).

1. Section: If we have raw video (or supported format)
If we have raw video, then handling in MATLAB is quite easy. We need to use functions mmreader, read, movie, mmfileinfo, frame2im, im2frame, aviread, avifile, aviinfo, addframe (avifile), close (avifile), movie2avi functions. Following code reads and plays back a given video. For further details and other functions see MATLAB help.

%Reads and plays back the movie file xylophone.mpg.
xyloObj = mmreader('xylophone.mpg');
nFrames = xyloObj.NumberOfFrames;
vidHeight = xyloObj.Height;
vidWidth = xyloObj.Width;

% Preallocate movie structure.
mov(1:nFrames) = struct('cdata', zeros(vidHeight, vidWidth, 3, 'uint8'),
'colormap', []);

% Read one frame at a time.
for k = 1 : nFrames
mov(k).cdata = read(xyloObj, k);

% Size a figure based on the video's width and height.
hf = figure;
set(hf, 'position', [150 150 vidWidth vidHeight])

% Play back the movie once at the video's frame rate.
movie(hf, mov, 1, xyloObj.FrameRate);

2. Section: If we have not supported video format
We can use any converter tool to convert compressed video to uncompressed avi format. One such tool is FFmpeg (http://ffmpeg.mplayerhq.hu/index.html). Mplayer, VLC and many other players are based on this codec.
If you are a Windows user, just download latest binary from http://ffdshow.faireal.net/mirror/ffmpeg/ (Any virus? I don't know. Download at your risk!). (You can use WinRAR to uncompress .7z - http://www.freedownloadscenter.com/Utilities/Compression_and_Zip_File_Utilities/WinRAR_Download.html )
Use this command to convert any compressed file (compressed.any) to uncompressed avi file (uncompressed.avi) (Note: pthreadGC2.dll file should be in the same directory as ffmpeg.exe).
ffmpeg.exe -i compressed.any -vcodec rawvideo uncompressed.avi
This should work most of the cases. However, sometimes it may not work (because, avi is a container format by Microsoft! :)). The converted video may have only blank frames or MATLAB may not be able to read it properly. In such a case, you can try steps explained in Section 3.

3. Section: If Section 2 does not work
Create a folder by name "images". Use this command to convert each frame of the compressed video to bmp (or ppm, jpg) images using ffmpeg or virtualDub and put it into the folder "images".
ffmpeg.exe -i compressed.any images/image_%d.bmp
Now you can use these images for processing. If you want uncompressed avi file, then use following MATLAB code to combine all raw images into a single uncompressed avi file.

%Script file to combine images to an uncompressed avi file
%Directory that contains images
in_dir = 'D:\temp\ffmpeg.rev11870\images\';
fout = 'D:\out.avi'; %Output file name
num_images = 341; %Number of images

%Set a suitable frame rate fps
aviobj = avifile(fout, 'compression', 'none', 'fps', 25);
for i = 1:num_images;
temp = sprintf('%d', i);
name = [in_dir, 'image_', temp, '.bmp']; %For ppm, change
img = imread(name);
frm = im2frame(img);
aviobj = addframe(aviobj,frm);
aviobj = close(aviobj);

4. Useful Functions
mmreader, read, movie, mmfileinfo, frame2im, im2frame, aviread, avifile, aviinfo, addframe (avifile), close (avifile), movie2avi

5. Supported File Formats
Supported File Formats
WindowsAVI (.avi),
MPEG-1 (
Motion JPEG 2000 (

Windows Media Video (.wmv, .asf, .asx),
and any format supported by Microsoft DirectShow.
MacintoshAVI (.avi),
MPEG-1 (
MPEG-4 (
.mp4, .m4v),
Motion JPEG 2000 (

Apple QuickTime Movie (.mov),
and any format supported by QuickTime as listed on http://www.apple.com/quicktime/player/specs.html.
LinuxMotion JPEG 2000 (.mj2),

Any format supported by your installed plug-ins for GStreamer 0.10 or above, as listed on http://gstreamer.freedesktop.org/documentation/plugins.html, including AVI (.avi) and Ogg Theora (.ogg).

Monday, November 8, 2010

OpenCV – 02

Working with Videos in OpenCV

#include <stdio.h>
#include <cv.h>
#include <highgui.h>

int main(void)

    /* Create an object that decodes the input video stream. */
    CvCapture *input_video = cvCaptureFromFile("VideoPlay_Demo.avi");
    if (input_video == NULL)
        /* Either the video didn't exist OR codec doesn't support */
        fprintf(stderr, "Error: Can't open video.\n");
        return -1;

    /* Read the video's frame size out of the AVI. */
    CvSize frame_size;
    frame_size.height = (int) cvGetCaptureProperty(input_video, CV_CAP_PROP_FRAME_HEIGHT);
    frame_size.width = (int) cvGetCaptureProperty(input_video, CV_CAP_PROP_FRAME_WIDTH);

    /* Determine the number of frames in the AVI. */
    long number_of_frames;
    number_of_frames = (int) cvGetCaptureProperty(input_video, CV_CAP_PROP_FRAME_COUNT);

    /* Create a windows called "VideoPlay_Demo" for output.
     * window automatically change its size to match the output. */
    cvNamedWindow("VideoPlay_Demo", CV_WINDOW_AUTOSIZE);

    long current_frame = 0;
        static IplImage *frame = NULL;
        /* Go to the frame we want */
        cvSetCaptureProperty( input_video, CV_CAP_PROP_POS_FRAMES, current_frame );

        /* Get the next frame of the video */
        frame = cvQueryFrame( input_video );
        if (frame == NULL)
            fprintf(stderr, "Error: Hmm. The end came sooner than we thought.\n");
            return -1;
        /* Now display the image */
        cvShowImage("VideoPlay_Demo", frame);
        /* And wait for. If the argument is 0 then it waits forever otherwise it waits that number of milliseconds */
        if (current_frame < 0)
            current_frame = 0;
        if (current_frame >= number_of_frames - 1)
            current_frame = number_of_frames - 2;