Research Information - Fall 2005

Large-scale Content Distribution Systems

By Terence Chan

The rapid growth of the Internet is probably one of the most important technological advances of recent decades. Internet applications are countless and span commerce, education, entertainment, governance, and more. For example, the Internet is used for rapid distribution of software packages and security updates, for webcasting popular multimedia clips to a flash crowd of subscribers, and for replicating a multinational corporation's file stores or databases at geographically separated sites. In most cases, the key use of the Internet involves disseminating large amounts of data to huge groups of users over a wide area, possibly spanning the globe. A well-known real-world example is the BitTorrent (BT) system, which has been used for distributing Linux operating system packages and for sharing audio and video files.

Roughly speaking, the way a piece of data is transmitted from one point to another (e.g., the transmission of security updates to a user) is very similar to trucks traveling across a transportation network of superhighways connecting city to city. Data travels across a computer network composed of a huge number of intermediate machines (routers/switches) interconnected by communication channels/links such as optical fibres. Like a highway, each communication link has a capacity that limits the maximum amount of data that can be sent along it per second. Hence, the main challenge is to design a reliable content distribution system that uses these communication links efficiently.

Basically, there are three approaches to transmitting a piece of data to a group of users.

1. Unicast - This is the simplest approach, in which each user independently sets up a "path/connection" to the content distribution server, and the requested piece of data travels along that path to reach the user. Therefore, if a single piece of data (e.g., a software package) is requested by one hundred users, the distribution server will send the same piece of data one hundred times, once along each of the paths independently set up by the users. As a result, if two paths share the same communication link, the piece of data must be sent along that link twice, which wastes transmission capacity on duplicate information. Consequently, this approach is not scalable, i.e., it is not suitable when the number of users is huge.

2. Tree-based routing - In the second approach, communication links are used more efficiently by minimizing the transmission of repeated information. For example, instead of sending multiple copies of a piece of data along a link, one can send a single copy and replicate it at the end of the link. Essentially, in this approach, the content distribution server creates a distribution/multicast tree connecting the server to all the users. Given such a tree, one can always find a unique path connecting the content distribution server to any given user. Using that path, the data can then be transmitted to the users as requested.

3. Network coding - In the tree-based routing approach, the piece of information routed along the distribution tree is unchanged during the whole transmission process: the intermediate machines only replicate the received data and forward it to subsequent links. However, as data is only a sequence of bits, it can be modified/processed arbitrarily. In the network coding approach, the transmitted data can be modified by intermediate machines (according to a network code) during the transmission process. Allowing this can further improve the utilization efficiency of the network. In certain scenarios, the efficiency gain over the tree-based routing approach can be tremendous.

There are a tremendous number of interesting and challenging research problems in the area of designing large-scale content distribution systems. If you are interested in knowing more details, please feel free to contact Terence Chan at
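As a rough illustration of the three approaches described in this article, the Python sketch below counts how many copies of a file each link carries under unicast and under tree-based routing, and then shows network coding on the standard "butterfly" textbook network. The toy topology, the node names (S, R, A, B) and the function names are assumptions made up for this example; they do not describe any particular deployed system.

```python
from collections import Counter

# Hypothetical toy network: server S, one intermediate router R, users A and B.
# Each user's path is the list of directed links its data crosses.
paths = {
    "A": [("S", "R"), ("R", "A")],
    "B": [("S", "R"), ("R", "B")],
}

def unicast_load(paths):
    """Copies carried per link when every user gets an independent stream."""
    return Counter(link for path in paths.values() for link in path)

def tree_load(paths):
    """Copies carried per link when a shared link carries a single copy
    that is replicated at the end of the link (tree-based routing)."""
    return Counter({link: 1 for path in paths.values() for link in path})

def butterfly(a, b):
    """Network coding on the butterfly network (illustrative assumption).

    The source holds two bits a and b, and two receivers each want both.
    Receiver 1 gets a directly, receiver 2 gets b directly, and a single
    shared middle link can carry only one bit per time unit. Instead of
    forwarding a or b unchanged, the middle node forwards a XOR b, and
    each receiver recovers its missing bit with one more XOR.
    """
    coded = a ^ b                      # bit sent on the shared middle link
    receiver1 = (a, a ^ coded)         # has a, recovers b
    receiver2 = (b ^ coded, b)         # has b, recovers a
    return receiver1, receiver2

if __name__ == "__main__":
    print("unicast:", dict(unicast_load(paths)))
    print("tree:   ", dict(tree_load(paths)))
    print("coding: ", butterfly(1, 0))
```

In this sketch the shared link from S to R carries two copies under unicast but only one under tree-based routing, and in the butterfly case both receivers obtain both bits even though the shared middle link carries a single coded bit.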

Bioinformatics and Medical Data Analysis - Fall 2005

By Dominic Slezak

The enormous growth of data sets is significantly slowing down the discovery process in biology and biomedicine. Researchers are confronted with an increasing need to develop tools to process and manage vast amounts of biological data. In medicine, manual methods are no longer efficient; they take a long time, are usually very expensive, and often consider only selected, specific cases. Currently, the average drug takes approximately ten years to go from the discovery phase to the clinic and costs a company from about $400 million to more than $1 billion to develop. Moreover, the data sets to be analyzed for the most common diseases, such as Alzheimer's, cancer, or obesity, are so complex that they cannot be processed in the "classical" way.

The discipline concerned with the above challenges is bioinformatics. Its main goal is to develop solutions for gathering, storing, analyzing, and integrating biological and genetic data, and for representing all this information efficiently. Bioinformatics is considered to be the infrastructure of molecular biology, which is known to be crucial to the whole field of biology. Nowadays, bioinformatics is used in pharmaceutical companies at every stage of the drug discovery and development process.

As an example of development in bioinformatics, we consider uncovering disease-specific patterns in brain structures. Currently available tools are becoming able to model how the brain grows, detect abnormalities, and visualize how genes, medication, and demographic factors affect them. Image analysis methods can also identify and monitor systematic patterns of altered anatomy in Alzheimer's, Parkinson's, and Lyme diseases, as well as tumour growth, epilepsy, and multiple sclerosis, and even some psychiatric disorders such as schizophrenia, depression, autism, and dyslexia. One of the most popular and least harmful techniques for obtaining brain images is magnetic resonance imaging (MRI). MRI scans create more than just static images of the brain's structure: they also make it possible to generate functional MRIs (fMRI) and observe the brain's cognitive activities. Image-based brain modeling requires computerized image processing methods to help analyze the gathered data. We are particularly interested in brain tissue type identification, as the current acquisition processes are insufficient for successfully differentiating between normal and pathologic cases. More precisely, we consider the data segmentation task - assigning the right labels to pixels (called voxels in 3D data sets), where the labels for brain images correspond to tissue types. Relative distributions of brain tissues, and changes in them, allow us to diagnose many specific diseases. Automated segmentation methods also provide a useful adjunct to clinical radiology.
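To make the segmentation task concrete, the following Python sketch labels each voxel of a tiny 3D volume and then computes the relative distribution of tissue types. It is a deliberately naive illustration: the intensity thresholds, the tissue names, and the function names are hypothetical, and real MRI segmentation needs far richer attributes than a single intensity cut-off.

```python
from collections import Counter

# Hypothetical intensity cut-offs, invented for this example only.
CSF_MAX = 50
GRAY_MAX = 120

def label_voxel(intensity):
    """Assign a tissue label to one voxel from its intensity (naive rule)."""
    if intensity < CSF_MAX:
        return "csf"
    if intensity < GRAY_MAX:
        return "gray"
    return "white"

def tissue_fractions(volume):
    """Relative distribution of tissue labels over a 3D volume,
    given as nested lists indexed [slice][row][column]."""
    labels = [label_voxel(v) for plane in volume for row in plane for v in row]
    counts = Counter(labels)
    return {tissue: n / len(labels) for tissue, n in counts.items()}

if __name__ == "__main__":
    # A made-up 2 x 2 x 2 volume of voxel intensities.
    volume = [
        [[10, 60], [130, 70]],
        [[20, 140], [90, 100]],
    ]
    print(tissue_fractions(volume))
```

Comparing such fractions between scans is one simple way to express the "relative distributions and changes" mentioned above, under the assumption that the labels themselves are reliable.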

Among the many popular MRI segmentation methods - based on cluster analysis, neural networks, etc. - we focus especially on the approach based on the theory of rough sets, which provides a powerful methodology for constructing classification and prediction models. The basic idea is to build approximations of the data types (brain tissues, in this particular case) by means of the available attributes (features, dimensions). Extracting potentially relevant attributes from images relies, e.g., on edge detection and magnitude histogram clustering. The process of selecting and reducing the attributes to those that provide the clearest and most useful segmentation procedures is supported by AI-based methods such as hybrid genetic algorithms. Using the best-found sets of attributes - called rough set reducts - we generate "if...then..." rules that classify voxels in new brain images.
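The rough set notions used above (approximations, reducts, and "if...then..." rules) can be sketched in Python on a toy decision table. Everything in the table is invented for illustration: the attribute names, their values, and the tissue labels are assumptions. The reducts are found here by exhaustive search over attribute subsets, which is only feasible for tiny tables; it stands in for, and is not, the hybrid genetic algorithms mentioned in the text.

```python
from itertools import combinations

# Hypothetical decision table: one row per voxel (all values are made up).
voxels = [
    {"intensity": "low",  "edge": "no",  "cluster": 0, "tissue": "csf"},
    {"intensity": "low",  "edge": "yes", "cluster": 0, "tissue": "csf"},
    {"intensity": "mid",  "edge": "no",  "cluster": 1, "tissue": "gray"},
    {"intensity": "mid",  "edge": "yes", "cluster": 1, "tissue": "gray"},
    {"intensity": "high", "edge": "no",  "cluster": 2, "tissue": "white"},
    {"intensity": "high", "edge": "yes", "cluster": 2, "tissue": "gray"},
]
ATTRS = ("intensity", "edge", "cluster")

def blocks(rows, attrs):
    """Group row indices that are indiscernible on the given attributes."""
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(tuple(row[a] for a in attrs), []).append(i)
    return list(groups.values())

def approximations(rows, attrs, label):
    """Lower and upper approximation of the set of rows with this label."""
    target = {i for i, r in enumerate(rows) if r["tissue"] == label}
    lower, upper = set(), set()
    for block in map(set, blocks(rows, attrs)):
        if block <= target:
            lower |= block      # block certainly belongs to the label
        if block & target:
            upper |= block      # block possibly belongs to the label
    return lower, upper

def positive_region(rows, attrs):
    """Rows whose label is determined unambiguously by attrs."""
    labels = {r["tissue"] for r in rows}
    return set().union(*(approximations(rows, attrs, l)[0] for l in labels))

def reducts(rows, attrs):
    """Minimal attribute subsets preserving the full positive region."""
    full, found = positive_region(rows, attrs), []
    for k in range(1, len(attrs) + 1):
        for subset in combinations(attrs, k):
            minimal = not any(set(f) <= set(subset) for f in found)
            if minimal and positive_region(rows, subset) == full:
                found.append(subset)
    return found

def rules(rows, attrs):
    """'if attrs == values then tissue' rules from unambiguous blocks."""
    out = {}
    for block in blocks(rows, attrs):
        labels = {rows[i]["tissue"] for i in block}
        if len(labels) == 1:
            out[tuple(rows[block[0]][a] for a in attrs)] = labels.pop()
    return out

if __name__ == "__main__":
    print(approximations(voxels, ("intensity",), "gray"))
    for reduct in reducts(voxels, ATTRS):
        print(reduct, rules(voxels, reduct))
```

In this toy table, intensity alone leaves the "high" voxels ambiguous, so the lower and upper approximations of the gray class differ; adding the edge attribute removes that ambiguity, which is what makes the pair a reduct.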

The approach described above is an example of how data analysis techniques can contribute to bioinformatics and biomedicine. Good results of experiments performed not only for MRI segmentation but also for other tasks, such as mining gene expression data, motivate further research in these important areas. If you are interested, please send an e-mail to:
