
Please be patient; this was written as I carried it out.

I apologise now: this has not really been written for the novice to Postgres or Linux, and I have taken extreme liberties in my writing and explanations. For instance, I hardly ever mention "VACUUM". I know what it is and how to use it, and I am assuming you do too, so I don't say when I vacuumed the database; just assume it always is. OK ;-).

Postgres must store its data somewhere, and luckily for us these are normal files (unlucky in some cases, but that is a more advanced topic). This means that we can move these files around our system and then provide soft links to them from the originating directory. Postgres continues to think that the file is held where it was originally. Enough chit chat, let's see an example.
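
The trick in outline is just this (the paths here are invented for illustration, and the database must be shut down while you shuffle files about):

        $ mv /some/pgsql/datafile /big/disk/datafile
        $ ln -s /big/disk/datafile /some/pgsql/datafile

The rest of this page is that trick applied at two levels of granularity: first the whole data directory, then individual table files.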

I have a database that is now several gigabytes in size, which meant that I had to start moving files around because the majority of my disks are only 20Gb in size and I have other things stored on them as well. I have just bought a 160Gb SATA drive for this purpose.

The first thing I did was create a filesystem on the disk. I am going to be using this disk for very large files, so the normal inode setting would be overkill and I can use the "largefile4" option of mke2fs; see the man page for what this means. I also created two partitions using fdisk, which is a great little tool.

mke2fs -b 4096 -j -T largefile4 /dev/hde1
mke2fs -b 4096 -j -T largefile4 /dev/hde2
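
The new filesystems then need mount points. Mine ends up under "/opt/oracle/oradata" as you will see in the df output below; something along these lines (the fstab entry is a sketch, adjust it to suit your own layout):

        $ mkdir -p /opt/oracle/oradata/hde1
        $ mount /dev/hde1 /opt/oracle/oradata/hde1
        # and a line like this in /etc/fstab so it survives a reboot
        /dev/hde1  /opt/oracle/oradata/hde1  ext3  defaults  1 2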

We now have a disk ready to have some of our Postgres files copied to it, and we can then use soft links to free up some space on the SCSI disk. Before we go any further I want to stress that every system is different. On mine there are limiting factors other than disk speed and raw performance: I am limited to 1Mb of bandwidth and I have a very limited budget, which means I am going to make the database as easy to maintain as possible, not necessarily as fast as possible. Please remember this when reading other material on the site. In nine out of ten cases performance is wrongly prioritised above other more important factors; see my comments on bzip2 and gzip.

Here is the output of "df -m"

        Filesystem           1M-blocks      Used Available Use% Mounted on
        /dev/hdc2                 3938       961      2777  26% /
        /dev/hdc1                   30         9        20  30% /boot
        /dev/sda1                 3937      3006       731  81% /home
        none                       505         0       504   0% /dev/shm
        /dev/hdc7                 1969        33      1836   2% /tmp
        /dev/hdc3                 3938      2177      1561  59% /usr
        /dev/hdc5                 3938      3108       630  84% /opt/links/hdc5
        /dev/hdb1                 9628      6955      2184  77% /opt/oracle
        /dev/hdb2                 9605      4884      4233  54% /opt/oracle/oradata/ide12
        /dev/sda2                 3937      1019      2719  28% /opt/oracle/oradata/scsi02
        /dev/sda5                 9365      8625       272  97% /opt/oracle/oradata/scsi03
        /dev/hde1               156319        33    148470   1% /opt/oracle/oradata/hde1

You can see straight away that I have one very large disk. This is not ideal for various reasons, but for reasons I have already explained (and some I will explain later) I want an easy-to-administer database; I do not require blistering speed at this time.

My original database was held in the "/usr/local/pgsql/" directory, and as far as Postgres is aware it still is. However, there are a few minor changes that I have made to make things a little more space efficient. For those with a keen eye, this snapshot was taken on 5 Oct 03 with over 35 million links, 5 million of which are unique. If we were to do the maths, that would leave "/dev/hdc3" completely full, so what have I done?

For a long time performance on my system was of absolutely no concern to me, and to a certain extent it still isn't; my box is a development and teaching platform and will remain as such. I am not a speed freak, nor do I desire to be one. I understand "the need for speed" but only when it is required. However, I do like to see things done in a timely and efficient manner, which led to the first change in the database.

A common thing I found myself doing when I first started using the database was to count how many links were being found. I had set my "shmmax" to about 64Mb, which is fine for most systems and small websites, so don't go changing it now if you have a smaller setting. I noticed this task starting to take longer and longer and decided to do something about it. At this point I really had two choices that would increase my performance with minimal outlay: I could either change to a faster disk or increase my shmmax setting. I decided that increasing shmmax would be the more costly option, because once that shared memory has been allocated most of my other programs can no longer use it. I also noticed that I had some space on my SCSI drive that was not being utilised, so I decided to move my "/usr/local/pgsql/data/" directory as follows. Before doing this I made a backup and turned the database off.
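
The move itself is nothing more exotic than the following sketch (your pg_ctl location and paths may differ, and it wants doing as a user that preserves the postgres ownership; root with a plain mv is fine):

        $ pg_ctl -D /usr/local/pgsql/data stop
        $ mv /usr/local/pgsql/data /opt/oracle/oradata/scsi03/data
        $ ln -s /opt/oracle/oradata/scsi03/data /usr/local/pgsql/data
        $ pg_ctl -D /usr/local/pgsql/data start

Afterwards the directory looks like this: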

        $ ls -la /usr/local/pgsql/
        total 32
        drwxr-xr-x    8 postgres postgres     4096 Sep 25 01:18 .
        drwxr-xr-x   17 root     root         4096 Sep 20 15:30 ..
        drwxr-xr-x    2 root     root         4096 May  1 12:30 bin
        lrwxrwxrwx    1 root     root           31 Sep 25 01:18 data -> /opt/oracle/oradata/scsi03/data
        drwxr-xr-x    3 root     root         4096 May  1 12:30 doc
        drwxr-xr-x    4 root     root         4096 May  1 12:30 include
        drwxr-xr-x    2 root     root         4096 May  1 12:30 lib
        drwxr-xr-x    4 root     root         4096 May  1 12:30 man
        drwxr-xr-x    2 root     root         4096 Aug 13 12:39 share

This freed up a lot of space on the "/dev/hdc3" disk. I knew that at some point I would need to carry out this operation again, only a little more fine grained, so I kept an eye on the disk usage.

To see what files are using the most disk space in the database we can do the following.

select relname, relfilenode, relpages from pg_class order by relpages DESC limit 10;

Remember that relpages is not always accurate (it is only refreshed by commands like VACUUM and ANALYZE), but it will tell you which files are taking up the most room. With this ammo we can then think about where to put our files to utilise our space more efficiently.
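
To turn a relfilenode into a file on disk you also need the OID of the database, because that names the subdirectory under "data/base". A sketch, with made-up numbers:

        $ psql -c "select oid, datname from pg_database;" template1
        $ ls -l /usr/local/pgsql/data/base/16384/16401*

Bear in mind that a table bigger than 1Gb is split into segments named 16401.1, 16401.2 and so on, each a separate file.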

Due to the way I run the robots, there is only one table that gets hit hard at any one time. This is because I do not have enough disks to separate all the tables and indexes required to run the database. It also means that I can store the two tables quite easily on the same disk without performance being affected, and that the indexes can reside on the SCSI disk.
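
Moving a single table is the same symlink trick one level down, using the relfilenode found above (the numbers are hypothetical, the database must be stopped, and any 1Gb segments need moving and linking as well):

        $ pg_ctl -D /usr/local/pgsql/data stop
        $ mv /usr/local/pgsql/data/base/16384/16401 /opt/oracle/oradata/hde1/16401
        $ ln -s /opt/oracle/oradata/hde1/16401 /usr/local/pgsql/data/base/16384/16401
        $ pg_ctl -D /usr/local/pgsql/data start

One caveat: anything that rewrites the table, CLUSTER for example, creates a fresh file under the original path and quietly abandons your link.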

The one disk that I need to avoid using is /dev/hdc. This is an old Samsung SV2044D with a small cache and a crap seek time. I am also aware that it is getting on in years, so I intend to move my operating system off it in the near future and possibly use it for storing tables I create from the data, but little else.