Welcome to the Disruptive Library Technology Jester. From here you can browse the musings and visions of a library technologist as he walks the fine line between the best of the library profession on one side and the best of technology on the other.

You can navigate through DLTJ several ways. Your first stop might be the introductory material about this blog and the jester himself under the "about" heading to the left. Another way would be to pick a facet below to browse: "by category" for a rough categorization of postings, "by tags" for a finer granularity of topics, or "by date" for a chronological view. Third, use the search box in the left column as a keyword approach to content in DLTJ. And last, recent postings by the Jester can be found below the faceted list.

I hope you enjoy your visit. Please feel free to leave comments where you'd like or contact me directly.


Recent Posts

Killing Off Runaway Apache Processes

Well, something is still going wrong on dltj.org — despite previous performance tuning efforts, I’m still running into cases where machine performance grinds to a halt. In debugging it a bit further, I’ve found that the root cause is an apache httpd process which wants to consume nearly all of real memory which then causes the rest of the machine to thrash horribly. The problem is that I haven’t figured out what is causing that one thread to want to consume so much RAM — nothing unusual appears in either the access or the error logs and I haven’t figured out a way to debug a running apache thread. (Suggestions anyone?)

In any case, I whipped up this little ditty that is running every five minutes in cron as a way to gloss over the problem for the moment. Running as root, it looks into all of the processes in the virtual /proc file system, specifically in the 'stat' file, and using awk looks to see if the second space-delimited value is the name of the httpd process (this is the Gentoo Linux distribution, so the name of the process is apache2) and the 23rd space-delimited value (the virtual size of the process, in bytes) is bigger than 800MB. If so, it prints out the PID of the process (the first value in the stat file), at which point the bash script unceremoniously sends it a kill ('-9') signal. The script looks like this:

Code (bash)
#!/bin/bash

# Look at every numeric directory in /proc (one per running process).
for dir in /proc/[0-9]*; do
        if [ -f "$dir/stat" ]; then
                # Field 2 is the process name, field 23 the virtual size in bytes.
                pid=$(awk '$2 == "(apache2)" && $23 > 800000000 { print $1 }' "$dir/stat" 2>/dev/null)
                if [ -n "$pid" ]; then
                        echo "Killing $pid because of load average: $(awk '{ print $1 }' /proc/loadavg)"
                        kill -9 "$pid"
                fi
        fi
done
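The awk test can be tried without touching any real process by feeding it a made-up stat line. What follows is only a minimal sketch: the helper names big_apache_pid and sample_line and the sample PIDs and sizes are invented for illustration and are not part of the script above. It assumes, as the script does, that the process name contains no spaces, so that the virtual size really is the 23rd field.

```shell
#!/bin/bash

# Hypothetical helper: reads /proc/<pid>/stat lines on stdin and prints
# the PID of any process named (apache2) whose virtual size (field 23,
# in bytes) is larger than the limit given as the first argument.
big_apache_pid() {
    awk -v limit="${1:-800000000}" \
        '$2 == "(apache2)" && $23 + 0 > limit + 0 { print $1 }'
}

# Hypothetical helper: builds a fake 23-field stat line for testing.
# Only the PID, the name, and field 23 matter; fields 4-22 are zeros.
sample_line() {
    awk -v pid="$1" -v vsize="$2" 'BEGIN {
        line = pid " (apache2) S"
        for (i = 4; i <= 22; i++) line = line " 0"
        print line " " vsize
    }'
}

sample_line 4242 900000000 | big_apache_pid   # over the limit: prints the PID
sample_line 4243 50000000 | big_apache_pid    # under the limit: prints nothing
```

Passing a different number to big_apache_pid changes the threshold, which makes it easy to experiment with a limit other than 800MB before letting cron loose with kill.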

If anyone has any suggestions as to how to narrow down what the problem might be, I’d appreciate hearing from you. I’ve tried eliminating WordPress plugins, recompiling WordPress and Apache, and attempting to catch the behavior with a network traffic sniffer, but have come up empty so far.


JPEG2000 for Digital Preservation

Last month was an interesting month for discussion and news of JPEG2000 as an archival format. First, there was a series of posts on the IMAGELIB list about the rationale for using JPEG2000 for master files. It started with a posting by Tom Blake of Boston Public Library asking these questions:

What can I do with a JPEG2000 that I can’t do with a TIFF, a good version
of Zoomify, and a well-designed DAMS?

Don’t you need to rely on a proprietary version/flavor of JPEG2000 and a
viewer to utilize its full potential?

Bill Snead from Duke offered pointers in a follow-up message to Aware’s “Why JPEG2000?” whitepaper, the Olsen-Melville case study from the University of Connecticut, and Princeton’s statement of use of JPEG2000 as surrogates from a TIFF master.

I leapt into the conversation by offering an opinion that JPEG2000 is a compelling replacement for a TIFF-based practice because:

  1. JPEG2000 offers a single format for both access and preservation of digital imagery. Eliminating the complexity of managing derivatives in the creation, processing, and delivery of images is a good thing, I think. Said another way, the archival master can be the same file as the production master, with the access derivatives generated on the fly using JPEG2000's inherent scaling capabilities. Also, if your preservation master is your access master, then you will know very quickly if something is wrong with the preservation master: it no longer renders in your access system.
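As a rough illustration of the scaling capability mentioned above: a JPEG2000 file stores the image as a pyramid of resolution levels, and each level a decoder discards halves the width and height, rounding up. The sketch below only does that arithmetic. The function name reduced_size and the 6000x4000 master are made up for illustration, the image origin is assumed to be at zero, and no JPEG2000 library or decoder is involved.

```shell
#!/bin/bash

# Hypothetical helper: prints the pixel dimensions a decoder would
# deliver after discarding the given number of resolution levels.
# Usage: reduced_size WIDTH HEIGHT LEVELS
reduced_size() {
    width=$1; height=$2; levels=$3
    scale=$(( 1 << levels ))        # each level halves both dimensions
    # Integer ceiling division, so odd sizes round up.
    echo "$(( (width + scale - 1) / scale ))x$(( (height + scale - 1) / scale ))"
}

reduced_size 6000 4000 0   # the full-resolution master
reduced_size 6000 4000 3   # three levels discarded: one eighth of each dimension
```

An access system could request whichever level best fits the screen from the one master file, rather than storing a separate derivative for each size.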


Fair Use Versus the NFL with YouTube Caught in the Middle

Here is something to keep an eye on. Via the Chronicle of Higher Education, Wendy Seltzer, a visiting assistant professor at Brooklyn Law School and Fellow at the Berkman Center for Internet & Society at Harvard Law School, is demonstrating the concept of fair use to her class by going head-to-head with the National Football League. Specifically, she posted a 30-second video snippet of the NFL’s standard copyright statement to YouTube on February 8th and waited to see what would happen.

As could be expected, Seltzer received a DMCA takedown notice five days later and the content is no longer viewable on YouTube. In response, she sent a counter notification asserting fair use rights to use the excerpt as an example of overreaching copyright warnings. (Getting dizzy from all of the claims and counter claims yet?)

So the ball is back in YouTube’s court as they are stuck in the middle between Prof. Seltzer on the one side and the NFL on the other. Check out the comments to the three above postings for a real interesting take on what has happened so far and keep an eye on http://www.youtube.com/watch?v=a4uC2H10uIo to see what happens early next month.


WordPress/MySQL Tuning

dltj.org runs on a relatively tiny box: a Pentium III with 512MB of RAM. I’m running a Gentoo Linux distribution, so I actually have a prayer of getting useful work out of the machine (the server is actually a recycled Windows desktop), but the performance just wasn’t great. As it turns out, there are several easy things one can do to dramatically improve life.

The Configuration

The box is both a mail server (IMAP) and a WordPress server. A rough eyeball at the process accounting on the server shows that it spends about 40% of the time doing mail (mostly taken up by Clamscan virus scanning and spam checking) and another 40% doing MySQL and web stuff. Since there isn’t much dynamic content on the box and nothing else using the database but WordPress, I’m fairly confident that blog traffic is almost all of that 40%. I’m using MySQL 5.0.x, Apache 2.0.x and WordPress 2.0.x with about two dozen plugins.

Taking PHP Up A Notch

PHP is an interpreted programming language, meaning that each time a script runs it needs to be translated into something closer to machine code (called the ‘opcode’). (As opposed to compiled languages like C and Java, where you compile the source code in one step and then run the result in a second step.) For an application like WordPress, where the source code is not changing, this translation causes a lot of overhead. Fortunately, there is a PHP extension called the Alternative PHP Cache (APC) that will save the translated opcode the first time the script runs and use it for subsequent invocations. Getting this set up is pretty easy (these are Gentoo-specific commands; your Linux distribution will vary, and I am glossing over a number of distribution-specific details like how to install packages and where the configuration files will reside):

  1. emerge -aDNtuv pecl-apc will download and install PHP APC and its dependencies (yep — that easy…I love Gentoo)
