The View from the Moon

Notable New Features in Solaris "Nevada", Build 10 (04/2005)

Desktop

The X.org X server is updated to 6.8.2 final release.
Annoying mozilla bug (the "5.10.1" bug) fixed.

Performance

Single-threaded standard I/O performance gets a boost; we pick up about 25% on printf(3c) compared to Solaris 10, and about 2x on putchar(3c). This fixes a regression against Solaris 9.
Sherry contributed some sophisticated work to reduce cache pollution on Opteron systems by employing non-temporal access (i.e. accesses which do not dirty or displace lines in the L2 cache). Before this change, read(2) and write(2) typically loaded from a source buffer and stored to the target buffer, and both steps installed lines in the cache. It turns out that very often we don't access the data being written immediately, which means that, just to carry out the copy, we were evicting cache lines that we did need. One of the keys was working out which cases benefit from non-temporal access, and which cases are harmed by it.
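
To make the non-temporal idea concrete, here is a minimal userland sketch in C. It is not the Solaris kernel change described above; it assumes an x86 compiler that provides the SSE2 intrinsics in <emmintrin.h>, and it falls back to plain memcpy() elsewhere. The name copy_nontemporal() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Copy len bytes from src to dst. Where SSE2 is available, the bulk of
 * the copy uses non-temporal (streaming) stores, which write to memory
 * without installing the destination lines in the cache.
 * Illustration only; the buffers must not overlap.
 */
void
copy_nontemporal(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    /* Streaming stores need a 16-byte aligned destination: copy a head. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    while (len >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);  /* bypasses the cache */
        d += 16;
        s += 16;
        len -= 16;
    }
    _mm_sfence();   /* order the streaming stores before later stores */
#endif
    /* Tail (or the whole copy when SSE2 is unavailable). */
    memcpy(d, s, len);
}
```

Whether this is a win depends on whether the destination is read again soon, which is exactly the "which cases benefit" question mentioned above.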

Developer Support

Not-exactly-in-Solaris-10 but very worthy of your attention: For Java developers, a JVMTI/JVMPI provider for DTrace-- you can instrument method entry/exit, object allocation/free, garbage collection, etc. and blend it all in with the rest of your DTrace probes! See Adam's and Bryan's articles about this.

Networking

The TCP keepalive probing period is now tunable via an ndd parameter, tcp_keepalive_interval, and via a socket option, TCP_KEEPALIVE_THRESHOLD. See tcp(7P).
The S2IO 10-gigabit driver (xge) has been updated; the company is now called Neterion. The update improves performance, and adds some feature enhancements, including Jumbo-frame support.
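
For the keepalive socket option mentioned above, a per-socket setup might look like the following C sketch. It assumes that TCP_KEEPALIVE_THRESHOLD is set at the IPPROTO_TCP level with an unsigned integer value; the units and exact semantics are whatever tcp(7P) documents, so check there rather than trusting this sketch. On systems that do not define the option, the code only turns on SO_KEEPALIVE. The name enable_keepalive() is hypothetical.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Enable keepalive probing on a TCP socket and, where the platform
 * offers TCP_KEEPALIVE_THRESHOLD, set the per-socket probing threshold.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 */
int
enable_keepalive(int sock, unsigned int threshold)
{
    int on = 1;

    assert(sock >= 0);
    if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof (on)) != 0)
        return (-1);
#ifdef TCP_KEEPALIVE_THRESHOLD
    /* Assumption: value is an unsigned int; see tcp(7P) for its units. */
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPALIVE_THRESHOLD,
        &threshold, sizeof (threshold)) != 0)
        return (-1);
#else
    (void) threshold;   /* option not available on this platform */
#endif
    return (0);
}
```

The ndd parameter changes the system-wide default, while the socket option lets one application choose its own value without affecting others.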

Other

SunVTS updated from 6.0 to 6.1

(2005-04-25 04:25:01.0)

Sunday April 17, 2005

OpenSolaris, Solaris 10 at MySQL User's Conference
This week is the MySQL User's Conference. Stephen Harpster is going to give a talk entitled OpenSolaris: Innovation Happens Everywhere on Wednesday at 11:20am. Attend to learn all about the OpenSolaris project!

Following that, and on the heels of last week's BoF at USENIX '05, I'm throwing together a BoF entitled MySQL and Solaris 10. It is on Wednesday at 9:30pm in the Magnolia Room of the Westin Santa Clara.

During the BoF we hope to talk about what makes Solaris 10 a good platform for running MySQL, and see what else we need to do to improve the OS in order to run MySQL. If you're there, please join us!
(2005-04-17 22:04:10.0)

Wednesday April 13, 2005

Live, from Anaheim
Tonight I led the Solaris 10 BoF session at USENIX '05. Bart, Liane, Alan, David Bustos, Matt, and John Clingan were there to help. We also spotted Rich, Jim and other Solaris Luminaries. John said that we had about 80 non-Sun folks in attendance, which I think is pretty good. The BoF ran from 8pm-11pm, and we didn't escape the room until about midnight, when David turned out the lights and forced us all up to the hotel bar, where we stayed until 2am! Thanks for coming, everyone!

I give the crowd an "A" for insightful questions, a willingness to share opinions, and a lot of discussion about Sun, Solaris and OpenSolaris in the marketplace, in academia, and in research. For me, it was interesting to contrast the discussion with the one we had at LISA '04, which was focused on issues like DHCP, LDAP, and JumpStart. I was a little less happy with my own performance-- maybe it was the Claritin, the lack of sleep, the scent of Disneyland in the air, or whatever, but I was less coherent than I had hoped to be. If my introduction to Zones, DTrace, the details of the CDDL license (see also Andy's blog), or anything else was lacking, check out the aforementioned links, or leave me a comment.

If you're in town for USENIX don't miss Liane's Developer BoF tonight (Wednesday evening)! This isn't a replay of Tuesday's BoF. She'll lead a deeper tour of DTrace, SMF, /proc tools and other developer topics. Ok... time for sleep.
(2005-04-13 03:28:13.0)

Wednesday April 06, 2005

Smacking Super-Smack into Shape for Solaris
A recent article (and part 2) documented the author's attempts to benchmark MySQL performance on a variety of operating systems. He noted that a popular MySQL benchmark called Super-Smack doesn't compile on Solaris. I hate when this happens-- inevitably this tells the reader that Solaris isn't really a serious platform for MySQL (how could it be, if the benchmark doesn't work?). But from the other benchmarks the author provides, we can see that Solaris performs respectably; and I suspect that MySQL has itself not received as much tuning under Solaris as, for example, under Linux. With the help of performance analysis tools like DTrace, trapstat, etc. we (or you, gentle reader) can fix that.

First, I'd like to clarify one claim in the article: Solaris 10 is bundled with a compiler. That wasn't true in the beta build the author used; but it is true as of the FCS build. So, the benchmark will compile without installing any additional software.

I decided that it was time to get Super-Smack working under Solaris. The first task was to get the program to compile. The configure script ran OK, and I elected to just use the MySQL included in Solaris 10. For more serious benchmarking, I would study this to decide whether to build MySQL myself. So:

$ PATH=/usr/bin:/usr/sbin:/usr/sfw/bin:/usr/ccs/bin
$ export PATH
$ ./configure --with-mysql \
     --with-mysql-lib=/usr/sfw/lib/ \
     --with-mysql-include=/usr/sfw/include/mysql \
     --prefix=/home/dp/super-smack

I then had to make a few edits: src/Makefile needs to link the benchmark with the additional libraries -lsocket -lnsl. A proper autoconf setup should detect this, but... no problem. A few minor edits to C files were also needed:

Added #include <strings.h> to engines.cc for bzero.

In query.cc, replaced calls to flock() with calls to fcntl(2):

  -  flock(1, LOCK_EX);
  +  fcntl(1, F_SETLK, F_WRLCK);

It turns out that flock is used only sparingly, at the end of the benchmark run, so we don't need to pay attention to any performance implications.
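
One caveat about the substitution above: fcntl()'s F_SETLK command takes a pointer to a struct flock as its third argument rather than a lock-type constant, and F_SETLKW is the variant that blocks the way flock(fd, LOCK_EX) does. A fuller replacement could look like this sketch; lock_whole_file() and unlock_whole_file() are hypothetical helper names, not Super-Smack code.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Take an exclusive (write) lock on the whole file behind fd, blocking
 * until it is granted, much as flock(fd, LOCK_EX) would.
 * The descriptor must be open for writing.
 * Returns 0 on success, -1 on error (errno is set by fcntl).
 */
int
lock_whole_file(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof (fl));
    fl.l_type = F_WRLCK;        /* exclusive lock */
    fl.l_whence = SEEK_SET;     /* l_start is relative to start of file */
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */

    return (fcntl(fd, F_SETLKW, &fl));
}

/* Release the lock taken by lock_whole_file(). */
int
unlock_whole_file(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof (fl));
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;

    return (fcntl(fd, F_SETLK, &fl));
}
```

Note that fcntl() locks are per-process record locks, so their semantics differ slightly from flock(); for a lock taken once at the end of a benchmark run that difference should not matter.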

So now, it built cleanly! Hooray. Next, I muddled my way through getting mysqld started. Once I did, I had to cope with one more problem: Super-Smack, by way of libmysqlclient, seems to want to access the mysql database via a UNIX domain socket at /var/lib/mysql/mysql.sock. However, the database seems to put that socket in /tmp/mysql.sock. I wasn't sure why, and I decided to investigate that discrepancy later. I hacked things up by putting an appropriate symlink in /var/lib/mysql to work around the problem.

Next, I ran Super-Smack as instructed in the article, and things went somewhat haywire. A quick look revealed that Super-Smack has a fairly conventional design: A parent process forks a bunch of children, which do benchmark activities. When these are finished, they write information back to the parent. I received a variety of error messages, and after applying truss to some abbreviated runs, Jonathan and I decided that the parent super-smack process was exiting prematurely, and failing to collect the data being sent to it by its children. A quick scan of the source code led me to this innocuous looking line of code:

      pid_t pid = wait4(-1, 0, 0, NULL);

This is where the master super-smack process waits for its children. My brokenness-sense was tingling. That -1 just looks wrong. And, for Solaris, it is. This first argument, -1, is the pid to wait for. In Linux's wait4, this is implemented as follows (excerpted from the Linux man page):

    < -1
        which means to wait for any child process whose process group
        ID is equal to the absolute value of pid. 
    -1
        which means to wait for any child process; this is equivalent
        to calling wait3. 
     0
        which means to wait for any child process whose process group ID
        is equal to that of the calling process. 
    > 0
        which means to wait for the child whose process ID is equal
        to the value of pid.

So now we know what the author meant: Wait for any child process. While not fully documented (which is a bug), Solaris implements a slightly different ruleset:

    < 0
        which means to wait for any child process whose process group
        ID is equal to the absolute value of pid.
     0
        which means to wait for any child process; this is equivalent
        to calling wait3.
    > 0
        which means to wait for the child whose process ID is equal
        to the value of pid.

So, on Solaris, wait4(-1, ...) instructs the OS to wait for any child process whose process group ID is 1, while on Linux, it does a wait3(). Damn. A final note is that wait4() is not defined by POSIX, the Single UNIX Specification, or any standards body I could find. Please, write portable code! One wonders why wait3() wasn't used in the first place. Quickly changing the code fixes the problem.
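
The post does not show the exact change, so the following is only one plausible portable form, not necessarily the fix that was applied. POSIX defines waitpid() with a pid of -1 to mean "any child process", which removes the ambiguity between platforms. The names reap_any_child() and spawn_child() are hypothetical; the second exists only to make the sketch easy to exercise.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Wait for any one child to terminate, retrying if a signal interrupts
 * the call. Returns the child's pid, or -1 on error (for example ECHILD
 * when no children remain). The exit status is stored in *status.
 */
pid_t
reap_any_child(int *status)
{
    pid_t pid;

    assert(status != NULL);
    do {
        pid = waitpid((pid_t)-1, status, 0);
    } while (pid == -1 && errno == EINTR);

    return (pid);
}

/* Fork a child that exits immediately with the given code. */
pid_t
spawn_child(int exit_code)
{
    pid_t pid = fork();

    if (pid == 0)
        _exit(exit_code);
    return (pid);   /* child's pid in the parent, or -1 on failure */
}
```

A parent that forked N workers would call reap_any_child() N times, checking WIFEXITED() and WEXITSTATUS() on each status before reading that worker's results.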

At this point, I have what appears to be a working Super-Smack on Solaris, and some initial results. I'll intentionally not mention what hardware this was run on, since I've not bothered to perform even rudimentary performance analysis:

    $ super-smack /smacks/select-key.smack 10 10000 
    Query Barrel Report for client smacker1
    connect: max=330ms  min=6ms avg= 59ms from 10 clients 
    Query_type      num_queries     max_time        min_time        q_per_s
    select_index    200000          0               0               8590.39

mpstat(1m) shows that this benchmark spends a lot of time abusing the system call path, and twiddling bits in userland; it's not clear whether this is really a good test of MySQL performance, since the test client and the database have to fight for CPU resources... All in all, not a terrible night's work. I owe Jonathan a big thanks for his help!
(2005-04-06 11:30:01.0)

Monday March 28, 2005

What's New in Solaris Express 3/05 (Nevada Build 9)
I got chastised for not blogging often enough! I'll try to do more in the coming month. For now, Solaris Express 3/2005 is just around the corner (I believe it will come out tomorrow, 3/29/2005); here's a rundown on what's new.

Notable New Features in Solaris "Nevada", Build 9 (03/2005)

Desktop

New "Never Print Banner" option in the Solaris Print Manager.

Hardware support

MPxIO (Solaris's Multipath I/O feature) compatibility problems with IBM FAStT900 and FAStT600 arrays have been corrected.
A significant bug, 5042195 "Only part of disk is usable by fdisk or format on Solaris X86" has been fixed.
CD-ROM/DVD DMA is now always enabled. In the past, DMA caused problems with some CD-ROM and DVD drives, but the performance benefit is significant. CD/DVD DMA can still be switched off via the configuration assistant. Enabling DMA is also known to cause problems with encrypted DVDs; as a workaround, disable DMA. This should be fixed when snv_11 comes out.

Security

A new command, embedded_su(1M), allows an application to prompt for credentials and execute commands as the super user or another user, using su(1M) as a backend. This makes it easy (easier) to develop non-setuid GUIs which invoke privileged actions. Cool!

Performance

rand_r(3c), rand(3c) and pthread_once(3c) are faster. malloc(3c) and free(3c) are slightly faster.

Developer Support

The libc atomic_ops(3c) have been expanded. See atomic_cas(3c), atomic_bits(3c), atomic_swap(3c) and membar_ops(3c). These are great for writing tricky code which maintains portability across ISAs. Thanks to Jonathan for pointing out that I forgot to mention this.
The kernel gets a suite of handy atomic data manipulation routines, similar to those provided by atomic_ops(3c) (including all the new routines highlighted above). See atomic_ops(9F), atomic_bits(9F), etc. (but note a bug in the man pages: kernel code must #include <sys/atomic.h>, not <atomic.h>).
plockstat(1m) and lockstat(1m) pick up some new options. plockstat gains "-e" (limit elapsed tracing time), "-n" (limit entries printed in output), and "-v" (print a message to indicate that tracing has started). Both commands acquire a "-x" option, which enables further tuning by setting various DTrace tunables.

Networking

The Network Layer 7 Cache (NL7C) revises the kernel NCA (Network Cache and Accelerator) by moving NCA's HTTP layer and object cache into the kernel's socket layer. Previously, NCA provided a completely separate TCP/IP stack inside the kernel, in order to provide the highest possible performance for web servers which were NCA-enabled. With the development of the FireEngine TCP/IP stack in Solaris 10, this extra TCP/IP stack can now be expunged. NL7C also further improves upon NCA performance by providing lower first-byte latency. Prefetch for sendfilev(3ext) is also added. Applications which already use the NCA apis are supported without modification. NL7C provides a framework for accelerating other L7 protocols in the future.
dhcpsvc.conf(4) and the dhcpmgr(1M) GUI have a new "Owner IP" option. This allows you to optionally specify which IP address "owns" the DHCP network records a Solaris DHCP server manages. This is used by the server to determine which dhcp_network(4) records it is allowed to allocate. This feature is especially useful in cases where the DHCP server needs to be moved temporarily to a different system or address, and also in cases where the server may not have a stable IP address.
TCP and UDP ephemeral port selection is now randomized; this uses a high quality random number source in the kernel, raising the difficulty level of forging a valid RST.

Other

New "poolbind -e" option allows one to easily run a command, binding that command to the resource pool in question.

(2005-03-28 23:20:01.0)

Tuesday March 01, 2005

What's New in Solaris Express 2/05 (Nevada Build 7)
Welcome to Nevada! Nevada is our code-name for the next version of Solaris. For now at least, the uname -r output is 5.10.1 although that is subject to change. As I did for the last couple of Solaris 10 SX builds, I'll attempt to keep you abreast of the changes happening in each SX release. I missed doing one which described the delta between SX 11/04 (s10_72) and the FCS build of Solaris 10. The most important of those changes include:

Inclusion of gcc for SPARC, x86 and AMD64 (/usr/sfw/bin/gcc).
Intel 10Gb NIC driver (ixgb driver)
svcadm (part of SMF) picked up a synchronous mode (via -s)
BIND 9 became the default name server. BIND 8 was removed.
A large fraction of binaries delivered (including kernel modules) are now cryptographically signed. See elfsign(1).

Ok. Now entering Nevada. Notable New Features in Solaris Nevada, Build 7 (AKA Solaris Express 02/2005):

Desktop

Updated Xorg from 6.8.0 to 6.8.2RC2, including numerous bug fixes and new hardware support (see the X.org release notes). The final version of 6.8.2 will be available in a future Solaris Express build.
An annoying bug in the /usr/sfw/bin/mozilla script prevents it from starting up properly. Edit the OS_VERSION check in the script to work around the problem.
You can now double-click .jnlp (Java Web Start) and .jar files to run them under GNOME.

Hardware support

via823x SADA audio driver on x86 and AMD64 platforms.
Chelsio 10Gb NIC driver available on all platforms (SPARC, x86, AMD64).

Security

64-bit openssl(1) command available. Solaris already ships with a 64-bit openssl library. The openssl command provides a tool for using various cryptography functions of OpenSSL's crypto library from the shell.
Support for a PKCS#11 "MetaSlot". This is an extension to the Cryptographic framework which presents a single slot which is the union of the capabilities of other slots which are loaded in the framework.
IKE gets a performance boost by using the encryption framework. IKE is also now fully compliant with RFC 3947 (NAT-T support).

Storage

iSCSI devices are now supported via the new iscsiadm(1m) command.
The fcinfo(1m) utility is now available; this utility can be used to list fibre channel ports on the system in a concise and clear fashion.

Performance

Hierarchical (Multi-level) Lgroup support. Solaris has an abstraction called an Lgroup (latency group) which is the way in which the system tracks NUMA system topology. Traditionally, Solaris has run on systems with no difference in latency (traditional SMP systems) or only two levels of latency (local memory and remote memory). Newer system designs have more levels. For example, 4-CPU Opteron systems have 3 such levels; 8-way Opteron systems may have 4 levels. This project enables better performance on ring and ladder system topologies, and picks up performance wins on Oracle (TPC-SO), Fluent, and other benchmarks. There are some new liblgrp APIs to go along with this work (lgrp_latency_cookie(3LGRP)).
Faster memmove(3c) (anywhere from 0-400%, 40% is typical) on 32-bit x86 platforms. AMD64 performance of memmove(3c) and bcopy(3c) was also improved.
Improved context switch performance on AMD64.
Much improved performance on 32-bit x86 string functions: strcpy(3c) (as much as 50%), strlen(3c) (as much as 25% on long strings) and strchr(3c) (as much as 45% on long strings).

Other

New TCP_INIT_CWND TCP socket option allows the congestion control window calculation to be overridden with a user specified value. See tcp(7p) for full details.

(2005-03-01 21:20:00.0)

Squid startup: Extreme Makeover with SMF
I run a web proxy server for folks in the office; we use it as a longterm testbed for Solaris. But, in the insanity leading up to the release of Solaris 10, I've had little time to work on it. Recently I got a new server to host the cache; and so I've been busy putting it together. We've always used Squid as our proxy software. Personally, I have some qualms about Squid's design, but with five years of experience using it, I think we'll probably stick with it for now. It's a curious thing that there doesn't appear to be a significant competitive open source alternative to Squid (the forthcoming Apache 2.1 is moving mod_cache out of "experimental" support so perhaps that will be worth considering?).

Setting aside design complaints, being able to effectively administer Squid is a big priority, so recently I worked on getting it properly under the control of the Service Management Facility (SMF). It's also a good example of how to improve a program's administrative controls with SMF.

The first task was to look through Squid's existing start/stop/restart capabilities. There's a RunCache script, which I had always thought was the supported way to start the daemon. Looking at the documentation, RunCache is now apparently obsolete, but still installed along with squid anyway (sigh). RunCache has many problems which I won't detail here.

In the same neighborhood, there is the squid binary, which has a number of relevant command line options:

Usage: squid [-dhsvzCDFNRVYX] [-f config-file] [-[au] port] [-k signal]
...
       -f file   Use given config-file instead of
                 /aux0/squid/etc/squid.conf
...
       -k reconfigure|rotate|shutdown|interrupt|kill|debug|check|parse
                 Parse configuration file, then send signal to 
                 running copy (except -k parse) and exit.
       -s        Enable logging to syslog.
...
       -z        Create swap directories
...
       -N        No daemon mode.
...

To add to the complexity, squid has its own restarter directly built into itself. This is somewhat suboptimal, as SMF tends to trump these facilities, and allows monitoring software to have visibility into restart events. Anyway, we can make use of the -k option to control the daemon to some degree, and give the administrator the power to create multiple service instances if we use the -f option. In my testing, I found the -k reconfigure option to be somewhat useless, so I decided not to implement an SMF 'refresh' method. Perhaps I missed something?

Another problem we'd like to solve is that Squid doesn't operate properly "out of the box." First, one must run the daemon with the -z option in order to create the cache metadata. I'm not sure why the squid team made this decision; I certainly don't think it's a good one. Our startup scripting can simply take care of cache creation for the administrator. After working out the right set of dependencies for the cache as I'd set it up (./configure --disable-internal-dns --enable-ssl --prefix=/aux0/squid --enable-storeio='ufs aufs'), I prepared a service manifest file which captured those dependencies; the dependencies look like this:

$ svcs -d squid 
STATE          STIME    FMRI
online         Jan_26   svc:/milestone/network:default
online         Jan_26   svc:/system/filesystem/local:default
online         Jan_26   svc:/network/dns/client:default
online         Jan_26   svc:/milestone/sysconfig:default

The network milestone is the stable way to depend on "networking being up on the box." A buglet in some of the S10 FCS manifests (notably, Apache) is that some of them have finer grained, and less stable dependencies (for example, on network/physical). When stable dependencies in the form of milestones are available, please use them.

Note that the default mode for squid is to use its own internal DNS library (ugh), so you may or may not need the DNS dependency. This is (double ugh) a compile time setting. Regardless, you'll want to have an /etc/resolv.conf file present, and the network/dns/client manifest checks for that.

Next, I worked on revising the startup script to be much more intelligent. To start up the cache, it uses squid's -k parse option to decide whether the configuration file has a valid syntax. If not, it exits with the $SMF_ERR_CONFIG error code, which indicates a configuration problem. Next, it populates the cache directory using squid -z as needed. Finally, it starts up the cache. Every failure logs a clear and detailed log message.

I also added a couple of service properties, which the script uses to set its behavior. Ideally, this will be automatically and correctly generated from the configure script in the future; for now, just tweak the manifest before importing it. In the example manifest, squid has been configured to be installed into /aux0/squid. You will want to search the file and alter all of the places which reference /aux0/squid, adjusting them for your installation (you can also use svccfg after you import the manifest to make corrections). Here is a draft of the network/http-proxy:squid service manifest; and a draft of the svc-squid startup script. To install:

Tweak squid.xml to reflect the Squid installation directory.
Copy the svc-squid script to the location reflected by squid.xml.
svccfg import /path/to/squid.xml
svcadm enable squid

I hope this is helpful! I'd be happy to take suggestions for improvement, and please let me know if you wind up using this successfully.

[Sigh. Sometimes I feel like I'm just too slow to post. Since I started this post a month ago, some of the work Trevor posted obviates mine. While I'm not happy about having multiple similar solutions to a single problem I think this represents a substantial improvement, and it did take quite a while to refine into the current state. It has also been checked and nitpicked by the SMF team, so I'm optimistic that it is roughly correct. One interesting result is that my dependencies are different than the set which Trevor worked out. Determining the right set of dependencies is, at present, a bit of a black art.]
(2005-03-01 04:45:01.0)

Thursday February 24, 2005

The Year of the Rooster

I've been in Singapore for nearly a week, since leaving Tokyo. Sadly, I've spent much of the time sick, with a moderate cold. I had thought I was on the mend, but today I mostly lost my voice, just in time for my training presentations! Sigh...
Singapore is a pretty amazing mix of cultures. Before I got sick, I managed to get to the Lunar New Year's celebration (as you can see, it's the year of the rooster). There are some more pictures here.

(2005-02-24 00:25:00.0)

Tuesday February 15, 2005

Lost in Shinjuku

Jonathan and I have been spending the week in Tokyo, doing a series of training presentations and customer visits on various Solaris 10 topics. The culture shock is moderately intense, but it helps that everyone is friendly, the city is immaculately clean, and that there is always something new right around the next corner. We've had some outstanding food, and had a couple of days to do the usual touristy stuff. I posted some pictures of Tokyo. Tomorrow we will present to 250 customers, and then on Friday I will go on to Singapore...

(2005-02-15 19:52:28.0)

Wednesday January 26, 2005

Remote, Secure Zone Console Login
I have heard from a number of customers that folks would like remote login to zone consoles. In particular, they would rather not give out logins to the global zone in order to allow zone logins. (Really: I don't spend all of my time on the zones console...).

Fortunately, we can handle this in a nice way already. (Disclaimer: Please note that as stated by the script, the following techniques have not been subject to a rigorous security audit. I believe this technique to be sound, but neither I nor Sun warrant it to be so.)

To start, we'll add a user account to /etc/passwd for each zone we want to set up this way:

# cat >> /etc/passwd
z1:x:999999:999999:xanadu-z1:/tmp:/opt/extras/zoneshell
^D

# pwconv
# passwd z1
New Password: xxxyyy
Re-enter new Password: xxxyyy
passwd: password successfully changed for z1

In this case, the zone name is xanadu-z1 and we've picked a nice large UID and group ID. You could use whatever you like (but not a UID in use for something else! and never 0); you'll want a separate UID for each zone. In this case, /opt/extras/zoneshell is set as the z1 user's shell. We picked 'z1' as the account name because UNIX systems are typically limited to 8 letter account names (LOGNAME_MAX); since xanadu-z1 is 9 characters long (and zone names may be up to 64 characters long), we need to pick a convention to shorten things.

The zoneshell script is here; the script itself is very simple: it looks up the entry in /etc/passwd and executes zlogin -C for the zone named in the GECOS field.
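
The actual zoneshell is a script and is not reproduced here. Purely to illustrate the logic just described, here is a C sketch that does the same lookup and exec. The path /usr/sbin/zlogin, the set of characters accepted in a zone name, and the function names are assumptions made for this example, and like the script itself it has had no security audit.

```c
#include <assert.h>
#include <ctype.h>
#include <pwd.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return 1 if name looks like a plausible zone name: 1 to 64 characters,
 * starting with an alphanumeric, then alphanumerics, '-', '_' or '.'.
 * (Assumed rules for this sketch; rejects anything shell-unfriendly.)
 */
int
zone_name_ok(const char *name)
{
    size_t i, len;

    assert(name != NULL);
    len = strlen(name);
    if (len == 0 || len > 64 || !isalnum((unsigned char)name[0]))
        return (0);
    for (i = 1; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        if (!isalnum(c) && c != '-' && c != '_' && c != '.')
            return (0);
    }
    return (1);
}

/*
 * Look up the calling user's passwd entry and exec "zlogin -C" on the
 * zone named in the GECOS field. Returns -1 only if something failed;
 * on success the process image is replaced and this never returns.
 */
int
zone_console_login(void)
{
    struct passwd *pw = getpwuid(getuid());

    if (pw == NULL || pw->pw_gecos == NULL || !zone_name_ok(pw->pw_gecos))
        return (-1);
    execl("/usr/sbin/zlogin", "zlogin", "-C", pw->pw_gecos, (char *)NULL);
    return (-1);    /* execl() failed */
}
```

Calling execl() directly, rather than building a command line for a shell, means the GECOS text is never interpreted by a shell; the name check is a second line of defense.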

Finally, we need to give the z1 account the ability to run zlogin; we do that by modifying the RBAC attributes for the z1 user.

# cat >> /etc/user_attr
z1::::profiles=Zone Management
^D

So, here's what it looks like:

$ ssh -l z1 xanadu
Password: xxxyyy
Last login: Tue Jan 25 13:54:01 2005 from xxx
warning: using experimental, unsupported 'zoneshell'
[Connected to zone 'xanadu-z1' console]

I'd appreciate any feedback on whether this is helpful, or not!

To reiterate: this code is experimental, and has not been audited for its security characteristics. Use of this script is AT YOUR OWN RISK. Please use this as an example, from which you could derive your own implementation.
(2005-01-26 19:00:00.0)

Sunday January 23, 2005

Clearing up confusion about zlogin, zones, consoles, and terminal types
Thanks to bloglines' nice search feed feature, I found this thread on the Solaris x86 Yahoo group.

Phillip asks why, when he issues a zlogin -C to a zone, it asks him which terminal type he'd like to use. For those who might not have seen zlogin before, it's a tool patterned on the syntax of rlogin and ssh; one uses it to enter a zone from the global zone. The "regular" way one would use it is as follows:

$ zlogin myzone

This will insert you into the zone with an appropriate subset (including $TERM) of your environment propagated from the global zone. So, if you are using an xterm and your $TERM is "xterm", then that will be propagated correctly into the zone. This is all implemented using pseudo-terminals (the same things used to make telnet, ssh, etc. do what they do); they are pretty easy to deal with-- when you need one, you create it from nothing, then start some processes which are connected to it in some fashion. You have full control of the process environment. In this mode, zlogin will never ask you what terminal type you have; if $TERM is unset in your global zone shell, it will either be unset, or default to something like dumb inside the zone, depending upon your shell.

Zones also possess a virtual console, which can be accessed using the zlogin -C command. And this is where Phillip is having problems. A console is fundamentally different from a pseudo-terminal. While a pseudo-terminal vanishes once you stop using it, a console (real or virtual) keeps its state; you can connect to and disconnect from it at any time. Users familiar with using the tip(1) command or other serial console systems know that they must often tweak some settings after attaching to a console. Think of the console as a television-- the programs are always playing, regardless of whether the set is on or not; you can choose to watch the set or not by turning it on (i.e. connecting to the console).

Phillip recalls having already answered that question when he installed the system as a whole. In a subsequent post he is more critical, since it isn't intuitive why we ask for this information again. Since this is my work, hopefully I can show why this isn't "sloppy" as Phillip asserts, but rather an unavoidable artifact of the way UNIX consoles function.

To understand this, we need to turn to another important distinction: the terminal type of the system's console should usually be set to reflect the kind of hardware which comprises the physical system console. On Sun's SPARC boxes, this is sun and on x86 we have sun-color. This is important, because these terminal types are pretty much incompatible with, for example, the xterm terminal type.

On the other hand, if a machine's console is instead set to be one of its serial ports, and is accessed over a tip line, then the default terminal type is usually set to something fairly benign like xterms (xterm-small) or vt100 or the like-- but this setting must be made by the administrator because there is no protocol for serially connected terminals to identify their terminal type, a limitation of the hardware.

Zones emulates the latter sort of connection-- a zone console is analogous to a serially connected tip line. At one end is your terminal, the type of which is not automatically known to the console at the other end. It is probably an xterm (xterms), a gnome-terminal, a dtterm, or the like. It might also be a vt220, a wyse or any of hundreds of others. So, just as we do at first system boot (if we can't work out what type of terminal we are connected to by querying the openprom device), we query the user the first time the zone is booted. After that, we'll remember this setting. I suppose that we could have just defaulted to something such as 'vt100' but that also seems unfriendly; the sysid tools (the stuff that asks you for your hostname, timezone, etc.) make extensive use of curses, which tends to spam your terminal with garbage if its idea of your terminal type doesn't match your terminal hardware (or emulator). We certainly can't default to the system's default setting, since that is highly unlikely to be compatible with your window system terminals; if the zone operates as though the terminal is sun and you are using an xterm, you won't be pleased.

It's also worth mentioning that you can automate away all of these first-zone-boot questions by employing an /etc/sysidcfg file.
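As a rough illustration, a sysidcfg for a zone might look something like the following. Treat it as a sketch based on the keywords documented in sysidcfg(4), not a tested file: the hostname, time zone and terminal type are placeholders I made up, and the root_password value has to be an encrypted password string of your own. The file goes into the zone's root filesystem (under the zone's zonepath, at root/etc/sysidcfg) before the zone's first boot.

```
system_locale=C
terminal=xterm
network_interface=primary {
    hostname=myzone
}
security_policy=NONE
name_service=NONE
timezone=US/Pacific
root_password=<encrypted password string>
```

The terminal keyword is the one that answers the terminal type question discussed above.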

Next up-- how do you change the terminal type if you've made a mistake during the sysid configuration? You'll know this happened if your screen is filled with gobbledygook characters when you'd normally see the "what is your hostname?" question. It's nice that you have the non-console zlogin available when you encounter situations like this. To repair things, log off of the zone console, and run:

# zlogin myzone /usr/sbin/sys-unconfig

This will blank the zone's system configuration and then halt the zone. Boot the zone back up, log onto the console, and start again.

[Update: It's not the case that after you specify the terminal type for the sysid tools, this will automatically become the console terminal type; arguably, this would be good, but it also doesn't match the behavior of earlier Solaris releases. We'll take a look. See the tip below for how to set your console's terminal type.]

What if you changed your mind about what the default terminal type of the console ought to be? The classic "big hammer" method is to simply run the sys-unconfig utility; this has the downside of pretty much blanking your system's networking configuration, but it is effective.

In older versions of Solaris, you can also edit the ttymon line in /etc/inittab. Starting in recent builds of Solaris 10, all of this is controlled by SMF, the new Service Management Facility; as a result, changing the terminal is pretty simple. To check your current setting:

$ svcprop -p ttymon/terminal_type system/console-login
sun

To see all of the ttymon family of properties, issue:

$ svcprop -p ttymon system/console-login
ttymon/device astring /dev/console
ttymon/label astring console
ttymon/modules astring ldterm,ttcompat
ttymon/nohangup boolean true
ttymon/prompt astring \`uname\ -n\`\ console\ login:
ttymon/timeout count 0
ttymon/terminal_type astring sun

And to change your console terminal type (as root):

# svccfg -s system/console-login 'setprop ttymon/terminal_type = sun'

So now we have a supported, upgrade-safe way to change all of the elements of the console's configuration. Sweet!
(2005-01-23 17:30:00.0)

Friday January 07, 2005

More on Bootchart for Solaris
So Eric let the cat out of the bag on our reworked BootChart for Solaris. Eric posted one image; here is another: the bootup of a single Solaris Zone. I'm pretty happy with this, as we boot in only about 7 seconds.

This was an interesting experience because Eric and I had not previously worked together very closely. I had a great time doing this, and because Eric immediately stuck everything into a Teamware workspace, we were able to work simultaneously. Eric and I both worked on the D scripting, and somehow a joke about "wouldn't it be nice if the script output XML?" turned into our strategy. This turned out to be a good decision, as I hate writing parsers; instead we just let SAX do the work. We were able to maintain the split of having boot-log generation in one self-contained component, and the log processing in another. Because the XML logs are in an easily parsed format (as opposed to parsing the output of top and iostat), they can be useful to anyone doing boot analysis on Solaris. We've already had some such requests. I'm sure Eric will have more to say about the implementation so I'll leave it to him, except to say that some of the visual design changes can be blamed on me, taking inspiration from Edward Tufte's work.

Something else which fell out of this experience is that it's easy to use the log gatherer on any collection of processes which start up (as we saw in the zones example, above). We hope that this will be helpful in highlighting performance wins in the startup of complex ISV programs.

Following the experience of Linux developers, we've also found a series of bugs in Solaris with this tool. Let's start with an easy one, and the first one I found. The bug is visible in this chart from xanadu, my Shuttle SN45G4. Because we don't have support (sadly) for the on-board ethernet on this box, I had inserted another network card (I won't name the vendor, as I don't want to put them on the spot). If you look carefully at this bootchart, you'll see that the ifconfig process is spending a lot of time on the CPU. What's up with that? A brief investigation with DTrace made it clear that the driver had code like the following (shown here reduced as pseudo-code) in the attach(9E) codepath (attach is the process by which a driver begins controlling a hardware device):

    for (i := 0 to auto_negotiation_timeout) {
        if auto_negotiation_complete()
            return success;

        wait_milliseconds(100);
    }

Which all looks fine except that wait_milliseconds() (a function defined by the driver) is a wrapper around drv_usecwait(9F) ("busy-wait for specified interval"). Busy is of course the problem. drv_usecwait is really more about waiting for short intervals for various information to become ready in various hardware registers. Busy-waiting 100 milliseconds at a time is practically forever, and ties up the CPU just spinning in a loop. The authors almost certainly meant to use delay(9F). I filed a bug, and hopefully we'll have it fixed soon (since this driver comes to us from a third party, they request that we let them make the code changes). Fun, eh?
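To make the cost concrete, here's a small user-space analogy. It is not the driver's code and not kernel code: busy_wait_ms() spins on the clock the way a drv_usecwait()-based wrapper does, while sleep_ms() gives up the CPU the way delay(9F) does. All of the function names are mine, invented for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* CPU time (in seconds) this process has consumed so far. */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Spin on the clock until ms milliseconds have elapsed (the busy-wait style). */
static void busy_wait_ms(long ms)
{
    struct timespec start, now;
    long long elapsed_ns;

    assert(ms >= 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed_ns = (now.tv_sec - start.tv_sec) * 1000000000LL +
            (now.tv_nsec - start.tv_nsec);
    } while (elapsed_ns < ms * 1000000LL);
}

/* Block for ms milliseconds, letting the scheduler run other work meanwhile. */
static void sleep_ms(long ms)
{
    struct timespec ts;

    assert(ms >= 0);
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    (void) nanosleep(&ts, NULL);  /* a signal may cut this short; fine for a demo */
}

/* Return the CPU seconds consumed while wait_fn(ms) runs. */
static double cpu_cost(void (*wait_fn)(long), long ms)
{
    double before = cpu_seconds();

    wait_fn(ms);
    return cpu_seconds() - before;
}
```

Calling cpu_cost(busy_wait_ms, 100) should report something close to 100 milliseconds of CPU time burned, while cpu_cost(sleep_ms, 100) should report next to nothing. In the driver itself, I'd expect the fix to look something like delay(drv_usectohz(100000)) in place of the busy-wait, but that's my guess at the shape of it, not the vendor's actual change.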

Another two issues we spotted concern inetd, which has been rewritten from scratch in Solaris 10; it is now a delegated restarter, which basically means that it takes some of its direction from the system's master restarter (svc.startd). The behavior we noticed sticks out on any of the boot charts, including the zone boot chart mentioned above: inetd is starting a lot of very short-lived ksh processes. Why? When I first spotted this, I used DTrace to work out the answer, as follows:

# dtrace -n 'proc:::create/execname=="inetd"/{ustack();}'
...
restart inetd
...
  0  12188                     cfork:create 
              libc.so.1`__fork1+0x7
              libc.so.1`wordexp+0x16f
              inetd`create_method_info+0x45
              inetd`create_method_infos+0x2f
              inetd`read_instance_cfg+0xc8
              inetd`process_restarter_event+0x171
              inetd`event_loop+0xfd
              inetd`start_method+0x91
              inetd`main+0xcb
              inetd`0x8054712

Ahh, so we stumble upon an (embarrassing) implementation artifact of libc-- it uses ksh to help it implement the libc routine wordexp(3c). So, every time inetd needs to wordexp() something, we wind up running a ksh. We can also see that this is not severely impacting performance, but we would like to get this fixed. Personally, I'd like to see wordexp() fixed to not rely upon ksh at all.
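If you haven't run into wordexp(3c) before: it takes a string and performs shell-style expansion on it (quoting, variables, tildes, globs), handing back an argv-like array of words. Here's a minimal sketch of a caller. The helper name and the example command line are made up, and this isn't inetd's code; presumably inetd does something similar when it reads a service's method strings, but that's only my inference from the function names in the stack above.

```c
#include <stdio.h>
#include <wordexp.h>

/*
 * Expand a command line the way a shell would and print each resulting word.
 * Returns the number of words, or -1 if wordexp() reports an error.
 */
static int expand_and_print(const char *cmdline)
{
    wordexp_t we;
    size_t i;
    int count;

    /* WRDE_NOCMD makes wordexp() fail rather than run $(...) substitutions. */
    if (wordexp(cmdline, &we, WRDE_NOCMD) != 0)
        return (-1);

    for (i = 0; i < we.we_wordc; i++)
        (void) printf("argv[%zu] = %s\n", i, we.we_wordv[i]);

    count = (int)we.we_wordc;
    wordfree(&we);
    return (count);
}
```

Calling expand_and_print("/usr/bin/example -a \"two words\"") yields three words, with the quoted pair kept together as a single argument. On the libc described here, each such call costs a fork and exec of ksh, which is why they show up so clearly on the boot chart.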

Another somewhat more subtle issue is something that SMF engineers like Liane are still looking at. It's also visible in the zone boot chart. It appears that some services (such as nfs/client) are delayed in coming on-line because it is taking the startd/inetd cabal a while to mark the services they depend on as online, even though that shouldn't really entail much work. We can see this as follows:

$ svcs -l nfs/client
fmri         svc:/network/nfs/client:default
name         NFS client service
...
dependency   optional_all/none svc:/network/rpc/keyserv (online)
dependency   optional_all/none svc:/network/rpc/gss (online)
dependency   require_all/refresh svc:/milestone/name-services (online)

$ svcs -l network/rpc/gss
fmri         svc:/network/rpc/gss:default
name         Generic Security Service
enabled      true
state        online
next_state   none
state_time   Fri Jan 07 18:22:32 2005
restarter    svc:/network/inetd:default
dependency   require_all/restart svc:/network/rpc/bind (online)
dependency   optional_all/none svc:/network/rpc/keyserv (online)

Since nfs/client depends upon GSS, it can't be started by startd until GSS is online. Liane and Jonathan have offered up some theories about why this is happening, but we've all been engaged on higher priority work, and so we haven't had much time yet to dig deeper. This serialization point appears to be costing us roughly 1 full second on boot, so it's something we need to look at further. Have a great weekend folks!

(2005-01-07 19:00:00.0)
