Saturday, 25 April 2009

Blog moved

Dear readers,

My blog moved to http://marc.blog.foreach.com. You can read more posts there.

Kind regards,
Marc

Thursday, 26 February 2009

OC4J blocked threads ... part 2

And so Christmas came and the new year 2009 was approaching, when an upgrade of the Oracle RAC was planned so we could benefit from new and faster storage.

Quite some preparation work had been done by the developers and the database administrators to ensure a (temporary) replication could be set up between the old RAC and the new RAC, so all data would be transferred more or less "automatically" into the new RAC.

And then the big day came. We stopped the OC4J containers and changed the datasources while the database administrators synced the RACs one last time, and when everything was ready we started the OC4J containers back up against the new RAC.

Uh-oh...blocked threads

We were happy the upgrade had gone smoothly and the websites were running crisper than ever, until ... the containers started to lock up for no apparent reason. Nothing we hadn't done before, so I took a thread dump of the OC4J to see what was going on:
Thread t@788: (state = BLOCKED)
- oracle.jdbc.pool.OracleImplicitConnectionCache.retrieveCacheConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=0, line=651 (Compiled frame; information may be imprecise)
- oracle.jdbc.pool.OracleImplicitConnectionCache.getCacheConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=4, line=549 (Compiled frame)
- oracle.jdbc.pool.OracleImplicitConnectionCache.getConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=65, line=430 (Interpreted frame)
- oracle.jdbc.pool.OracleDataSource.getConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=31, line=432 (Interpreted frame)
- oracle.jdbc.pool.OracleDataSource.getConnection(java.lang.String, java.lang.String) @bci=16, line=203 (Interpreted frame)
- oracle.jdbc.pool.OracleDataSource.getConnection() @bci=33, line=179 (Compiled frame)
This thread was actually deadlocked by
Thread t@29: (state = BLOCKED)
- oracle.jdbc.driver.PhysicalConnection.closeLogicalConnection() @bci=0, line=1581 (Compiled frame; information may be imprecise)
- oracle.jdbc.driver.LogicalConnection.cleanupAndClose(boolean) @bci=31, line=267 (Interpreted frame)
- oracle.jdbc.pool.OracleImplicitConnectionCache.closeCheckedOutConnection(oracle.jdbc.pool.OraclePooledConnection, boolean) @bci=45, line=1503 (Interpreted frame)
- oracle.jdbc.pool.OracleImplicitConnectionCacheThread.runAbandonedTimeout(long) @bci=155, line=258 (Interpreted frame)
- oracle.jdbc.pool.OracleImplicitConnectionCacheThread.run() @bci=93, line=81 (Interpreted frame)
We were also seeing a lot of connections to the OC4J TCP port that were in TIME_WAIT:
netstat -an | grep 12501 | wc -l
---> 444
A ticket was opened with Oracle in which we explained the situation. The first deadlock was caused by the abandoned-connection-timeout property, which we had set to 60. Oracle advised us to remove the entry or set it to -1. We had this in our datasource purely as a safety measure: if a query ever took more than 60 seconds, we would rather break it off than keep it running for possibly much longer.
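
For context, this is roughly how that property maps onto the Oracle implicit connection cache when you configure the datasource in code. A minimal sketch only; in our case the property lived in the OC4J datasource configuration, and the URL, credentials and pool sizes below are made up:

import java.sql.Connection;
import java.util.Properties;
import oracle.jdbc.pool.OracleDataSource;

// Minimal sketch of the Oracle implicit connection cache with an abandoned
// connection timeout, the same mechanism the OC4J datasource property drives.
// URL, credentials and pool sizes are made up for illustration.
public class CacheConfigExample {
    public static Connection getConnection() throws Exception {
        OracleDataSource ods = new OracleDataSource();
        ods.setURL("jdbc:oracle:thin:@//dbhost:1521/SERVICE");
        ods.setUser("scott");
        ods.setPassword("tiger");
        ods.setConnectionCachingEnabled(true);

        Properties cacheProps = new Properties();
        cacheProps.setProperty("MinLimit", "5");
        cacheProps.setProperty("MaxLimit", "50");
        // Connections checked out longer than this many seconds get reclaimed;
        // setting it to -1 (or leaving it out) disables the behaviour.
        cacheProps.setProperty("AbandonedConnectionTimeout", "60");
        ods.setConnectionCacheProperties(cacheProps);

        return ods.getConnection();
    }
}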

I found it a strange solution since we had not changed anything at the application level, nor had we upgraded JDBC drivers or anything else ... but ok ... after setting the abandoned-connection-timeout property to -1 or simply removing it, we started seeing different behaviour. Now we would get Connection Timeout exceptions from the JDBC driver ... well ... kind of logical if you never kill anything anymore; I guess the pool got used up for some reason. The stack trace would look somewhat like this:
08/12/23 03:32:26 java.sql.SQLException: Io exception: Io exception: Connection timed out
08/12/23 03:32:26 at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:77)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:132)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:200)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:264)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:575)
08/12/23 03:32:26 at oracle.jdbc.driver.ClobAccessor.getString(ClobAccessor.java:296)
08/12/23 03:32:26 at oracle.jdbc.driver.T4CClobAccessor.getString(T4CClobAccessor.java:77)
08/12/23 03:32:26 at oracle.jdbc.driver.OracleResultSetImpl.getString(OracleResultSetImpl.java:2138)
Ok, we do use CLOBs heavily, which some people running sister sites had advised against. But avoiding an Oracle feature just because it might give you problems - we thought that was too far-fetched. We saw this exception while using one of the latest JDBC drivers available, as Oracle had told us to "try this one out".
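
For illustration, this is the kind of read that kept surfacing in those stack traces (table and column names are hypothetical): a plain getString() on a CLOB column goes through the ClobAccessor frame shown above.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical table and column names. Reading a CLOB column via getString()
// goes through ClobAccessor.getString(), the frame in the stack trace above,
// and this is where queries would occasionally hang.
public class ClobReadExample {
    public static String readBody(Connection conn, long articleId) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
                "SELECT body FROM articles WHERE id = ?");
        try {
            ps.setLong(1, articleId);
            ResultSet rs = ps.executeQuery();
            return rs.next() ? rs.getString("body") : null;
        } finally {
            ps.close();
        }
    }
}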

Meanwhile we started writing restart scripts, using a bit of Python: when the socket would not answer within a decent period of time, we would issue an OC4J restart so we could at least serve client requests more or less decently. And since Christmas and New Year were coming closer, everything moved just a tad slower.
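
The idea behind that watchdog, sketched here in Java (ours was a small Python script): try to connect to the OC4J port, and if nothing answers in time, trigger a restart. The restart command and check interval are made up; the port is the one from the netstat above.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Rough sketch of the restart watchdog: probe the OC4J port and restart the
// container when it stops answering. Restart command and intervals are
// hypothetical; adapt them to your own setup.
public class Oc4jWatchdog {
    public static void main(String[] args) throws Exception {
        while (true) {
            if (!answers("localhost", 12501, 10000)) {
                // Replace with whatever restarts the container in your install.
                Runtime.getRuntime()
                       .exec(new String[] {"/opt/scripts/restart-oc4j.sh"})
                       .waitFor();
            }
            Thread.sleep(60000); // check once a minute
        }
    }

    private static boolean answers(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        } finally {
            try { socket.close(); } catch (IOException ignored) { }
        }
    }
}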

What followed was a whole list of snoops and database traces. If you care to follow this SR, just let me know your email address in a comment. Meanwhile I wrote a wrapper for our DAO and noticed that queries using CLOBs sometimes did not finish. This meant that some of those queries stayed "running" and thus we eventually ran out of connections in the connection pool.
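
A minimal sketch of what I mean by that wrapper (class and method names are made up; the real DAO wrapper was more involved): every query logs when it starts and when it finishes, so the ones that never finish stand out in the log.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.log4j.Logger;

// Sketch of the DAO timing wrapper: queries that log "executing" but never
// log "finished" are the ones stuck on the database.
// Class and method names are made up for illustration.
public class TimedQuery {
    private static final Logger LOG = Logger.getLogger(TimedQuery.class);

    public static ResultSet executeQuery(Connection conn, String sql) throws SQLException {
        LOG.debug("executing: " + sql);
        long start = System.currentTimeMillis();
        PreparedStatement ps = conn.prepareStatement(sql);
        ResultSet rs = ps.executeQuery();
        LOG.debug("finished in " + (System.currentTimeMillis() - start) + " ms: " + sql);
        return rs;
    }
}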

We were told to fiddle around with sqlnet.expire_time. After experimenting a bit we found out that this value was set to 10 minutes. Setting sqlnet.expire_time to 4 minutes helped us quite a bit, as it took much longer for JDBC to fill up all the connections in the connection pool. At least our website was more stable (and needed fewer restarts).
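
For reference, this is a server-side setting in the database server's sqlnet.ora (the value is in minutes):

# sqlnet.ora on the database server, value in minutes
SQLNET.EXPIRE_TIME = 4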

Quite some days later we encountered the following in a snoop (IP addresses obfuscated):
61035 173.904541 111.11.111.11 222.22.22.222 TCP ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
61073 174.318971 111.11.111.11 222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
61274 175.158981 111.11.111.11 222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
61998 176.839470 111.11.111.11 222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
63332 180.198899 111.11.111.11 222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
Oracle said this was a retransmission problem. Liking this explanation better, we opened a ticket with Sun to see if they had noticed any issues like this. Sun coordinated with Oracle to look at the snoops and proposed the following kernel parameter:
set ip:dohwcksum=0
And with this kernel parameter ... all problems went away. Now, I am really not enough of a Solaris expert to know exactly what this parameter does; from what I read, it disables hardware checksum offloading, so checksums are computed in software instead. I personally think it's weird to remove a "check" which is supposed to tell you whether things are correct or not ... but hey ... Sun and Oracle told us to do it, and the problems are gone.

And after 3 months this SR finally reached a Soft Close state and our website ran smoothly again. Obviously, never assume a database upgrade cannot impact your site.

Friday, 21 March 2008

OC4J Blocked threads

Infrastructure Setup

It all started back in November 2007 when I was asked by my colleagues at Foreach if I could come and help out with a performance review. The site in question was that of De Tijd (or L'Echo in French), which was being ported from Microsoft SQL Server and ASP to a mix of Oracle Database (for articles) and Microsoft SQL Server 2005 (for quotations). The frontend would be handled by Oracle Application Servers (OAS) - the usual Oracle Containers for Java (OC4J - version 10.1.3.3.0) with Apache servers in front of them. In front of those sits the usual setup of load balancers and Oracle WebCache servers.

Performance review

Of course you start with the usual performance bottlenecks: check primary keys and integrity constraints, add unobtrusive logging to the DAO objects (log4j), start caching relevant objects with reasonable expiry times (ehcache) ... do more stress tests, and so on ... Performance was reasonably satisfactory for launch, and then the day finally came: the site was launched.
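
As an illustration of the caching part, a minimal ehcache sketch; the cache name, size and TTL are made up and our real configuration differed:

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

// Minimal ehcache sketch: cache article objects with a reasonable time-to-live.
// Cache name, size and TTL are made up for illustration.
public class ArticleCacheExample {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create();
        // name, maxElementsInMemory, overflowToDisk, eternal, timeToLiveSeconds, timeToIdleSeconds
        Cache articles = new Cache("articles", 5000, false, false, 300, 0);
        manager.addCache(articles);

        articles.put(new Element("article-42", "the article object goes here"));
        Element cached = articles.get("article-42");
        System.out.println(cached == null ? "cache miss" : cached.getValue());
    }
}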

Blocked threads

It wasn't long before we started seeing strange behaviour on the origin servers (the servers behind the webcaches). At random times they would just lock up, needing a restart of the OC4J container about every day. After taking a thread dump we would notice several blocked threads like these:

frame: 0: com.evermind.server.http.HttpSite.getApplication(HttpSite.java:418)
frame: 1: com.evermind.server.http.AJPRequestHandler.initAJP(AJPRequestHandler.java:1011)
frame: 2: com.evermind.server.http.AJPRequestHandler.initRequest(AJPRequestHandler.java:615)
frame: 3: com.evermind.server.http.AJPRequestHandler.run(AJPRequestHandler.java:243)
frame: 4: com.evermind.server.http.AJPRequestHandler.run(AJPRequestHandler.java:187)
frame: 5: oracle.oc4j.network.ServerSocketReadHandler$SafeRunnable.run(ServerSocketReadHandler.java:260)
frame: 6: com.evermind.util.ReleasableResourcePooledExecutor$MyWorker.run(ReleasableResourcePooledExecutor.java:303)
frame: 7: java.lang.Thread.run(Thread.java:595)
Most of them were also pointing to the URLs of article detail pages. After finding no relevant bottlenecks in our application, we did notice a lot of BLOCKED threads when we ran a stress test against the article page. If only I could look at HttpSite.getApplication(). I tried finding the source code for Orion, but of course that failed, so I resorted to looking at the bytecode from oc4j-internal.jar *coughs*.
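
If you want to take the same peek, disassembling the class with javap does the trick (assuming oc4j-internal.jar on your install is the jar that contains it):

javap -c -classpath oc4j-internal.jar com.evermind.server.http.HttpSite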

Once you head over there you will see that this method looks up a contextCache entry based on the path. The path in this case is just the request URI (without the query string). You will also notice there is a JVM lock in there: when the path is not found, a new entry is created in the contextCache under that lock. So euhm wait ... we have pretty URLs ... does that mean we are "polluting" the contextCache with an entry per pretty URL and locking the whole thing down for no reason?
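
In rough pseudo-Java, what I took away from the disassembly boils down to this (reconstructed from the description above, not actual Orion source; names are approximate):

// Rough reconstruction of the pattern, not the actual Orion code.
// Every previously unseen path takes the lock and adds an entry, so unique
// pretty URLs keep growing the cache and serialise requests on that lock.
HttpApplication getApplication(String path) {
    synchronized (contextCache) {
        HttpApplication app = (HttpApplication) contextCache.get(path);
        if (app == null) {
            app = createApplication(path);   // relatively expensive
            contextCache.put(path, app);
        }
        return app;
    }
}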
Seems like it was ... we needed a quick fix, so we changed the backend to work with non-pretty URLs and used a ProxyPass setup to handle the logic of turning pretty URLs into non-pretty ones.

This means you end up with two virtual hosts: one that serves the frontend with pretty URLs and ProxyPasses to ... the same machine (the other vhost) using non-pretty URLs.
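
A rough sketch of what that looks like in the Apache configuration; hostnames, the URL pattern and the OC4J mount are made up, and the real rewrite rules were more involved:

<VirtualHost *:80>
    ServerName www.example.com
    RewriteEngine On
    # Turn the pretty URL into the non-pretty one and proxy it to the
    # internal vhost on the same machine (mod_rewrite + mod_proxy).
    RewriteRule ^/article/([0-9]+).*$ http://internal.example.com/article.jsp?id=$1 [P,L]
</VirtualHost>

<VirtualHost *:80>
    ServerName internal.example.com
    # Only non-pretty URLs ever reach OC4J through this vhost, so the
    # contextCache stays small. (The actual OC4J mount configuration is omitted.)
</VirtualHost>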

Result

No more blocked threads. The contextCache size was kept to a minimum. Apache handles the ProxyPasses quite well. The sites have been running stably ever since.
Are there drawbacks? Yes:
  • You are handling each request more than once by ProxyPassing it.
  • You lose your pretty URLs and need to rebuild them when generating pages ... but hey, that can be done pretty quickly performance-wise.
  • You have to be careful how you configure your applications and mount points.
  • You have to be careful how you redirect ... you don't want to redirect to a non-pretty URL (which might only resolve internally anyway).
So it seems that having your pretty URLs handled by OC4J was not a good idea. I can't help but wonder whether Tomcat would have had any issue with this...