
Enkitec is a finalist for the UKOUG Engineered Systems Partner of the Year Award


Enkitec has made it to the shortlist of UKOUG Partner of the Year Awards, in the Engineered Systems category. So if you like what we have done in the Exadata and Engineered Systems space, please cast your vote! :-)

Note that you need to be an Oracle user – and vote using your company email address (the rules are explained here).

Thanks!!!


Scalar Subqueries in Oracle SQL WHERE clauses (and a little bit of Exadata stuff too)


My previous post was about Oracle 12c SQL Scalar Subquery transformations. Actually, I need to clarify its scope a bit: the previous post was only about scalar subqueries inside a SELECT projection list (meaning that, to populate a field in the query result set, a subquery gets executed once for each row returned to the caller, instead of a “real” column value being passed up from a child rowsource).

I did not cover another use case in my previous post – it is possible to use scalar subqueries also in the WHERE clause, for filtering the result set – so let’s see what happens in that case too!

Note that the tests below were run on an Oracle 11.2.0.3 database (not 12c as in the previous post), because I want to add a few Exadata details to this post – and as of now, 18th August 2013, Smart Scans don’t work with Oracle 12c on Exadata. This will of course change once the first Oracle 12c patchset is released, but that will probably happen sometime next year.

So, let’s look into the following simple query. The bold red part is the scalar subquery (well, as long as it returns 0 or 1 rows; if it returns more, you’ll get an error during query execution). I’m searching for “objects” from a test_objects_100m table (with 100 million rows in it), but I only want to process the rows where the object’s owner name is whatever the subquery on the test_users table returns. I have also disabled Smart Scans for this query, so that the database behaves more like a regular non-Exadata DB for now:

SELECT /*+ MONITOR OPT_PARAM('cell_offload_processing', 'false') */
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM
    test_objects_100m o
WHERE
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13)
/

Note the equals (=) sign above – I’m simply looking for a single, noncorrelated value from the subquery; it’s not a more complex (and unpredictable!) IN or EXISTS subquery. Let’s see the execution plan – pay attention to the table names and execution order below:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   405K  (1)| 00:52:52 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |  7692K|   242M|   405K  (1)| 00:52:52 |
|*  3 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - filter("O"."OWNER"= (SELECT "U"."USERNAME" FROM "TEST_USERS" "U" WHERE "USER_ID"=13))
   3 - filter("USER_ID"=13)

That sure is a weird-looking execution plan, right?

The TABLE ACCESS FULL at line #2 has a child rowsource which also happens to be a TABLE ACCESS FULL, feeding rows to the parent? Well, this is what happens when the query transformation engine pushes the scalar subquery “closer to the data” it’s supposed to be filtering, so that the WHERE user_id = 13 subquery result (line #3) gets evaluated once, first in the SQL execution data flow pipeline. Actually it’s slightly more complex: before evaluating subquery #3, Oracle makes sure that there’s at least one row to be retrieved (and surviving any simple filters) from the table in #2. In other words, thanks to evaluating the scalar subquery first, Oracle can filter rows based on the subquery output value at the earliest possible point – right after extracting the rows from the data blocks (using the codepath in the data layer). And as you’ll see later, it can even push the resulting value to the Exadata storage cells for even earlier filtering there.

The first place where I usually look, when checking whether some transformation magic was applied to the query, is the outline hints section of the execution plan (which you can get with the ADVANCED or +OUTLINE options of DBMS_XPLAN):

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('11.2.0.3')
      DB_VERSION('11.2.0.3')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$2")
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "O"@"SEL$1")
      PUSH_SUBQ(@"SEL$2")
      FULL(@"SEL$2" "U"@"SEL$2")
      END_OUTLINE_DATA
  */
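By the way, to reproduce this output yourself right after running the statement, something like the minimal sketch below should do (NULL, NULL just means “the last cursor executed in this session” and the ADVANCED format includes the Outline Data section; appending '+OUTLINE' to a basic format works too):

-- make sure SET SERVEROUTPUT is OFF, so that the "last" cursor really is your query
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ADVANCED'));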

Indeed, there’s a PUSH_SUBQ hint in the outline section. Subqueries in a WHERE clause exist solely for producing data for filtering the parent query block’s rows, so PUSH_SUBQ means that we push the subquery evaluation deeper into the plan – deeper than the parent query block’s data access path, between the access path itself and the data layer, which extracts the data from the data blocks of tables and indexes. This should allow us to filter earlier, reducing the row counts at an earlier stage in the plan, so we don’t have to pass so many rows around the plan tree, hopefully saving time and resources.

So, let’s see what kind of plan we get if we disable that particular subquery pushing transformation, by changing the PUSH_SUBQ hint to NO_PUSH_SUBQ (most of the hints in Oracle 11g+ have NO_ counterparts, which are useful for experimenting and even as fixes/workarounds for optimizer problems):

SELECT /*+ MONITOR NO_PUSH_SUBQ(@"SEL$2") OPT_PARAM('cell_offload_processing', 'false') test3b */
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM
    test_objects_100m o
WHERE
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13)
/

And here’s the plan:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   406K  (1)| 00:52:55 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   FILTER                               |                   |       |       |            |          |
|   3 |    TABLE ACCESS STORAGE FULL           | TEST_OBJECTS_100M |   100M|  3147M|   406K  (1)| 00:52:55 |
|*  4 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - filter("O"."OWNER"= (SELECT /*+ NO_PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
   4 - filter("USER_ID"=13)

Now both of the tables are at the same level in the execution plan tree and it’s the parent FILTER loop operation’s task to fetch (all the) rows from its children and perform the comparison for filtering. See how the estimated row count from the larger table is 100 million now, as opposed to roughly 7 million in the previous plan. Note that these are just the optimizer’s estimates, so let’s look into the SQL Monitoring details for real figures.

The original plan with the subquery pushing transformation used 16 seconds’ worth of CPU time – and roughly all of that CPU time was spent in execution plan line #2. Opening & “parsing” data block contents and comparing rows for filtering sure takes noticeable CPU time when done on a big enough dataset. The full table scan on the TEST_OBJECTS_100M table (#2) returned 43280 rows to its parent operation (#1), after filtering with the help of the pushed subquery (#3) result value:

===============================================================================
| Elapsed |   Cpu   |    IO    | Application | Fetch | Buffer | Read  | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs  | Bytes |
===============================================================================
|      47 |      16 |       31 |        0.00 |     1 |     1M | 11682 |  11GB |
===============================================================================

SQL Plan Monitoring Details (Plan Hash Value=3981286852)
===========================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  | Activity |    Activity Detail    |
|    |                                          |                   |       | (Actual) | Reqs  | Bytes |   (%)    |      (# samples)      |
===========================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |          |                       |
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |          |                       |
|  2 |    TABLE ACCESS STORAGE FULL             | TEST_OBJECTS_100M |     1 |    43280 | 11682 |  11GB |   100.00 | Cpu (16)              |
|    |                                          |                   |       |          |       |       |          | direct path read (31) |
|  3 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |          |                       |
===========================================================================================================================================

Unfortunately the standard SQL rowsource-level metrics don’t tell us how many rows were retrieved from the table during the full table scan (the table scan rows gotten metric in V$SESSTAT would help there somewhat). Nevertheless, we happen to know that this scanned table contains 100M rows – and as I’ve disabled the Smart Scan offloading, we can assume that all these rows/blocks were scanned through.
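For example, a rough way to eyeball that metric for your own session is to snapshot it before and after running the query and compare – a minimal sketch, with the statistic name spelled as it appears in V$STATNAME:

SELECT sn.name, st.value
FROM   v$mystat st, v$statname sn
WHERE  sn.statistic# = st.statistic#
AND    sn.name = 'table scan rows gotten';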

Below are the metrics for the query with the NO_PUSH_SUBQ hint, where the FILTER operation (#2) is responsible for comparing data and filtering the rows. Note that the full table scan at line #3 returns 100 million rows to the parent FILTER operation – which throws most of them away. Still, roughly 16 seconds of CPU time were spent in the full table scan operation (#3), while the FILTER operation (#2), where the actual filtering now takes place, used only 2 seconds of CPU. This indicates that the actual data comparison / filtering code takes a small amount of CPU time compared to the cycles needed for extracting the data from their blocks (plus all kinds of checks, like calculating and checking block checksums, as we are doing physical IOs here):

===========================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  | Activity |    Activity Detail    |  
|    |                                          |                   |       | (Actual) | Reqs  | Bytes |   (%)    |      (# samples)      |
===========================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |          |                       |  
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |          |                       |  
|  2 |    FILTER                                |                   |     1 |    43280 |       |       |     4.35 | Cpu (2)               |  
|  3 |     TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |     1 |     100M | 11682 |  11GB |    95.65 | Cpu (16)              |
|    |                                          |                   |       |          |       |       |          | direct path read (28) |  
|  4 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |          |                       |  
===========================================================================================================================================

So far there’s a noticeable, but not radical difference in query runtime and CPU usage – nevertheless, the subquery pushing in the earlier example did help a little. And it sure looks better when the execution plan does not pass hundreds of millions of rows around (even if there are some optimizations to pipe data structures by reference within a process internally). Note that these measurements are not very precise for short queries, as ASH samples session state only once per second (you could rerun these tests with a 10 billion row table if you’d like to get more stable figures :)

Anyway, things get much more interesting when repeated with Exadata Smart Scan offloading enabled!

 

Runtime Difference of Scalar Subquery Filtering with Exadata Smart Scans

I’m running the same query, with both subquery pushing and smart scans enabled:

SELECT /*+ MONITOR PUSH_SUBQ(@"SEL$2") OPT_PARAM('cell_offload_processing', 'true') test4a */ 
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner)) 
FROM 
    test_objects_100m o 
WHERE 
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13) 
/

Let’s see the stats:

=========================================================================================
| Elapsed |   Cpu   |    IO    | Application | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs  | Bytes | Offload |
=========================================================================================
|    2.64 |    0.15 |     2.49 |        0.00 |     1 |     1M | 11682 |  11GB |  99.96% |
=========================================================================================

SQL Plan Monitoring Details (Plan Hash Value=3981286852)
=========================================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                          |                   |       | (Actual) | Reqs  | Bytes | Offload |   (%)    |        (# samples)        |
=========================================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |         |          |                           |
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |         |          |                           |
|  2 |    TABLE ACCESS STORAGE FULL             | TEST_OBJECTS_100M |     1 |    43280 | 11682 |  11GB |  99.96% |   100.00 | cell smart table scan (3) |
|  3 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |         |          |                           |
=========================================================================================================================================================

Wow, it’s the same table, same server, the same query, but with Smart Scans enabled it takes only 2.6 seconds to run (compared to the previous 47 seconds). The database-level CPU usage has dropped over 100x – from 16+ seconds to only 0.15 seconds! This is because the offloading kicked in, so the storage cells spent their CPU time opening blocks, checksumming them and extracting the contents – and filtering in the storage cells of course (the storage cell CPU time is not accounted for in the DB-level V$SQL views).

The Cell Offload Efficiency for line #2 in the above plan is 99.96%, which means that only 0.04% worth of bytes, compared to the scanned segment size (~11GB), was sent back by the smart scan from the storage cells. 0.04% of 11GB is about 4.5 MB. Thanks to the offloading, most of the filtering was done in the storage cells (in parallel), so the database layer did not end up spending much CPU on final processing (summing the lengths of the columns specified in the SELECT list together).
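If you don’t happen to have a SQL Monitoring report at hand, a similar offload ratio can be approximated from V$SQL – a sketch, assuming the 11.2 Exadata-related columns and with &sql_id as a placeholder for your statement:

SELECT io_cell_offload_eligible_bytes,
       io_interconnect_bytes,
       ROUND(100 * (1 - io_interconnect_bytes /
                        NULLIF(io_cell_offload_eligible_bytes, 0)), 2) AS offload_pct
FROM   v$sql
WHERE  sql_id = '&sql_id';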

See the storage predicate offloading below:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   405K  (1)| 00:52:52 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |  7692K|   242M|   405K  (1)| 00:52:52 |
|*  3 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - storage("O"."OWNER"= (SELECT /*+ PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
       filter("O"."OWNER"= (SELECT /*+ PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
   3 - storage("USER_ID"=13)
       filter("USER_ID"=13)

So, somehow the whole complex predicate got offloaded to the storage cells! The catch here is that this is a simple, scalar subquery in the WHERE clause – with an equals sign – and is not using some more complex IN / EXISTS construct. So it looks like the scalar subquery (#3) got executed first and its result value got sent to the storage cells, just like a regular constant predicate. In other words, it’s not the subquery itself that got sent to the storage cells (that’s impossible with the current architecture anyway); it’s the result of that subquery, executed in the DB layer, that got used in an offloaded predicate. In my case the USER_ID = 13 resolved to username “OUTLN” and the storage predicate on line #2 ended up something like “WHERE o.owner = ‘OUTLN’“.
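In other words, conceptually the storage cells ended up doing the same filtering work as if I had typed the literal value in myself – an illustration only, this is not what literally got parsed:

SELECT SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM   test_objects_100m o
WHERE  o.owner = 'OUTLN';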

And finally, let’s check the FILTER based (NO_PUSH_SUBQ) approach with smart scan enabled, to see what we lose if the subquery pushing doesn’t kick in:

SELECT /*+ MONITOR NO_PUSH_SUBQ(@"SEL$2") OPT_PARAM('cell_offload_processing', 'true') test4b */
    SUM(LENGTH(object_name)) + SUM(LENGTH(object_type)) + SUM(LENGTH(owner))
FROM
    test_objects_100m o
WHERE
    o.owner = (SELECT u.username FROM test_users u WHERE user_id = 13)
/

The query takes over 8 seconds of CPU time now in the database layer (despite some sampling inaccuracies in the Activity Detail that come from ASH data):

Global Stats
=========================================================================================
| Elapsed |   Cpu   |    IO    | Application | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs  | Bytes | Offload |
=========================================================================================
|      11 |    8.46 |     2.65 |        0.00 |     1 |     1M | 13679 |  11GB |  68.85% |
=========================================================================================

SQL Plan Monitoring Details (Plan Hash Value=3231668261)
=========================================================================================================================================================
| Id |                Operation                 |       Name        | Execs |   Rows   | Read  | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                          |                   |       | (Actual) | Reqs  | Bytes | Offload |   (%)    |        (# samples)        |
=========================================================================================================================================================
|  0 | SELECT STATEMENT                         |                   |     1 |        1 |       |       |         |          |                           |
|  1 |   SORT AGGREGATE                         |                   |     1 |        1 |       |       |         |          |                           |
|  2 |    FILTER                                |                   |     1 |    43280 |       |       |         |          |                           |
|  3 |     TABLE ACCESS STORAGE FULL            | TEST_OBJECTS_100M |     1 |     100M | 13679 |  11GB |  68.85% |   100.00 | Cpu (4)                   |
|    |                                          |                   |       |          |       |       |         |          | cell smart table scan (7) |
|  4 |     TABLE ACCESS STORAGE FULL FIRST ROWS | TEST_USERS        |     1 |        1 |       |       |         |          |                           |
=========================================================================================================================================================

See how the Cell Offload % has dropped too – as the Cell Offload Efficiency for line #3 is 68.85%, the smart scan returned about 11 GB * 31.15% = 3.4 GB of data this time! The difference is entirely because now we must send back (the requested columns of) all rows in the table! This is also visible from the Actual Rows column above: the full table scan at line #3 sends 100M rows to its parent (FILTER), which then throws most of them away. It’s actually surprising that we don’t see any CPU activity samples for the FILTER (as comparing and filtering 100M rows does take some CPU), but that’s probably again an ASH sampling luck issue. Run the query with a 10-100x bigger data set and you should definitely see such ASH samples caught.

When we look into the predicate section of this inferior execution plan, we see no storage() predicate on the biggest table – actually there’s no predicate whatsoever on line #3; the filtering happens later, in the parent FILTER step:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |     1 |    33 |   406K  (1)| 00:52:55 |
|   1 |  SORT AGGREGATE                        |                   |     1 |    33 |            |          |
|*  2 |   FILTER                               |                   |       |       |            |          |
|   3 |    TABLE ACCESS STORAGE FULL           | TEST_OBJECTS_100M |   100M|  3147M|   406K  (1)| 00:52:55 |
|*  4 |    TABLE ACCESS STORAGE FULL FIRST ROWS| TEST_USERS        |     1 |    12 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

   2 - filter("O"."OWNER"= (SELECT /*+ NO_PUSH_SUBQ */ "U"."USERNAME" FROM "TEST_USERS" "U" WHERE
              "USER_ID"=13))
   4 - storage("USER_ID"=13)
       filter("USER_ID"=13)

So, in conclusion – scalar subqueries in WHERE clauses (with subquery pushing) do provide a little benefit (reduced CPU usage) in all Oracle databases, but on Exadata they may have a much bigger positive impact. Just make sure you do see the storage() predicate on the relevant plan lines scanning the big tables. And always keep in mind that the existence of a storage() predicate doesn’t automatically mean that a smart scan did kick in for your query execution – always check the Exadata-specific execution metrics when running your query.
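A quick sanity check can be done from your own session’s statistics – a sketch below, with the statistic names spelled as they appear in V$STATNAME on Exadata (compare the values before and after running your query):

SELECT sn.name, st.value
FROM   v$mystat st, v$statname sn
WHERE  sn.statistic# = st.statistic#
AND    sn.name IN ('cell physical IO bytes eligible for predicate offload',
                   'cell physical IO interconnect bytes returned by smart scan');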

Oracle Performance & Troubleshooting Online Seminars in 2013


In case you haven’t noticed, I will be delivering my Advanced Oracle Troubleshooting and Advanced Oracle Exadata Performance: Troubleshooting and Optimization classes again in Oct/Nov 2013 (AOT) and December 2013 (Exadata).

I have stretched the Exadata class to five half-days, as four half-days weren’t nearly enough to deliver the amount of detail in the material (and I think it’s still going to be a pretty intensive pace).

And that’s all for this year (I will write about conferences and other public appearances in a separate post).

 

I will be speaking at Oracle OpenWorld and Strata + HadoopWorld NY


I will be speaking at a few more conferences this year and thought to add some comments about my plans here too. Here’s the list of my upcoming presentations:

Oracle OpenWorld, 22-26 September 2013, San Francisco

  • Session: Moving Data Between Oracle Exadata and Hadoop. Fast.
  • When: Wednesday, Sep 25. 3:30pm
  • Where: Moscone South 305
  • What: I have been doing quite a lot of work on optimal Oracle/Exadata <-> Hadoop connectivity and data migration lately, so I thought it’s worth sharing. The fat InfiniBand pipe between an Exadata box and a Hadoop cluster running on the Oracle Big Data Appliance gives pretty interesting results.

OakTableWorld, 23-24 September 2013, San Francisco

  • Session: Hacking Oracle Database
  • When: Tuesday, Sep 24. 1:00pm
  • Where: Imagination Lab @ Creativity Museum near Moscone
  • What: The “secret” OakTableWorld event (once called Oracle Closed World :) is an unofficial, fun (but also serious) satellite event during OOW. Lots of great technical speakers and topics. Might also have free beer :) I will deliver yet another hacking session without much structure – I’ll just show how I research and explore Oracle’s new low-level features and which tools & approaches I use, etc. Should be fun.

Strata Conference + HadoopWorld NY, 28-30 October 2013, New York City, NY

I have also submitted abstracts to RMOUG Training Days 2014 and will deliver the Training Day at Hotsos Symposium 2014!

So see you at any of these conferences!

Advanced Oracle Troubleshooting Guide – Part 11: Complex Wait Chain Signature Analysis with ash_wait_chains.sql


Here’s a treat for the hard-core Oracle performance geeks out there – I’m releasing a cool, but still experimental script for ASH (or poor-man’s ASH) based wait event analysis, which should add a whole new dimension to ASH-based performance analysis. It doesn’t replace any of the existing ASH analysis techniques, but it should bring the relationships between Oracle sessions in complex wait chains out into bright daylight much more easily than before.

You all are familiar with the AWR/Statspack timed event summary below:

AWR top timed events
A similar breakdown can be obtained by just aggregating ASH samples by the wait event:

SQL> @ash/dashtop session_state,event 1=1 "TIMESTAMP'2013-09-09 21:00:00'" "TIMESTAMP'2013-09-09 22:00:00'"

%This  SESSION EVENT                                                            TotalSeconds        CPU   User I/O Application Concurrency     Commit Configuration    Cluster       Idle    Network System I/O  Scheduler Administrative   Queueing      Other MIN(SAMPLE_TIME)                                                            MAX(SAMPLE_TIME)
------ ------- ---------------------------------------------------------------- ------------ ---------- ---------- ----------- ----------- ---------- ------------- ---------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
  68%  ON CPU                                                                          25610      25610          0           0           0          0             0          0          0          0          0          0              0          0          0 09-SEP-13 09.00.01.468 PM                                                   09-SEP-13 09.59.58.059 PM
  14%  WAITING SQL*Net more data from client                                            5380          0          0           0           0          0             0          0          0       5380          0          0              0          0          0 09-SEP-13 09.00.01.468 PM                                                   09-SEP-13 09.59.58.059 PM
   6%  WAITING enq: HW - contention                                                     2260          0          0           0           0          0          2260          0          0          0          0          0              0          0          0 09-SEP-13 09.04.41.893 PM                                                   09-SEP-13 09.56.07.626 PM
   3%  WAITING log file parallel write                                                  1090          0          0           0           0          0             0          0          0          0       1090          0              0          0          0 09-SEP-13 09.00.11.478 PM                                                   09-SEP-13 09.59.58.059 PM
   2%  WAITING db file parallel write                                                    730          0          0           0           0          0             0          0          0          0        730          0              0          0          0 09-SEP-13 09.01.11.568 PM                                                   09-SEP-13 09.59.48.049 PM
   2%  WAITING enq: TX - contention                                                      600          0          0           0           0          0             0          0          0          0          0          0              0          0        600 09-SEP-13 09.04.41.893 PM                                                   09-SEP-13 09.48.16.695 PM
   1%  WAITING buffer busy waits                                                         560          0          0           0         560          0             0          0          0          0          0          0              0          0          0 09-SEP-13 09.10.02.492 PM                                                   09-SEP-13 09.56.07.626 PM
   1%  WAITING log file switch completion                                                420          0          0           0           0          0           420          0          0          0          0          0              0          0          0 09-SEP-13 09.47.16.562 PM                                                   09-SEP-13 09.47.16.562 PM
   1%  WAITING latch: redo allocation                                                    330          0          0           0           0          0             0          0          0          0          0          0              0          0        330 09-SEP-13 09.04.41.893 PM                                                   09-SEP-13 09.53.27.307 PM
...

The above output has one shortcoming in a multiuser (database) system – not all wait events are simple ones, where a session waits for the OS to complete some self-contained operation (like an IO request). Often a session waits for another session (which holds some lock) or for some background process that needs to complete some task before our session can continue. That other session may be waiting for yet another session due to some other lock, which in turn waits for yet another one, thanks to some buffer pin (buffer busy wait). The session holding the buffer pin may itself be waiting for LGWR, who may in turn wait for DBWR etc… You get the point – sometimes we have a bunch of sessions waiting for each other in a chain.

The V$WAIT_CHAINS view introduced in Oracle 11g is capable of showing such chains of waiting sessions – however, it is designed to diagnose relatively long-lasting hangs, not performance problems and short (but non-trivial) contention. Usually the DIAG process, which is responsible for walking through the chains and populating V$WAIT_CHAINS, doesn’t kick in after just a few seconds of ongoing session waits, so the V$WAIT_CHAINS view may be mostly empty – we need something different for performance analysis.
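For completeness, you can still check what DIAG has captured – a sketch below; for short-lived contention it will typically return little or nothing, which is exactly the limitation described above:

SELECT chain_signature, COUNT(*) AS sessions_in_chain
FROM   v$wait_chains
GROUP  BY chain_signature
ORDER  BY sessions_in_chain DESC;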

Now, it is possible to pick one of the waiting sessions and use the blocking_session column to look up the blocking SID – and see what that session was doing, and so on. I used to do this somewhat manually, but realized that a simple CONNECT BY loop with some ASH sample iteration trickery can easily give us complex wait chain signature & hierarchy information.
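The core idea is roughly the following – a heavily simplified, hypothetical sketch (not the actual ash_wait_chains.sql code), walking from each waiting session’s ASH sample to its blocker’s row within the same sample and aggregating the resulting chain strings:

SELECT COUNT(*) AS seconds, wait_chain           -- one V$ ASH sample ~ 1 second
FROM (
    SELECT SYS_CONNECT_BY_PATH(NVL(event, 'ON CPU'), ' -> ') AS wait_chain,
           CONNECT_BY_ISLEAF AS isleaf
    FROM   v$active_session_history
    START WITH 1 = 1                             -- the "whose activity" filter goes here
    CONNECT BY NOCYCLE
               PRIOR blocking_session         = session_id
           AND PRIOR blocking_session_serial# = session_serial#
           AND PRIOR sample_id                = sample_id
)
WHERE  isleaf = 1
GROUP  BY wait_chain
ORDER  BY seconds DESC;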

So, I am presenting to you my new ash_wait_chains.sql script! (Woo-hoo!) It’s completely experimental right now – I’m not fully sure whether it returns correct results, and some of the planned syntax isn’t implemented yet :) EXPERIMENTAL, not even beta :) And no RAC or DBA_HIST ASH support yet.

In the example below, I am running the script with the parameter event2, which is really just a pseudocolumn on the ASH view (look in the script). So it shows just the wait events of the sessions involved in a chain. You can use any ASH column there for extra information, like program, module, sql_opname etc:

SQL> @ash/ash_wait_chains event2 1=1 sysdate-1 sysdate

-- ASH Wait Chain Signature script v0.1 EXPERIMENTAL by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS WAIT_CHAIN
------ ---------- ------------------------------------------------------------------------------------------
  60%       77995 -> ON CPU
  10%       12642 -> cell single block physical read
   9%       11515 -> SQL*Net more data from client
   2%        3081 -> log file parallel write
   2%        3073 -> enq: HW - contention  -> cell smart file creation
   2%        2723 -> enq: HW - contention  -> buffer busy waits  -> cell smart file creation
   2%        2098 -> cell smart table scan
   1%        1817 -> db file parallel write
   1%        1375 -> latch: redo allocation  -> ON CPU
   1%        1023 -> enq: TX - contention  -> buffer busy waits  -> cell smart file creation
   1%         868 -> block change tracking buffer space
   1%         780 -> enq: TX - contention  -> buffer busy waits  -> ASM file metadata operation
   1%         773 -> latch: redo allocation
   1%         698 -> enq: TX - contention  -> buffer busy waits  -> DFS lock handle
   0%         529 -> enq: TX - contention  -> buffer busy waits  -> control file parallel write
   0%         418 -> enq: HW - contention  -> buffer busy waits  -> DFS lock handle

What does the above output tell us? I have highlighted 3 rows above – let’s say that we want to get more insight into the enq: HW – contention wait event. The ash_wait_chains script breaks down the wait events by the “complex” wait chain signature (who’s waiting for whom), instead of just the single wait event name! You read the wait chain information from left to right. For example, 2% of the total response time (3073 seconds) in the analysed ASH dataset was spent by sessions waiting for an HW enqueue, which was held by another session, which itself happened to be waiting for the cell smart file creation wait event. The rightmost wait event is the “ultimate blocker”. Unfortunately ASH doesn’t capture idle sessions by default (an idle session may still be holding a lock and blocking others), so in the current version (0.1) of this script the idle blockers are missing from the output. You can still manually detect a missing / idle session, as we’ll see later on.

So, this script allows you to break down the ASH wait data not just by a single, “scalar” wait event, but by all the wait events (and other attributes) of a whole chain of waiting sessions – revealing hierarchy and dependency information about your bottleneck! Especially when looking into chains involving various background processes, we gain interesting insight into the process flow in an Oracle instance.

As you can add any ASH column to the output (check the script, it’s not too long), let’s add the program info too, so we know what kind of sessions were involved in the complex waits. I have removed a bunch of less interesting lines from the output and highlighted some interesting ones (again, read the chain of waiters from left to right):

SQL> @ash/ash_wait_chains program2||event2 1=1 sysdate-1 sysdate

-- ASH Wait Chain Signature script v0.1 EXPERIMENTAL by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS WAIT_CHAIN
------ ---------- -----------------------------------------------------------------------------------------------------------------------------------------------
  56%       73427 -> (JDBC Thin Client) ON CPU
   9%       11513 -> (JDBC Thin Client) SQL*Net more data from client
   9%       11402 -> (oracle@enkxndbnn.enkitec.com (Jnnn)) cell single block physical read
   2%        3081 -> (LGWR) log file parallel write
   2%        3073 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) cell smart file creation
   2%        2300 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (JDBC Thin Client) cell smart file creation
...
   1%        1356 -> (JDBC Thin Client) latch: redo allocation  -> (JDBC Thin Client) ON CPU
   1%        1199 -> (JDBC Thin Client) cell single block physical read
   1%        1023 -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) cell smart file creation
   1%         881 -> (CTWR) ON CPU
   1%         858 -> (JDBC Thin Client) block change tracking buffer space
   1%         780 -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) ASM file metadata operation
   1%         766 -> (JDBC Thin Client) latch: redo allocation
   1%         698 -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) DFS lock handle
   0%         529 -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) control file parallel write
   0%         423 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) cell smart file creation
   0%         418 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) ASM file metadata operation
   0%         418 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) DFS lock handle
...
   0%          25 -> (JDBC Thin Client) gcs drm freeze in enter server mode  -> (LMON) ges lms sync during dynamic remastering and reconfig  -> (LMSn) ON CPU
   0%          25 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) DFS lock handle  -> (DBWn) db file parallel write
...
   0%          18 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) CSS operation: action
   0%          18 -> (LMON) control file sequential read
   0%          17 -> (LMSn) ON CPU
   0%          17 -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) cell smart file creation
   0%          16 -> (JDBC Thin Client) enq: FB - contention  -> (JDBC Thin Client) gcs drm freeze in enter server mode  -> (LMON) ges lms sync during dynamic remastering and reconfig  -> (LMSn) ON CPU
...
   0%           3 -> (JDBC Thin Client) enq: HW - contention  -> (JDBC Thin Client) buffer busy waits  -> (JDBC Thin Client) CSS operation: action
   0%           3 -> (JDBC Thin Client) buffer busy waits  -> (JDBC Thin Client) latch: redo allocation  -> (JDBC Thin Client) ON CPU
...
   0%           1 -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) enq: TX - contention  -> (JDBC Thin Client) buffer busy waits  -> (Wnnn) ASM file metadata operation
   0%           1 -> (Wnnn) buffer busy waits  -> (JDBC Thin Client) block change tracking buffer space
...

In a different example, when adding the sql_opname column (11.2+), we can also display the type of SQL command that ended up waiting (like UPDATE, DELETE, LOCK etc), which is useful for lock contention analysis:

SQL> @ash/ash_wait_chains sql_opname||':'||event2 1=1 sysdate-1/24/60 sysdate

-- Display ASH Wait Chain Signatures script v0.1 EXPERIMENTAL by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS WAIT_CHAIN
------ ---------- ------------------------------------------------------------------------------------------------------------------------------------------------------
  18%          45 -> LOCK TABLE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention
  18%          45 -> LOCK TABLE:enq: TM - contention
  17%          42 -> DELETE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention
  10%          25 -> SELECT:direct path read
   6%          14 -> TRUNCATE TABLE:enq: TM - contention
   5%          13 -> SELECT:ON CPU
   2%           5 -> LOCK TABLE:enq: TM - contention  -> SELECT:ON CPU
   2%           5 -> TRUNCATE TABLE:enq: TM - contention  -> SELECT:ON CPU
   2%           5 -> SELECT:db file scattered read
   2%           5 -> DELETE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:db file scattered read
   2%           5 -> DELETE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:ON CPU
   2%           5 -> LOCK TABLE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:db file scattered read
   2%           5 -> TRUNCATE TABLE:enq: TM - contention  -> SELECT:db file scattered read
   2%           5 -> LOCK TABLE:enq: TM - contention  -> SELECT:db file scattered read
   2%           5 -> LOCK TABLE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:ON CPU
   2%           4 -> TRUNCATE TABLE:enq: TM - contention  -> SELECT:direct path read
   2%           4 -> LOCK TABLE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:direct path read
   2%           4 -> DELETE:enq: TM - contention  -> LOCK TABLE:enq: TM - contention  -> SELECT:direct path read
   2%           4 -> LOCK TABLE:enq: TM - contention  -> SELECT:direct path read

This is an experimental script and has some issues and shortcomings right now. For example, if the final blocking session itself was idle during the sampling, like the blue highlighted line above, then the blocking session is not shown there – as ASH doesn’t capture idle session samples. Also it doesn’t work for RAC right now (and this might be problematic as in RAC the blocking_session info for global wait events may not immediately be resolved either, unlike in the instance-local cases). And I will try to find better example cases illustrating the use of this technique (you’re welcome to send me your output from busy databases for analysis :)

Note that I will be doing my Advanced Oracle Troubleshooting class again in Oct/Nov 2013, so now you know where to learn more cool (and even more useful) Oracle troubleshooting stuff ;-)

Why doesn’t ALTER SYSTEM SET EVENTS set the events or tracing immediately?


I received a question about ALTER SYSTEM in the comments section of another blog post recently.

Basically the question was that while ALTER SESSION SET EVENTS '10046 … ' enabled the SQL Trace for the current session immediately, ALTER SYSTEM on the other hand didn’t seem to do anything at all for other sessions in the instance.

There’s an important difference in the behavior of ALTER SYSTEM when changing parameters vs. setting events.

For example, ALTER SYSTEM SET optimizer_mode = CHOOSE would change the value of this parameter immediately, for:

  1. Your own session
  2. All new sessions that will log in will pick up the new parameter value
  3. All other existing sessions

However, when you issue an ALTER SYSTEM SET EVENTS '10046 TRACE NAME CONTEXT FOREVER, LEVEL 12', only the event changes in #1 and #2 will happen:

  1. Your own session
  2. All new sessions that will log in will pick up the new event settings

This means that the existing, already logged in sessions, will not pick up any of the events set via ALTER SYSTEM!

This hopefully explains why sometimes the debug events don’t seem to work. But more importantly, this also means that when you disable an event (by setting it to “OFF” or to level 0) with ALTER SYSTEM, it does not affect the existing sessions who have this event enabled! So, you think you’re turning the tracing off for all sessions and go home, but really some sessions keep on tracing – until the filesystem is full (and you’ll get a phone call at 3am).

So, to be safe, you should use DBMS_MONITOR for your SQL Tracing needs – it doesn’t have the abovementioned problems. For other events you should use DBMS_SYSTEM.SET_EV/READ_EV (or ORADEBUG EVENT/SESSION_EVENT & EVENTS/EVENTDUMP) together with ALTER SYSTEM to make sure you actually do enable/disable the events for all existing sessions too. Or better yet, stay away from undocumented events ;-)
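For example, here’s a minimal sketch of the DBMS_MONITOR approach for tracing one existing session – the SID and serial# below are just placeholders, look up the real ones in V$SESSION first:

BEGIN
    -- enable SQL trace with wait events and bind values for one session
    DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 1234, serial_num => 56789,
                                      waits => TRUE, binds => TRUE);
END;
/

-- ...and later, to reliably switch it off again for that session:

BEGIN
    DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 1234, serial_num => 56789);
END;
/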

If you wonder what/where the “system event array” is, it’s just a memory location in the shared pool. It doesn’t seem to be explicitly visible in V$SGASTAT in Oracle 10g, but in 11.2.0.3 you get this:

No system-wide events set:

SQL> @sgastat event

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  DBWR event stats array            216
shared pool  KSQ event description            8460
shared pool  Wait event pointers               192
shared pool  dbgdInitEventGrp: eventGr         136
shared pool  event classes                    1552
shared pool  event descriptor table          32360
shared pool  event list array to post           36
shared pool  event list to post commit         108
shared pool  event statistics per sess     2840096
shared pool  event statistics ptr arra         992
shared pool  event-class map                  4608
shared pool  ksws service events             57260
shared pool  latch wait-event table           2212
shared pool  standby event stats              1216
shared pool  sys event stats                539136
shared pool  sys event stats for Other       32256
shared pool  trace events array              72000

17 rows selected.
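(The @sgastat script used above is essentially just a filter on V$SGASTAT – a rough equivalent, in case you don’t have the script at hand:)

SELECT pool, name, bytes
FROM   v$sgastat
WHERE  LOWER(name) LIKE '%event%'
ORDER  BY pool, name;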

Let’s set a system-wide event:

SQL> ALTER SYSTEM SET events = '942 TRACE NAME ERRORSTACK LEVEL 3'; 

System altered.

And check V$SGASTAT again:

SQL> @sgastat event

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  DBWR event stats array            216
shared pool  KSQ event description            8460
shared pool  Wait event pointers               192
shared pool  dbgdInitEventG                   4740
shared pool  dbgdInitEventGrp: eventGr         340
shared pool  dbgdInitEventGrp: subHeap          80
shared pool  event classes                    1552
shared pool  event descriptor table          32360
shared pool  event list array to post           36
shared pool  event list to post commit         108
shared pool  event statistics per sess     2840096
shared pool  event statistics ptr arra         992
shared pool  event-class map                  4608
shared pool  ksws service events             57260
shared pool  latch wait-event table           2212
shared pool  standby event stats              1216
shared pool  sys event stats                539136
shared pool  sys event stats for Other       32256
shared pool  trace events array              72000

19 rows selected.

So, the “system event array” lives in the shared pool, as a few memory allocations with names like “dbgdInitEventG%”. Note that this naming was different in 10g, as the dbgd module showed up in Oracle 11g, when Oracle re-engineered the whole diagnostics event infrastructure, making it much more powerful – for example, allowing you to enable dumps and traces only for a specific SQL_ID.
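For example, with the 11g+ diagnostic event syntax you can scope SQL tracing to a single statement – a sketch below, with the SQL ID being just a placeholder:

ALTER SYSTEM SET EVENTS 'sql_trace[SQL: 3rtbs9vqukc71]';

-- ...and to switch it off again:

ALTER SYSTEM SET EVENTS 'sql_trace[SQL: 3rtbs9vqukc71] off';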

Ok, that’s enough for today – I’ll just remind you that this year’s last Advanced Oracle Troubleshooting online seminar starts in 2 weeks (and I will be talking more about event setting there ;-)

Enkitec Rocks Again!!!


Yes, the title of this post is also a reference to the Machete Kills Again (…in Space) movie that I’m sure going to check out once it’s out ;-)

Enkitec won the UKOUG Engineered Systems Partner of the Year award last Thursday and I’m really proud of that. We got seriously started in Europe a bit over 2 years ago, have put in a lot of effort and by now it’s paying off. In addition to happy customers, interesting projects and now this award, we have awesome people in 4 countries in Europe already – and more to come. Also, you did know that Frits Hoogland and Martin Bach have joined us in Europe, right? I actually learn stuff from these guys! :-) Expect more awesome announcements soon! ;-)

Thank you, UKOUG team!

Also, if you didn’t follow all the Oracle OpenWorld action, we also got two major awards at OOW this year! (last year “only” one ;-)). We received two 2013 Oracle Excellence Awards for Specialized Partner of the Year (North America) for our work in both the Financial Services and Energy/Utilities industries.

Thanks, Oracle folks! :)

Check out this Oracle Partner Network’s video interview with Martin Paynter where he explains where we come from and how we get things done:

Oh, by the way, Enkitec is hiring ;-)

SGA bigger than the amount of HugePages configured (Linux – 11.2.0.3)


I just learned something new yesterday when demoing large page use on Linux during my AOT seminar.

I had 512 x 2MB hugepages configured in Linux (1024 MB). So I set USE_LARGE_PAGES = TRUE (it actually is the default anyway in 11.2.0.2+). This allows the use of large pages (it doesn’t force it – the ONLY option would force the use of hugepages, otherwise the instance wouldn’t start up). Anyway, the previous behavior with hugepages was that if Oracle was not able to allocate the entire SGA from the hugepages area, it would silently allocate the entire SGA from small pages. It was all or nothing. But to my surprise, when I set my SGA_MAX_SIZE bigger than the amount of allocated hugepages in my testing, the instance started up and the hugepages got allocated too!

It’s just that the remaining part was allocated as small pages, as mentioned in the alert log entry below (and in the latest documentation too – see the link above):

Thu Oct 24 20:58:47 2013
ALTER SYSTEM SET sga_max_size='1200M' SCOPE=SPFILE;
Thu Oct 24 20:58:54 2013
Shutting down instance (abort)
License high water mark = 19
USER (ospid: 18166): terminating the instance
Instance terminated by USER, pid = 18166
Thu Oct 24 20:58:55 2013
Instance shutdown complete
Thu Oct 24 20:59:52 2013
Adjusting the default value of parameter parallel_max_servers
from 160 to 135 due to the value of parameter processes (150)
Starting ORACLE instance (normal)
****************** Large Pages Information *****************

Total Shared Global Region in Large Pages = 1024 MB (85%)

Large Pages used by this instance: 512 (1024 MB)
Large Pages unused system wide = 0 (0 KB) (alloc incr 16 MB)
Large Pages configured system wide = 512 (1024 MB)
Large Page size = 2048 KB

RECOMMENDATION:
  Total Shared Global Region size is 1202 MB. For optimal performance,
  prior to the next instance restart increase the number
  of unused Large Pages by atleast 89 2048 KB Large Pages (178 MB)
  system wide to get 100% of the Shared
  Global Region allocated with Large pages
***********************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0

The ipcs -m command confirmed this (multiple separate shared memory segments had been created).

Note that despite what the documentation says, there’s a 4th option for the USE_LARGE_PAGES parameter called AUTO (in 11.2.0.3+, I think), which can ask the OS to increase the number of hugepages when the instance starts up – but I would always try to pre-allocate the right number of hugepages from the start (ideally right after the OS reboot – via sysctl.conf), to reduce any potential kernel CPU usage spikes due to the search and defragmentation effort of “building” large consecutive pages.

Martin Bach has written about the AUTO option here.


Diagnosing buffer busy waits with the ash_wait_chains.sql script (v0.2)


In my previous post ( Advanced Oracle Troubleshooting Guide – Part 11: Complex Wait Chain Signature Analysis with ash_wait_chains.sql ) I introduced an experimental script for analysing performance from a “top ASH wait chains” perspective. The early version (0.1) of the script didn’t have the ability to select a specific user’s (session’s), SQL’s or module/action’s performance data for analysis. I just hadn’t figured out how to write the SQL for this (as the blocking sessions could come from any user). It turns out it was just a matter of using “START WITH <condition>” in the connect by loop. Now parameter 2 of this script (whose activity to measure) actually works.

Download the latest version of:

  1. ash_wait_chains.sql (ASH wait chain analysis with the V$ACTIVE_SESSION_HISTORY data)
  2. dash_wait_chains.sql (ASH wait chain analysis with the DBA_HIST_ACTIVE_SESS_HISTORY data)

So, now parameter #2 is actually used. For example, the username='TANEL' syntax below means that we will list only Tanel’s sessions as top-level waiters in the report – but of course, as Tanel’s sessions may be blocked by any other user’s session, this where clause doesn’t restrict displaying any of the blockers, regardless of their username:

@ash_wait_chains program2||event2 username='TANEL' sysdate-1/24 sysdate

I have added one more improvement, which you’ll see in a moment. So here’s a problem case. I was performance testing parallel loading of a 1TB “table” from Hadoop to Oracle (on Exadata). I was using external tables (with the Hadoop SQL Connector) and here’s the SQL Monitor report’s activity chart:

data_load_buffer_busy_waits

A large part of the response time was spent waiting for buffer busy waits! So, normally the next step here would be to check:

  1. Check the type (and location) of the block involved in the contention – and also whether there’s a single very “hot” block involved or many different “warm” blocks that add up. Note that I didn’t say “block causing contention” here, as a block is just a data structure – it doesn’t cause contention; it’s the sessions that lock this block that do.
  2. Who’s the session holding this lock (pin) on the buffer – and is there a single blocking session causing all this or many different sessions that add up to the problem.
  3. What are the blocking sessions themselves doing (e.g. are they stuck waiting for something else themselves?)

Let’s use the “traditional” approach first. As I know the SQL ID and this SQL’s runtime (from the SQL Monitoring report, for example), I can query the ASH records with my ashtop.sql script (warning, wide output):

SQL> @ash/ashtop session_state,event sql_id='3rtbs9vqukc71' "timestamp'2013-10-05 01:00:00'" "timestamp'2013-10-05 03:00:00'"

%This  SESSION EVENT                                                            TotalSeconds        CPU   User I/O Application Concurrency     Commit Configuration    Cluster       Idle    Network System I/O  Scheduler Administrative   Queueing      Other MIN(SAMPLE_TIME)                                                            MAX(SAMPLE_TIME)
------ ------- ---------------------------------------------------------------- ------------ ---------- ---------- ----------- ----------- ---------- ------------- ---------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
  57%  WAITING buffer busy waits                                                       71962          0          0           0       71962          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.09.923 AM                                                   05-OCT-13 02.45.54.106 AM
  35%  ON CPU                                                                          43735      43735          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.34.55.903 AM                                                   05-OCT-13 02.47.28.232 AM
   6%  WAITING direct path write                                                        6959          0       6959           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.07.923 AM                                                   05-OCT-13 02.47.21.232 AM
   1%  WAITING external table read                                                      1756          0       1756           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.02.913 AM                                                   05-OCT-13 02.47.15.222 AM
   0%  WAITING local write wait                                                          350          0        350           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 02.02.40.034 AM                                                   05-OCT-13 02.46.59.202 AM
   0%  WAITING control file parallel write                                               231          0          0           0           0          0             0          0          0          0        231          0              0          0          0 05-OCT-13 01.35.22.953 AM                                                   05-OCT-13 02.47.15.222 AM
   0%  WAITING cell smart file creation                                                  228          0        228           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.09.923 AM                                                   05-OCT-13 02.47.26.232 AM
   0%  WAITING DFS lock handle                                                           194          0          0           0           0          0             0          0          0          0          0          0              0          0        194 05-OCT-13 01.35.15.933 AM                                                   05-OCT-13 02.47.14.222 AM
   0%  WAITING cell single block physical read                                           146          0        146           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.12.933 AM                                                   05-OCT-13 02.47.09.212 AM
   0%  WAITING control file sequential read                                               63          0          0           0           0          0             0          0          0          0         63          0              0          0          0 05-OCT-13 01.35.17.953 AM                                                   05-OCT-13 02.46.56.192 AM
   0%  WAITING change tracking file synchronous read                                      57          0          0           0           0          0             0          0          0          0          0          0              0          0         57 05-OCT-13 01.35.26.963 AM                                                   05-OCT-13 02.40.32.677 AM
   0%  WAITING db file single write                                                       48          0         48           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.38.21.317 AM                                                   05-OCT-13 02.41.55.794 AM
   0%  WAITING gc current grant 2-way                                                     19          0          0           0           0          0             0         19          0          0          0          0              0          0          0 05-OCT-13 01.35.06.923 AM                                                   05-OCT-13 02.45.46.096 AM
   0%  WAITING kfk: async disk IO                                                         13          0          0           0           0          0             0          0          0          0         13          0              0          0          0 05-OCT-13 01.42.34.791 AM                                                   05-OCT-13 02.38.19.485 AM
   0%  WAITING resmgr:cpu quantum                                                          9          0          0           0           0          0             0          0          0          0          0          9              0          0          0 05-OCT-13 01.36.09.085 AM                                                   05-OCT-13 01.59.08.635 AM
   0%  WAITING enq: CR - block range reuse ckpt                                            7          0          0           0           0          0             0          0          0          0          0          0              0          0          7 05-OCT-13 02.12.42.069 AM                                                   05-OCT-13 02.40.46.687 AM
   0%  WAITING latch: redo allocation                                                      3          0          0           0           0          0             0          0          0          0          0          0              0          0          3 05-OCT-13 02.10.01.807 AM                                                   05-OCT-13 02.10.01.807 AM
   0%  WAITING Disk file operations I/O                                                    2          0          2           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.41.13.639 AM                                                   05-OCT-13 01.43.50.951 AM
   0%  WAITING enq: XL - fault extent map                                                  2          0          0           0           0          0             0          0          0          0          0          0              0          0          2 05-OCT-13 01.35.34.983 AM                                                   05-OCT-13 01.35.34.983 AM
   0%  WAITING external table open                                                         2          0          2           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.02.913 AM                                                   05-OCT-13 01.35.02.913 AM

57% of the DB time was spent waiting for buffer busy waits. So let’s check the P2/P3 to see which block# (in a datafile) and block class# we are dealing with:

SQL> @ash/ashtop session_state,event,p2text,p2,p3text,p3 sql_id='3rtbs9vqukc71' "timestamp'2013-10-05 01:00:00'" "timestamp'2013-10-05 03:00:00'"

%This  SESSION EVENT                                                            P2TEXT                                 P2 P3TEXT                                 P3 TotalSeconds        CPU   User I/O Application Concurrency     Commit Configuration    Cluster       Idle    Network System I/O  Scheduler Administrative   Queueing      Other MIN(SAMPLE_TIME)                                                            MAX(SAMPLE_TIME)
------ ------- ---------------------------------------------------------------- ------------------------------ ---------- ------------------------------ ---------- ------------ ---------- ---------- ----------- ----------- ---------- ------------- ---------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
  57%  WAITING buffer busy waits                                                block#                                  2 class#                                 13        71962          0          0           0       71962          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.09.923 AM                                                   05-OCT-13 02.45.54.106 AM
  31%  ON CPU                                                                   file#                                   0 size                               524288        38495      38495          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.05.923 AM                                                   05-OCT-13 02.47.25.232 AM
   1%  WAITING external table read                                              file#                                   0 size                               524288         1756          0       1756           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.02.913 AM                                                   05-OCT-13 02.47.15.222 AM
   1%  ON CPU                                                                   block#                                  2 class#                                 13          945        945          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.16.943 AM                                                   05-OCT-13 02.45.10.056 AM
   0%  ON CPU                                                                   consumer group id                   12573                                         0          353        353          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.34.56.903 AM                                                   05-OCT-13 01.59.59.739 AM
   0%  WAITING cell smart file creation                                                                                 0                                         0          228          0        228           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.09.923 AM                                                   05-OCT-13 02.47.26.232 AM
   0%  WAITING DFS lock handle                                                  id1                                     3 id2                                     2          193          0          0           0           0          0             0          0          0          0          0          0              0          0        193 05-OCT-13 01.35.15.933 AM                                                   05-OCT-13 02.47.14.222 AM
   0%  ON CPU                                                                   file#                                  41 size                                   41          118        118          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.34.56.903 AM                                                   05-OCT-13 01.35.02.913 AM
   0%  WAITING cell single block physical read                                  diskhash#                      4004695794 bytes                                8192           85          0         85           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.12.933 AM                                                   05-OCT-13 02.47.09.212 AM
   0%  WAITING control file parallel write                                      block#                                  1 requests                                2           81          0          0           0           0          0             0          0          0          0         81          0              0          0          0 05-OCT-13 01.35.22.953 AM                                                   05-OCT-13 02.41.54.794 AM
   0%  WAITING control file parallel write                                      block#                                 41 requests                                2           74          0          0           0           0          0             0          0          0          0         74          0              0          0          0 05-OCT-13 01.35.31.983 AM                                                   05-OCT-13 02.47.15.222 AM
   0%  WAITING change tracking file synchronous read                            blocks                                  1                                         0           57          0          0           0           0          0             0          0          0          0          0          0              0          0         57 05-OCT-13 01.35.26.963 AM                                                   05-OCT-13 02.40.32.677 AM
   0%  WAITING control file parallel write                                      block#                                 42 requests                                2           51          0          0           0           0          0             0          0          0          0         51          0              0          0          0 05-OCT-13 01.35.23.953 AM                                                   05-OCT-13 02.47.10.212 AM
   0%  WAITING db file single write                                             block#                                  1 blocks                                  1           48          0         48           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.38.21.317 AM                                                   05-OCT-13 02.41.55.794 AM
   0%  ON CPU                                                                                                           0                                         0           31         31          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.19.953 AM                                                   05-OCT-13 02.44.32.006 AM
   0%  WAITING control file parallel write                                      block#                                 39 requests                                2           21          0          0           0           0          0             0          0          0          0         21          0              0          0          0 05-OCT-13 01.36.35.125 AM                                                   05-OCT-13 02.39.30.575 AM
   0%  WAITING control file sequential read                                     block#                                  1 blocks                                  1           20          0          0           0           0          0             0          0          0          0         20          0              0          0          0 05-OCT-13 01.35.17.953 AM                                                   05-OCT-13 02.46.56.192 AM
   0%  ON CPU                                                                   locn                                    0                                         0           19         19          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.35.34.983 AM                                                   05-OCT-13 02.30.34.786 AM
   0%  ON CPU                                                                   fileno                                  0 filetype                                2           16         16          0           0           0          0             0          0          0          0          0          0              0          0          0 05-OCT-13 01.36.08.075 AM                                                   05-OCT-13 02.44.26.996 AM
   0%  WAITING kfk: async disk IO                                               intr                                    0 timeout                        4294967295           13          0          0           0           0          0             0          0          0          0         13          0              0          0          0 05-OCT-13 01.42.34.791 AM                                                   05-OCT-13 02.38.19.485 AM

Buffer busy waits on block #2 of a datafile? Seems familiar … But instead of guessing or dumping the block (to see what type it is) we can just check what the block class# 13 is:

SQL> @bclass 13

CLASS              UNDO_SEGMENT_ID
------------------ ---------------
file header block
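
The bclass script is basically just an indexed lookup into V$WAITSTAT – assuming the standard class#-to-row ordering, a rough equivalent (without the undo segment decoding the real script does for the higher class numbers) would be:

SELECT class#, class
FROM   (SELECT rownum class#, class FROM v$waitstat)  -- class# follows the V$WAITSTAT row order
WHERE  class# = 13;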

So, block #2 is something called a “file header block”. Don’t confuse it with block #1, which is the real datafile header block (as we know it from the Concepts Guide) – block #2 is actually the LMT tablespace space management bitmap header block. From the P1 value I saw that the file# was 16, so I dumped the block with the ALTER SYSTEM DUMP DATAFILE 16 BLOCK 2 command:

Start dump data blocks tsn: 21 file#:16 minblk 2 maxblk 2
Block dump from cache:
Dump of buffer cache at level 4 for tsn=21 rdba=2
BH (0x1e1f25718) file#: 16 rdba: 0x00000002 (1024/2) class: 13 ba: 0x1e1876000
  set: 71 pool: 3 bsz: 8192 bsi: 0 sflg: 1 pwc: 2,22
  dbwrid: 2 obj: -1 objn: 1 tsn: 21 afn: 16 hint: f
  hash: [0x24f5a3008,0x24f5a3008] lru: [0x16bf3c188,0x163eee488]
  ckptq: [NULL] fileq: [NULL] objq: [0x173f10230,0x175ece830] objaq: [0x22ee19ba8,0x16df2f2c0]
  st: SCURRENT md: NULL fpin: 'ktfbwh00: ktfbhfmt' tch: 162 le: 0xcefb4af8
  flags: foreground_waiting block_written_once redo_since_read
  LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [2]
Block dump from disk:
buffer tsn: 21 rdba: 0x00000002 (1024/2)
scn: 0x0001.13988a3a seq: 0x02 flg: 0x04 tail: 0x8a3a1d02
frmt: 0x02 chkval: 0xee04 type: 0x1d=KTFB Bitmapped File Space Header
Hex dump of block: st=0, typ_found=1
Dump of memory from 0x00007F6708C71800 to 0x00007F6708C73800
7F6708C71800 0000A21D 00000002 13988A3A 04020001  [........:.......]
7F6708C71810 0000EE04 00000400 00000008 08CEEE00  [................]

So, the block number 2 is actually the first LMT space bitmap block. Every time you need to manage LMT space (allocate extents, search for space in the file or release extents), you will need to pin the block #2 of the datafile – and everyone else who tries to do the same has to wait.

Now the next question should be – who’s blocking us – who’s holding this block pinned so much then? And what is the blocker itself doing? This is where the ASH wait chains script comes into play – instead of me manually looking up the “blocking_session” column value, I’m just using a CONNECT BY to look up what the blocker itself was doing:

SQL> @ash/ash_wait_chains event2 sql_id='3rtbs9vqukc71' "timestamp'2013-10-05 01:00:00'" "timestamp'2013-10-05 03:00:00'"

-- Display ASH Wait Chain Signatures script v0.2 BETA by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS WAIT_CHAIN
------ ---------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  35%       43735 -> ON CPU
  15%       18531 -> buffer busy waits [file header block]  -> ON CPU
   7%        9266 -> buffer busy waits [file header block]  -> control file parallel write
   7%        8261 -> buffer busy waits [file header block]  -> cell smart file creation
   6%        6959 -> direct path write
   5%        6707 -> buffer busy waits [file header block]  -> DFS lock handle
   4%        4658 -> buffer busy waits [file header block]  -> local write wait
   4%        4610 -> buffer busy waits [file header block]  -> cell single block physical read
   3%        4282 -> buffer busy waits [file header block]  -> local write wait  -> db file parallel write
   2%        2801 -> buffer busy waits [file header block]
   2%        2676 -> buffer busy waits [file header block]  -> ASM file metadata operation
   2%        2092 -> buffer busy waits [file header block]  -> change tracking file synchronous read
   2%        2050 -> buffer busy waits [file header block]  -> control file sequential read

If you follow the arrows, 15% of the response time of this SQL was spent waiting for buffer busy waits [on the LMT space header block], while the blocker (we don’t know who yet) was itself on CPU – doing something. Another 7+7% was spent waiting on buffer busy waits while the blocker itself was waiting for either “control file parallel write” or “cell smart file creation” – the offloaded datafile extension wait event.

Let’s see who blocked us (which username and which program):

SQL> @ash/ash_wait_chains username||':'||program2||event2 sql_id='3rtbs9vqukc71' "timestamp'2013-10-05 01:00:00'" "timestamp'2013-10-05 03:00:00'"

-- Display ASH Wait Chain Signatures script v0.2 BETA by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS WAIT_CHAIN
------ ---------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  35%       43732 -> TANEL:(Pnnn) ON CPU
  13%       16908 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) ON CPU
   6%        6959 -> TANEL:(Pnnn) direct path write
   4%        4838 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> SYS:(Wnnn) control file parallel write
   4%        4428 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) control file parallel write
   3%        4166 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) cell smart file creation
   3%        4095 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> SYS:(Wnnn) cell smart file creation
   3%        3607 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> SYS:(Wnnn) DFS lock handle
   3%        3147 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) local write wait
   2%        3117 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) local write wait  -> SYS:(DBWn) db file parallel write
   2%        3100 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) DFS lock handle
   2%        2801 -> TANEL:(Pnnn) buffer busy waits [file header block]
   2%        2764 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> TANEL:(Pnnn) cell single block physical read
   2%        2676 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> SYS:(Wnnn) ASM file metadata operation
   1%        1825 -> TANEL:(Pnnn) buffer busy waits [file header block]  -> SYS:(Wnnn) cell single block physical read

Now we also see who was blocking us. In the line with 13% of response time, we were blocked by another TANEL user running a PX slave (I’m replacing digits with “n” in the background process names, like P000, to collapse them all into a single line). In line #4 above, with 4% of the wait time, TANEL’s PX slave was blocked by SYS running a Wnnn process – one of the workers for space management tasks like asynchronous datafile pre-extension. So, it looks like the datafile extension bottleneck has a role to play in this! I could pre-extend the datafile in advance myself (it’s a bigfile tablespace) when anticipating a big data load that has to happen fast – but the first thing I would do is check whether the tablespace is an LMT tablespace with a UNIFORM extent sizing policy and whether the minimum extent size is big enough for my data load. In other words, your biggest fact tables with heavy data loads should reside in an LMT tablespace with a UNIFORM extent size of multiple MB (I’ve used between 32MB and 256MB extent sizes). AUTOALLOCATE doesn’t really save you much – with autoallocate there’s an LMT space management bitmap bit for each 64kB chunk of the file, so there’s much more overhead when searching for and allocating a big extent consisting of many small LMT “chunks” vs a big extent which consists of only a single big LMT chunk.
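
For illustration, here’s a sketch of what that proactive setup could look like – the tablespace name, the sizes and the ASM disk group are made up for this example:

-- bigfile LMT tablespace with large uniform extents for heavy data loads
CREATE BIGFILE TABLESPACE big_load_data
  DATAFILE '+DATA' SIZE 100G
  EXTENT MANAGEMENT LOCAL UNIFORM SIZE 64M;

-- pre-extend the single datafile of the bigfile tablespace before a big load,
-- so the loading sessions don't have to wait for datafile extension themselves
ALTER TABLESPACE big_load_data RESIZE 1T;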

I will come back to the topic of buffer busy waits on block #2 in a following post – this post aims to show that you can use pretty much any ASH column you like to get more details about who was doing what in the entire wait chain.

And another topic for the future would be the wait events which do not populate the blocking session info properly (depending on version – most latches and mutexes, for example). The manual approach would still be needed there – although with mutexes it is possible to extract the (exclusive) blocking SID from the wait event’s PARAMETER2 value. Maybe in ash_wait_chains v0.3 :-)
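
For example, for the mutex waits (like cursor: pin S wait on X) the P2 “value” field carries the holding session’s SID in its high-order 32 bits on 64-bit platforms – so a manual lookup could look something like the sketch below (an assumption-level example, not part of the script yet):

SELECT sample_time, session_id waiter_sid
     , TRUNC(p2 / 4294967296) blocking_sid   -- high-order 32 bits of the mutex "value" field
FROM   v$active_session_history
WHERE  event = 'cursor: pin S wait on X'
ORDER BY sample_time;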

 

When do Oracle Parallel Execution Slaves issue buffered physical reads – Part 1?


This post applies both to non-Exadata and Exadata systems.

Before Oracle 11.2 came out, it was true to say that Oracle Parallel Execution slaves always do direct path reads (bypassing the buffer cache) when doing full segment scans. This should not be taken simplistically though. Even when you were doing full table scans, then yes, the scanning itself was done with direct path multiblock reads – but if you had to visit other, additional blocks out of the scanning sequence, these extra IOs were done with regular buffered reads. For example, next row piece fetching of chained rows or undo block access for CR reads was done with buffered single block reads, or even buffered multiblock reads, if some form of prefetching kicked in.

In addition to that, random table/index accesses like index range scans and the following table block fetches are always done in a buffered way both in serial and parallel execution cases.

Starting from Oracle 11.2 though, Oracle parallel execution slaves can also do the parallel full segment scans via the buffer cache. The feature is called In-Memory Parallel Execution, not to be confused with the upcoming Oracle 12c In-Memory Option (which gives you a columnar, compressed, in-memory cache of your on-disk data).

The 11.2 in-memory parallel execution doesn’t introduce any new data formats, but just allows you to cache your hottest tables across buffer caches of multiple RAC nodes (different nodes hold different “fragments” of a segment) and allow PX slaves to avoid physical disk reads and even work entirely from the local buffer cache of the RAC node the PX slave is running on (data locality!). You should use Oracle 11.2.0.3+ for in-mem PX as it has some improvements and bugfixes for node- and NUMA-affinity for PX slaves.

So, this is a great feature – if disk scanning and data retrieval IO is your bottleneck (and you have plenty of memory). But on Exadata, the storage cells give you awesome disk scanning speeds and data filtering/projection/decompression offloading anyway. If you use buffered reads, then you won’t use Smart Scans – as smart scans need direct path reads as a prerequisite. And if you don’t use smart scans, the Exadata storage cells will just act as block IO servers for you – and even if you have all the data cached in RAM, your DB nodes (compute nodes) would be used for all the filtering and decompression of billions of rows. Also, there’s no storage indexing in memory (well, unless you use zone maps, which do a similar thing at a higher level in 12c).

So, long story short, you likely do not want to use In-Memory PX on Exadata – and even on non-Exadata, you probably do not want it to kick in automatically at “random” times without you controlling this. This leads to the question: when does the in-memory PX kick in and how is it controlled?

The in-memory PX can kick in when either of the following is true:

  1. When _parallel_cluster_cache_policy = CACHED (this will be set so from default when parallel_degree_policy = AUTO)
  2. When the segment (or partition) is marked as CACHE or KEEP, for example:
    • ALTER TABLE t STORAGE (BUFFER_POOL KEEP)
    • ALTER TABLE t CACHE

So, while #1 is relatively well known behavior, the #2 is not. So, the In-Memory PX can kick in even if your parallel_degree_policy = MANUAL!
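
So it is worth checking the data dictionary for any segments that have these attributes set – a simple starting point (tables only here; you may want to check partitions and indexes the same way):

SELECT owner, table_name, cache, buffer_pool
FROM   dba_tables
WHERE  TRIM(cache) = 'Y'
   OR  buffer_pool = 'KEEP';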

The default value for _parallel_cluster_cache_policy is ADAPTIVE – and that provides a clue. I read it the way that when set to ADAPTIVE, the in-memory PX caching decision is made “adaptively”, based on whether the segment is marked as CACHE/KEEP in the data dictionary or not. My brief tests showed that when marking an object for the KEEP buffer pool, Oracle (11.2.0.3) tries to do the in-memory PX even if the combined KEEP pool size in the RAC cluster is smaller than the segment itself. Therefore, all scans ended up re-reading all blocks from disk into the buffer cache again (and throwing the “earlier” blocks of the segment out). When the KEEP pool was not configured, the DEFAULT buffer cache was used (even if the object was marked to be in the KEEP pool). So, when marking your objects for KEEP or CACHE, make sure you have enough buffer cache allocated for this.

Note that with parallel_degree_policy=AUTO & _parallel_cluster_cache_policy=CACHED, Oracle tries to be more intelligent about this, allowing only up to _parallel_cluster_cache_pct (default value 80) percent of total buffer cache in the RAC cluster to be used for in-memory PX. I haven’t tested this deep enough to say I know how the algorithm works. So far I have preferred the manual CACHE/KEEP approach for the latest/hottest partitions (in a few rare cases that I’ve used it).

So, as long as you don’t mark your tables/indexes/partitions as CACHE or KEEP and use parallel_degree_policy=MANUAL or LIMITED, you should not get the in-mem PX to kick in and all parallel full table scans should get nicely offloaded on Exadata (unless you hit any of the other limitations that block a smart scan from happening).
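
And if you find that in-memory PX has kicked in where you didn’t want it, reverting the triggering attributes is straightforward (a hypothetical table T used for the example):

ALTER TABLE t NOCACHE;
ALTER TABLE t STORAGE (BUFFER_POOL DEFAULT);
ALTER SYSTEM SET parallel_degree_policy = LIMITED;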

Update 1: I just wrote Part 2 for this article too.

Update 2: Frits Hoogland pointed me to a bug/MOS Note which may be interesting to you if you use KEEP pool:

KEEP BUFFER POOL Does Not Work for Large Objects on 11g (Doc ID 1081553.1):

  • Due to this bug, tables with size >10% of cache size, were being treated as ‘large tables’ for their reads and this resulted in execution of a new SERIAL_DIRECT_READ path in 11g.
  • With the bug fix applied, any object in the KEEP buffer pool, whose size is less than DB_KEEP_CACHE_SIZE, is considered as a small or medium sized object. This will cache the read blocks and avoid subsequent direct reads for these objects.

When do Oracle Parallel Execution Slaves issue buffered physical reads – Part 2?


In the previous post about in-memory parallel execution I described in which cases the in-mem PX can kick in for your parallel queries.

A few years ago (around Oracle 11.2.0.2 and Exadata X2 release time) I was helping a customer with their migration to Exadata X2. Many of the queries ran way slower on Exadata compared to their old HP Superdome. The Exadata system was configured according to the Oracle’s “best practices”, that included setting the parallel_degree_policy = AUTO.

As there were thousands of reports (and most of them had performance issues) and we couldn’t extract SQL Monitoring reports (due to another issue) I just looked into ASH data for a general overview. A SQL Monitoring report takes the execution plan line time breakdown from ASH anyway.

First I ran a simple ASH query which counted the ASH samples (seconds spent) and grouped the results by the rowsource type (I was using a custom script then, but you could achieve the same with running @ash/ashtop "sql_plan_operation||' '||sql_plan_options" session_type='FOREGROUND' sysdate-1/24 sysdate for example):

SQL> @ash_custom_report

PLAN_LINE                                            COUNT(*)        PCT
-------------------------------------------------- ---------- ----------
TABLE ACCESS STORAGE FULL                              305073       47.6
                                                        99330       15.5
LOAD AS SELECT                                          86802       13.6
HASH JOIN                                               37086        5.8
TABLE ACCESS BY INDEX ROWID                             20341        3.2
REMOTE                                                  13981        2.2
HASH JOIN OUTER                                          8914        1.4
MAT_VIEW ACCESS STORAGE FULL                             7807        1.2
TABLE ACCESS STORAGE FULL FIRST ROWS                     6348          1
INDEX RANGE SCAN                                         4906         .8
HASH JOIN BUFFERED                                       4537         .7
PX RECEIVE                                               4201         .7
INSERT STATEMENT                                         3601         .6
PX SEND HASH                                             3118         .5
PX SEND BROADCAST                                        3079         .5
SORT AGGREGATE                                           3074         .5
BUFFER SORT                                              2266         .4
SELECT STATEMENT                                         2259         .4
TABLE ACCESS STORAGE SAMPLE                              2136         .3
INDEX UNIQUE SCAN                                        2090         .3

The above output shows that indeed over 47% of Database Time in ASH history has been spent in TABLE ACCESS STORAGE FULL operations (regardless of which SQL_ID). The blank 15.5% is activity by non-SQL plan execution stuff, like parsing, PL/SQL, background process activity, logins/connection management etc. But only about 3.2+0.8+0.3 = 4.3% of the total DB Time was spent in index related row sources. So, the execution plans for our reports were doing what they are supposed to be doing on an Exadata box – doing full table scans.

However, why were they so slow? Shouldn’t Exadata be doing full table scans really fast?! The answer is obviously yes, but only when the smart scans actually kick in!

As I wanted a high level overview, I added a few more columns to my script (rowsource_events.sql) – namely IS_PARALLEL, which reports whether the active session was a serial session or a PX slave, and the wait event (what that session was actually doing).
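
If you don’t have that script at hand, the IS_PARALLEL flag can be derived from the ASH QC columns – a simplified sketch (my own approximation; note that it also counts the query coordinator itself as ‘PARALLEL’):

SELECT sql_plan_operation || ' ' || sql_plan_options plan_line
     , CASE WHEN qc_session_id IS NOT NULL THEN 'PARALLEL' ELSE 'SERIAL' END is_parallel
     , session_state, wait_class, event
     , COUNT(*)
FROM   v$active_session_history
WHERE  sql_plan_operation = 'TABLE ACCESS'
  AND  sql_plan_options LIKE 'STORAGE%FULL'
GROUP BY sql_plan_operation || ' ' || sql_plan_options
       , CASE WHEN qc_session_id IS NOT NULL THEN 'PARALLEL' ELSE 'SERIAL' END
       , session_state, wait_class, event
ORDER BY COUNT(*) DESC;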

As the earlier output had shown that close to 50% of DB time was spent on table scans – I drilled down to TABLE ACCESS STORAGE FULL data retrieval operations only, excluding all the other data processing stuff like joins, loads and aggregations:

SQL> @ash/rowsource_events TABLE% STORAGE%FULL

PLAN_LINE                                IS_PARAL SESSION WAIT_CLASS      EVENT                                      COUNT(*)        PCT
---------------------------------------- -------- ------- --------------- ---------------------------------------- ---------- ----------
TABLE ACCESS STORAGE FULL                PARALLEL WAITING User I/O        cell multiblock physical read                139756         47
TABLE ACCESS STORAGE FULL                PARALLEL ON CPU                                                                64899       21.8
TABLE ACCESS STORAGE FULL                SERIAL   WAITING User I/O        cell multiblock physical read                 24133        8.1
TABLE ACCESS STORAGE FULL                PARALLEL WAITING User I/O        cell single block physical read               16430        5.5
TABLE ACCESS STORAGE FULL                PARALLEL WAITING User I/O        read by other session                         12141        4.1
TABLE ACCESS STORAGE FULL                PARALLEL WAITING Cluster         gc buffer busy acquire                        10771        3.6
TABLE ACCESS STORAGE FULL                PARALLEL WAITING User I/O        cell smart table scan                          7573        2.5
TABLE ACCESS STORAGE FULL                SERIAL   WAITING Cluster         gc cr multi block request                      7158        2.4
TABLE ACCESS STORAGE FULL                SERIAL   ON CPU                                                                 6872        2.3
TABLE ACCESS STORAGE FULL                PARALLEL WAITING Cluster         gc cr multi block request                      2610         .9
TABLE ACCESS STORAGE FULL                PARALLEL WAITING User I/O        cell list of blocks physical read              1763         .6
TABLE ACCESS STORAGE FULL                SERIAL   WAITING User I/O        cell single block physical read                1744         .6
TABLE ACCESS STORAGE FULL                SERIAL   WAITING User I/O        cell list of blocks physical read               667         .2
TABLE ACCESS STORAGE FULL                SERIAL   WAITING User I/O        cell smart table scan                           143          0
TABLE ACCESS STORAGE FULL                SERIAL   WAITING Cluster         gc cr disk read                                 122          0
TABLE ACCESS STORAGE FULL                PARALLEL WAITING Cluster         gc current grant busy                            97          0
TABLE ACCESS STORAGE FULL                PARALLEL WAITING Cluster         gc current block 3-way                           85          0
TABLE ACCESS STORAGE FULL                PARALLEL WAITING Cluster         gc cr grant 2-way                                68          0
TABLE ACCESS STORAGE FULL                SERIAL   WAITING User I/O        direct path read                                 66          0
TABLE ACCESS STORAGE FULL                SERIAL   WAITING Cluster         gc current grant 2-way                           52          0

Indeed, over 60% of all my full table scan operations were waiting for buffered read related wait events! (Both buffered IO waits and RAC global cache waits, which you wouldn’t get with direct path reads – well, unless some read consistency cloning & rollback operations are needed.) So, Smart Scans were definitely not kicking in for some of those full table scans (despite the support analyst’s outdated claim about PX slaves “never” using buffered IOs).
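
A quick system-level sanity check I like to do for this is comparing how much of the full scanning actually went through the direct read path at all (a rough indicator only – it is instance-wide and cumulative since startup):

SELECT name, value
FROM   v$sysstat
WHERE  name IN ('table scans (long tables)'
              , 'table scans (direct read)'
              , 'cell scans');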

Also, when you are not offloading IOs, low level data processing, filtering etc. to the storage cells, you will need to use your DB node CPUs for all that work (while the cell CPUs sit almost idle, if nothing gets offloaded). So, when enabling the smart scans, both the DB IO wait times and the DB CPU usage per unit of work done would go down.

Anyway – after looking deeper into the issue – I noticed that they had parallel_degree_policy set to AUTO, so the in-memory PX also kicked in as a side-effect. Once we set it to LIMITED, the problem went away: instead of lots of buffered IO wait events we got far less wait time, now mostly on cell smart table scan – and higher CPU utilization (yes, higher), as a big bottleneck had been removed and we got much more work done every second.

This was on Oracle 11.2.0.2 and I didn’t have a chance to look any deeper, but I suspected that the problem came from the fact that this database had many different workloads in it, both smart scanning ones and also some regular, index based accesses – so possibly the “in-mem PX automatic memory distributor” (or whatever controls what gets cached and what not) wasn’t accounting for all the non-PX-scan activity. Apparently this should be improved in Oracle 11.2.0.3 – but I haven’t tested this myself; I feel safer manually controlling the CACHE/KEEP attributes.

Hard Drive Predictive Failures on Exadata


This post also applies to non-Exadata systems as hard drives work the same way in other storage arrays too – just the commands you would use for extracting the disk-level metrics would be different.

I just noticed that one of our Exadatas had a disk put into “predictive failure” mode and thought to show how to measure why the disk is in that mode (as opposed to just replacing it without really understanding the issue ;-)

SQL> @exadata/cellpd
Show Exadata cell versions from V$CELL_CONFIG....

DISKTYPE             CELLNAME             STATUS                 TOTAL_GB     AVG_GB  NUM_DISKS   PREDFAIL   POORPERF WTCACHEPROB   PEERFAIL   CRITICAL
-------------------- -------------------- -------------------- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------
FlashDisk            192.168.12.3         normal                      183         23          8
FlashDisk            192.168.12.3         not present                 183         23          8                     3
FlashDisk            192.168.12.4         normal                      366         23         16
FlashDisk            192.168.12.5         normal                      366         23         16
HardDisk             192.168.12.3         normal                    20489       1863         11
HardDisk             192.168.12.3         warning - predictive       1863       1863          1          1
HardDisk             192.168.12.4         normal                    22352       1863         12
HardDisk             192.168.12.5         normal                    22352       1863         12

So, one of the disks in storage cell with IP 192.168.12.3 has been put into predictive failure mode. Let’s find out why!

To find out which exact disk, I ran one of my scripts for displaying Exadata disk topology (partial output below):

SQL> @exadata/exadisktopo2
Showing Exadata disk topology from V$ASM_DISK and V$CELL_CONFIG....

CELLNAME             LUN_DEVICENAME       PHYSDISK                       PHYSDISK_STATUS                                                                  CELLDISK                       CD_DEVICEPART                                                                    GRIDDISK                       ASM_DISK                       ASM_DISKGROUP                  LUNWRITECACHEMODE
-------------------- -------------------- ------------------------------ -------------------------------------------------------------------------------- ------------------------------ -------------------------------------------------------------------------------- ------------------------------ ------------------------------ ------------------------------ ----------------------------------------------------------------------------------------------------
192.168.12.3         /dev/sda             35:0                           normal                                                                           CD_00_enkcel01                 /dev/sda3                                                                        DATA_CD_00_enkcel01            DATA_CD_00_ENKCEL01            DATA                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sda             35:0                           normal                                                                           CD_00_enkcel01                 /dev/sda3                                                                        RECO_CD_00_enkcel01            RECO_CD_00_ENKCEL01            RECO                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdb             35:1                           normal                                                                           CD_01_enkcel01                 /dev/sdb3                                                                        DATA_CD_01_enkcel01            DATA_CD_01_ENKCEL01            DATA                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdb             35:1                           normal                                                                           CD_01_enkcel01                 /dev/sdb3                                                                        RECO_CD_01_enkcel01            RECO_CD_01_ENKCEL01            RECO                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdc             35:2                           normal                                                                           CD_02_enkcel01                 /dev/sdc                                                                         DATA_CD_02_enkcel01            DATA_CD_02_ENKCEL01            DATA                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdc             35:2                           normal                                                                           CD_02_enkcel01                 /dev/sdc                                                                         DBFS_DG_CD_02_enkcel01         DBFS_DG_CD_02_ENKCEL01         DBFS_DG                        "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdc             35:2                           normal                                                                           CD_02_enkcel01                 /dev/sdc                                                                         RECO_CD_02_enkcel01            RECO_CD_02_ENKCEL01            RECO                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdd             35:3                           warning - predictive                                                             CD_03_enkcel01                 /dev/sdd                                                                         DATA_CD_03_enkcel01                                                                          "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdd             35:3                           warning - predictive                                                             CD_03_enkcel01                 /dev/sdd                                                                         DBFS_DG_CD_03_enkcel01                                                                       "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdd             35:3                           warning - predictive                                                             CD_03_enkcel01                 /dev/sdd                                                                         RECO_CD_03_enkcel01                                                                          "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"

Ok, looks like /dev/sdd (with address 35:3) is the “failed” one.

When listing the alerts from the storage cell, indeed we see that a failure has been predicted, warning raised and even handled – XDMG process gets notified and the ASM disks get dropped from the failed grid disks (as you see from the exadisktopo output above if you scroll right).

CellCLI> LIST ALERTHISTORY WHERE alertSequenceID = 456 DETAIL;
	 name:              	 456_1
	 alertDescription:  	 "Data hard disk entered predictive failure status"
	 alertMessage:      	 "Data hard disk entered predictive failure status.  Status        : WARNING - PREDICTIVE FAILURE  Manufacturer  : HITACHI  Model Number  : H7220AA30SUN2.0T  Size          : 2.0TB  Serial Number : 1016M7JX2Z  Firmware      : JKAOA28A  Slot Number   : 3  Cell Disk     : CD_03_enkcel01  Grid Disk     : DBFS_DG_CD_03_enkcel01, DATA_CD_03_enkcel01, RECO_CD_03_enkcel01"
	 alertSequenceID:   	 456
	 alertShortName:    	 Hardware
	 alertType:         	 Stateful
	 beginTime:         	 2013-11-27T07:48:03-06:00
	 endTime:           	 2013-11-27T07:55:52-06:00
	 examinedBy:        	 
	 metricObjectName:  	 35:3
	 notificationState: 	 1
	 sequenceBeginTime: 	 2013-11-27T07:48:03-06:00
	 severity:          	 critical
	 alertAction:       	 "The data hard disk has entered predictive failure status. A white cell locator LED has been turned on to help locate the affected cell, and an amber service action LED has been lit on the drive to help locate the affected drive. The data from the disk will be automatically rebalanced by Oracle ASM to other disks. Another alert will be sent and a blue OK-to-Remove LED will be lit on the drive when rebalance completes. Please wait until rebalance has completed before replacing the disk. Detailed information on this problem can be found at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1112995.1  "

	 name:              	 456_2
	 alertDescription:  	 "Hard disk can be replaced now"
	 alertMessage:      	 "Hard disk can be replaced now.  Status        : WARNING - PREDICTIVE FAILURE  Manufacturer  : HITACHI  Model Number  : H7220AA30SUN2.0T  Size          : 2.0TB  Serial Number : 1016M7JX2Z  Firmware      : JKAOA28A  Slot Number   : 3  Cell Disk     : CD_03_enkcel01  Grid Disk     : DBFS_DG_CD_03_enkcel01, DATA_CD_03_enkcel01, RECO_CD_03_enkcel01 "
	 alertSequenceID:   	 456
	 alertShortName:    	 Hardware
	 alertType:         	 Stateful
	 beginTime:         	 2013-11-27T07:55:52-06:00
	 examinedBy:        	 
	 metricObjectName:  	 35:3
	 notificationState: 	 1
	 sequenceBeginTime: 	 2013-11-27T07:48:03-06:00
	 severity:          	 critical
	 alertAction:       	 "The data on this disk has been successfully rebalanced by Oracle ASM to other disks. A blue OK-to-Remove LED has been lit on the drive. Please replace the drive."

The two alerts show that we first detected a (soon) failing disk (event 456_1) and then ASM kicked in and dropped the ASM disks from the failing disk and rebalanced the data elsewhere (event 456_2).

But we still do not know why we are expecting the disk to fail! And the alert info and CellCLI command output do not have this detail. This is where S.M.A.R.T. monitoring comes in. Major hard drive manufacturers support the SMART standard for both reactive and predictive monitoring of a hard disk’s internal workings. And there are commands for querying these metrics.

Let’s find the failed disk info at the cell level with CELLCLI:

CellCLI> LIST PHYSICALDISK;
	 35:0     	 JK11D1YAJTXVMZ	 normal
	 35:1     	 JK11D1YAJB4V0Z	 normal
	 35:2     	 JK11D1YAJAZMMZ	 normal
	 35:3     	 JK11D1YAJ7JX2Z	 warning - predictive failure
	 35:4     	 JK11D1YAJB3J1Z	 normal
	 35:5     	 JK11D1YAJB4J8Z	 normal
	 35:6     	 JK11D1YAJ7JXGZ	 normal
	 35:7     	 JK11D1YAJB4E5Z	 normal
	 35:8     	 JK11D1YAJ8TY3Z	 normal
	 35:9     	 JK11D1YAJ8TXKZ	 normal
	 35:10    	 JK11D1YAJM5X9Z	 normal
	 35:11    	 JK11D1YAJAZNKZ	 normal
	 FLASH_1_0	 1014M02JC3    	 not present
	 FLASH_1_1	 1014M02JYG    	 not present
	 FLASH_1_2	 1014M02JV9    	 not present
	 FLASH_1_3	 1014M02J93    	 not present
	 FLASH_2_0	 1014M02JFK    	 not present
	 FLASH_2_1	 1014M02JFL    	 not present
	 FLASH_2_2	 1014M02JF7    	 not present
	 FLASH_2_3	 1014M02JF8    	 not present
	 FLASH_4_0	 1014M02HP5    	 normal
	 FLASH_4_1	 1014M02HNN    	 normal
	 FLASH_4_2	 1014M02HP2    	 normal
	 FLASH_4_3	 1014M02HP4    	 normal
	 FLASH_5_0	 1014M02JUD    	 normal
	 FLASH_5_1	 1014M02JVF    	 normal
	 FLASH_5_2	 1014M02JAP    	 normal
	 FLASH_5_3	 1014M02JVH    	 normal

Ok, let’s look into the details, as we also need the deviceId for querying the SMART info:

CellCLI> LIST PHYSICALDISK 35:3 DETAIL;
	 name:              	 35:3
	 deviceId:          	 26
	 diskType:          	 HardDisk
	 enclosureDeviceId: 	 35
	 errMediaCount:     	 0
	 errOtherCount:     	 0
	 foreignState:      	 false
	 luns:              	 0_3
	 makeModel:         	 "HITACHI H7220AA30SUN2.0T"
	 physicalFirmware:  	 JKAOA28A
	 physicalInsertTime:	 2010-05-15T21:10:49-05:00
	 physicalInterface: 	 sata
	 physicalSerial:    	 JK11D1YAJ7JX2Z
	 physicalSize:      	 1862.6559999994934G
	 slotNumber:        	 3
	 status:            	 warning - predictive failure

Ok, the disk device was /dev/sdd, the disk name is 35:3 and the device ID is 26. And it’s a SATA disk. So I will run smartctl with the sat+megaraid device type option to query the disk SMART metrics – via the SCSI controller that the disks are attached to. Note that the ,26 at the end is the deviceId reported by the LIST PHYSICALDISK command. There’s quite a lot of output – the important part is the Raw_Read_Error_Rate line in the attribute table:

> smartctl  -a  /dev/sdd -d sat+megaraid,26
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-400.11.1.el5uek] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     HITACHI H7220AA30SUN2.0T 1016M7JX2Z
Serial Number:    JK11D1YAJ7JX2Z
LU WWN Device Id: 5 000cca 221df9d11
Firmware Version: JKAOA28A
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Nov 28 06:28:13 2013 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(22330) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   028   028   016    Pre-fail  Always       -       430833663
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       103
  3 Spin_Up_Time            0x0007   117   117   024    Pre-fail  Always       -       614 (Average 624)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       69
  5 Reallocated_Sector_Ct   0x0033   058   058   005    Pre-fail  Always       -       743
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   112   112   020    Pre-fail  Offline      -       39
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       30754
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       80
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       80
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       23 (Min/Max 17/48)
196 Reallocated_Event_Count 0x0032   064   064   000    Old_age   Always       -       827
197 Current_Pending_Sector  0x0022   089   089   000    Old_age   Always       -       364
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     30754         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So, from the highlighted output above we see that the Raw_Read_Error_Rate indicator for this hard drive is pretty close to the threshold of 16. The SMART metrics really are just “health indicators” following the defined standards (read the wiki article). The health indicator value can range from 0 to 253. For the Raw_read_Error_Rate metric (which measures the physical read success from the disk surface), bigger value is better (apparently 100 is the max with the Hitachi disks at least, most of the other disks were showing around 90-100). So whenever there are media read errors, the metric will drop, the more errors, the more it will drop.

Apparently some of the read errors are inevitable (and detected by various checks like ECC), especially in the high-density disks. The errors will be corrected/worked around, sometimes via ECC, sometimes by a re-read. So, yes, your hard drive performance can get worse as disks age or are about to fail. If the metric ever goes below the defined threshold of 16, the disk apparently (and consistently over some period of time) isn’t working so great, so it should better be replaced.

Note that the RAW_VALUE column does not necessarily show the number of failed reads from the disk platter. It may represent the number of sectors that failed to be read, or it may be just a bitmap – or both combined into the low- and high-order bytes of this value. For example, when converting the raw value of 430833663 to hex, we get 0x19ADFFFF. Perhaps the low-order FFFF is some sort of a bitmap and the high-order 0x19AD is the number of failed sectors or reads. There’s some more info available about Seagate disks, but in our V2 we have Hitachi ones and I couldn’t find anything about how to decode the RAW_VALUE for their disks. So, we just need to trust that the “normalized” SMART health indicators for the different metrics tell us when there’s a problem.
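By the way, if you want to do that hex conversion without leaving SQL*Plus (just a convenience, nothing smartctl-specific), a TO_CHAR hexadecimal format mask does it:

SQL> SELECT TO_CHAR(430833663, 'FMXXXXXXXX') hex FROM dual;

HEX
---------
19ADFFFF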

Even though I did not see the actual value (nor the worst value) crossing the threshold when I ran the smartctl command, 28 is still pretty close to the threshold of 16, considering that normally the indicator should be close to 100. So my guess here is that the indicator actually did cross the threshold at some point – this is when the alert got raised. It’s just that by the time I logged in and ran my diagnostics commands, the disk worked better again. It looks like the “worst” values are not remembered properly by the disks (or it could be that some SMART tool resets these every now and then). Note that we would see SMART alerts with the actual problem metric values in the Linux /var/log/messages file if the smartd service were enabled in the Storage Cell Linux OS – but apparently it’s disabled and probably one of Oracle’s own daemons in the cell is monitoring that instead.

So what does this info tell us? A low “health indicator” for the Raw_Read_Error_Rate means that there are problems with physically reading the sequences of bits from the disk platter. This means bad sectors or weak sectors (that are probably soon about to become bad sectors). Had we seen a bad health state for UDMA_CRC_Error_Count instead, for example, it would have indicated a data transfer issue over the SATA cable. So, it looks like the reason for the disk being in the predictive failure state is that it’s having just too many read errors from the physical disk platter.

If you look into the other highlighted metrics above – the Reallocated_Sector_Ct and Current_Pending_Sector, you see there are hundreds of disk sectors (743 and 364) that have had IO issues, but eventually the reads finished ok and the sectors were migrated (remapped) to a spare disk area. As these disks have 512B sector size, this means that some Oracle block-size IOs from a single logical sector range may actually have to read part of the data from the original location and seek to some other location on the disk for reading the rest (from the remapped sector). So, again, your disk performance may get worse when your disk is about to fail or is just having quality issues.

For reference, here’s an example from another, healthier disk in this Exadata storage cell:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   086   086   016    Pre-fail  Always       -       2687039
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       99
  3 Spin_Up_Time            0x0007   119   119   024    Pre-fail  Always       -       601 (Average 610)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       68
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   114   114   020    Pre-fail  Offline      -       38
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       30778
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       68
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       81
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       81
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       21 (Min/Max 16/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

The Raw_Read_Error_Rate indicator still shows 86 (% from ideal?), but it’s much farther away from the threshold of 16. Many of the other disks showed even 99 or 100 and apparently this metric value changed as the disk behavior changed. Some disks with a value of 88 jumped to 100 and an hour later they were at 95 and so on. So the VALUE column allows real-time monitoring of these internal disk metrics; the only thing that doesn’t make sense right now is why the WORST column gets reset over time.

For the better behaving disk, the Reallocated_Sector_Ct and Current_Pending_Sector metrics show zero, so this disk doesn’t seem to have bad or weak sectors (yet).

I hope that this post is another example that it is possible to dig deeper – but only when the piece of software or hardware is properly instrumented, of course. If it didn’t have such instrumentation, it would be way harder (you would have to take a stethoscope and record the noise of the hard drive for analysis, or open the drive in a dustless vacuum and see what it’s doing yourself ;-)

Note that I will be talking about systematic Exadata performance troubleshooting (and optimization) in my Advanced Exadata Performance online seminar on 16-20 December.

cell flash cache read hits vs. cell writes to flash cache statistics on Exadata


When the Smart Flash Cache was introduced in Exadata, it was caching reads only. So there were only read “optimization” statistics like cell flash cache read hits and physical read requests/bytes optimized in V$SESSTAT and V$SYSSTAT (the former accounted for the read IO requests that got their data from the flash cache and the latter ones accounted for the disk IOs avoided thanks to both the flash cache and storage indexes). So if you wanted to measure the benefit of the flash cache only, you’d have to use the cell flash cache read hits metric.

This all was fine until you enabled the Write-Back flash cache in a newer version of cellsrv. We still had only the “read hits” statistic in the V$ views! And when investigating it closer, both the read hits and write hits were accumulated in the same read hits statistic! (I can’t reproduce this on our patched 11.2.0.3 with latest cellsrv anymore, but it was definitely the behavior earlier, as I demoed it in various places).

Side-note: This is likely because it’s not so easy to just add more statistics to Oracle code within a single small patch. The statistic counters are referenced by other modules using macros with their direct numeric IDs (and memory offsets into the v$sesstat array), and the IDs & addresses would change when more statistics get added. So, you can pretty much add new statistic counters only with new full patchsets, like 11.2.0.4. It’s the same with instance parameters, by the way – that’s why the “spare” statistics and spare parameters exist: they’re placeholders for temporary use, until the new parameter or statistic gets added permanently with a full patchset update.

So, this is probably the reason why both the flash cache read and write hits initially got accumulated under the cell flash cache read hits statistic, but later on this seemed to get “fixed”, so that the read hits only showed read hits and the flash write hits were not accounted anywhere. You can test this easily by measuring your DBWR’s v$sesstat metrics with Snapper, for example: if you get way more cell flash cache read hits than physical read total IO requests, then you’re probably accumulating both read and write hits in the same metric.
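A minimal sketch of such a check without Snapper – just joining V$SESSTAT to V$STATNAME for the DBWR session(s) and comparing the two statistics (the program filter below is an assumption, adjust it to your environment):

SQL> SELECT s.sid, sn.name, ss.value
       FROM v$session s, v$sesstat ss, v$statname sn
      WHERE s.sid = ss.sid
        AND ss.statistic# = sn.statistic#
        AND s.program LIKE '%DBW%'
        AND sn.name IN ('cell flash cache read hits',
                        'physical read total IO requests');

If the read hits dwarf the physical read requests for a write-mostly process like DBWR, you are most likely on a version that still lumps the write hits into the read hits statistic.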

Let’s look into a few different database versions:

SQL> @i

USERNAME             INST_NAME    HOST_NAME                 SID   SERIAL#  VERSION    STARTED 
-------------------- ------------ ------------------------- ----- -------- ---------- --------
SYS                  db12c1       enkdb03.enkitec.com       1497  20671    12.1.0.1.0 20131127

SQL> @sys cell%flash

NAME                                                                                  VALUE
---------------------------------------------------------------- --------------------------
cell flash cache read hits                                                          1874361

In the 12.1.0.1 database above, we still have only the read hits metric. But in the Oracle 11.2.0.4 output below, we finally have the flash cache IOs broken down by reads and writes, plus a few special metrics indicating whether the block written to already existed in the flash cache (cell overwrites in flash cache) and whether the block range written to flash was only partially cached when the DB issued the write (cell partial writes in flash cache):

SQL> @i

USERNAME             INST_NAME    HOST_NAME                 SID   SERIAL#  VERSION    STARTED 
-------------------- ------------ ------------------------- ----- -------- ---------- --------
SYS                  dbm012       enkdb02.enkitec.com       199   607      11.2.0.4.0 20131201

SQL> @sys cell%flash

NAME                                                                                  VALUE
---------------------------------------------------------------- --------------------------
cell writes to flash cache                                                           711439
cell overwrites in flash cache                                                       696661
cell partial writes in flash cache                                                        9
cell flash cache read hits                                                           699240

So, this probably means that the upcoming Oracle 12.1.0.2 will have the flash cache write hit metrics in it too. So in the newer versions there’s no need to get creative when estimating the write-back flash cache hits in our performance scripts (the Exadata Snapper currently tries to derive this value from other metrics, relying on the bug where both read and write hits accumulated under the same metric, so I will need to update it based on the DB version we are running on).

So, when I look into one of the DBWR processes in an 11.2.0.4 DB on Exadata, I see the breakdown of flash read vs. write hits:

SQL> @i

USERNAME             INST_NAME    HOST_NAME                 SID   SERIAL#  VERSION    STARTED 
-------------------- ------------ ------------------------- ----- -------- ---------- --------
SYS                  dbm012       enkdb02.enkitec.com       199   607      11.2.0.4.0 20131201

SQL> @exadata/cellver
Show Exadata cell versions from V$CELL_CONFIG....

CELL_PATH            CELL_NAME            CELLSRV_VERSION      FLASH_CACHE_MODE     CPU_COUNT 
-------------------- -------------------- -------------------- -------------------- ----------
192.168.12.3         enkcel01             11.2.3.2.1           WriteBack            16        
192.168.12.4         enkcel02             11.2.3.2.1           WriteBack            16        
192.168.12.5         enkcel03             11.2.3.2.1           WriteBack            16        

SQL> @ses2 "select sid from v$session where program like '%DBW0%'" flash

       SID NAME                                                                  VALUE
---------- ---------------------------------------------------------------- ----------
       296 cell writes to flash cache                                            50522
       296 cell overwrites in flash cache                                        43998
       296 cell flash cache read hits                                               36

SQL> @ses2 "select sid from v$session where program like '%DBW0%'" optimized

       SID NAME                                                                  VALUE
---------- ---------------------------------------------------------------- ----------
       296 physical read requests optimized                                         36
       296 physical read total bytes optimized                                  491520
       296 physical write requests optimized                                     25565
       296 physical write total bytes optimized                              279920640

If you are wondering why the cell writes to flash cache metric is roughly 2x bigger than physical write requests optimized, it’s because of the ASM double mirroring we use. The physical writes metrics are counted at the database-scope IO layer (KSFD), but the ASM mirroring is done at a lower layer in the Oracle process codepath (KFIO). So when the DBWR issues a 1 MB write, the v$sesstat metrics would record a 1 MB IO for it, but the ASM layer at the lower level would actually do 2 or 3x more IO due to double- or triple-mirroring. As the cell writes to flash cache metric is actually sent back from all storage cells involved in the actual (ASM-mirrored) write IOs, we will see around 2-3x more storage flash write hits than physical writes issued at the database level (depending on which mirroring level you use). Another way of saying this would be that the “physical writes” metrics are measured at a higher level, “above” the ASM mirroring, and the “flash hits” metrics are measured at a lower level, “below” the ASM mirroring in the IO stack.
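If you want to confirm which mirroring factor applies in your environment, the diskgroup redundancy is visible from the database instance too – NORMAL means 2-way and HIGH means 3-way mirroring (just a quick sanity check, not part of the flash cache accounting itself):

SQL> SELECT name, type FROM v$asm_diskgroup;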

Oracle X$ tables – Part 1 – Where do they get their data from?


It’s long-time public knowledge that X$ fixed tables in Oracle are just “windows” into Oracle’s memory. So whenever you query an X$ table, the FIXED TABLE rowsource function in your SQL execution plan will just read some memory structure, parse its output and show you the results in tabular form. This is correct, but not the whole truth.

Check this example. Let’s query the X$KSUSE table, which is used by V$SESSION:

SQL> SELECT addr, indx, ksuudnam FROM x$ksuse WHERE rownum <= 5;

ADDR           INDX KSUUDNAM
-------- ---------- ------------------------------
391513C4          1 SYS
3914E710          2 SYS
3914BA5C          3 SYS
39148DA8          4 SYS
391460F4          5 SYS

Now let’s check in which Oracle memory region this memory address resides (SGA, PGA, UGA etc). I’m using my script fcha for this (Find CHunk Address). You should probably not run this script in busy production systems as it uses the potentially dangerous X$KSMSP fixed table:

SQL> @fcha 391513C4
Find in which heap (UGA, PGA or Shared Pool) the memory address 391513C4 resides...

WARNING!!! This script will query X$KSMSP, which will cause heavy shared pool latch contention
in systems under load and with large shared pool. This may even completely hang
your instance until the query has finished! You probably do not want to run this in production!

Press ENTER to continue, CTRL+C to cancel...

LOC KSMCHPTR   KSMCHIDX   KSMCHDUR KSMCHCOM           KSMCHSIZ KSMCHCLS   KSMCHTYP KSMCHPAR
--- -------- ---------- ---------- ---------------- ---------- -------- ---------- --------
SGA 39034000          1          1 permanent memor     3977316 perm              0 00

SQL>

Ok, these X$KSUSE (V$SESSION) records reside in a permanent allocation in SGA and my X$ query apparently just parsed & presented the information from there.

Now, let’s query something else, for example the “Soviet Union” view X$KCCCP:

SQL> SELECT addr, indx, inst_id, cptno FROM x$kcccp WHERE rownum <= 5;

ADDR           INDX    INST_ID      CPTNO
-------- ---------- ---------- ----------
F692347C          0          1          1
F692347C          1          1          2
F692347C          2          1          3
F692347C          3          1          4
F692347C          4          1          5

Ok, let’s see where do these records reside:

SQL> @fcha F692347C
Find in which heap (UGA, PGA or Shared Pool) the memory address F692347C resides...

WARNING!!! This script will query X$KSMSP, which will cause heavy shared pool latch contention
in systems under load and with large shared pool. This may even completely hang
your instance until the query has finished! You probably do not want to run this in production!

Press ENTER to continue, CTRL+C to cancel...

LOC KSMCHPTR   KSMCHIDX   KSMCHDUR KSMCHCOM           KSMCHSIZ KSMCHCLS   KSMCHTYP KSMCHPAR
--- -------- ---------- ---------- ---------------- ---------- -------- ---------- --------
UGA F6922EE8                       kxsFrame4kPage         4124 freeabl           0 00

SQL>

Wow, why does the X$KCCCP data reside in my session’s UGA? This is where the extra complication (and sophistication) of X$ fixed tables comes into play!

Some X$ tables do not simply read whatever is in some memory location – they have helper functions associated with them (something like the fixed packages that the ASM instance uses internally). So, whenever you query such an X$, a helper function is called first, which retrieves the source data from wherever it needs to, copies it into your UGA in the format corresponding to this X$, and then the normal X$ memory location parsing & presentation code kicks in.

If you trace what the X$KCCCP access does, you’d see a bunch of control file parallel read wait events every time you query the X$ table (to retrieve the checkpoint progress records). So this X$ is not doing just a passive read-only presentation of some memory structure (array). The helper function will first do some real work: it allocates some runtime memory for the session (the kxsFrame4kPage chunk in UGA) and copies the results of its work into this UGA area – so that the X$ array & offset parsing code can read and present it back to the query engine.
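You can see this for yourself with plain extended SQL trace – a minimal sketch (the COUNT(*) is just there to force a full pass over the X$):

SQL> ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

SQL> SELECT COUNT(*) FROM x$kcccp;

SQL> ALTER SESSION SET EVENTS '10046 trace name context off';

The resulting tracefile should show control file parallel read waits for each execution against X$KCCCP.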

In other words, the ADDR column in X$ tables does not necessarily show where the data it presents ultimately lives, but just where the final array that got parsed for presentation happened to be. Sometimes the parsed data structure is the ultimate source where the data comes from; sometimes a helper function needs to do a bunch of work first (like taking latches and walking linked lists for X$KSMSP, or even doing physical disk reads from controlfiles for X$KCCCP access).

And more, let’s run the same query against X$KCCCP twice:

SQL> SELECT addr, indx, inst_id, cptno FROM x$kcccp WHERE rownum <= 5;

ADDR           INDX    INST_ID      CPTNO
-------- ---------- ---------- ----------
F69254B4          0          1          1
F69254B4          1          1          2
F69254B4          2          1          3
F69254B4          3          1          4
F69254B4          4          1          5

And once more:

SQL> SELECT addr, indx, inst_id, cptno FROM x$kcccp WHERE rownum <= 5;

ADDR           INDX    INST_ID      CPTNO
-------- ---------- ---------- ----------
F692B508          0          1          1
F692B508          1          1          2
F692B508          2          1          3
F692B508          3          1          4
F692B508          4          1          5

See how the ADDR column has changed between executions even though we are querying the same data! This is not because the controlfiles or the source data have somehow relocated. It’s simply that the temporary cursor execution scratch area, where the final data structure was put for presentation (the kxsFrame4kPage chunk in UGA), happened to be allocated from different locations for the two executions.

There may be exceptions, but as long as the ADDR resides in the SGA, I’d say it’s the actual location where the data lives – but when it’s in UGA/PGA, it may be just the temporary cursor scratch area and the source data was taken from somewhere else (especially when the ADDR constantly changes or alternates between 2-3 different values when repeatedly running your X$ query). Note that there are X$ tables which intentionally read data from arrays in your UGA (the actual source data lives in the UGA or PGA itself), but more about that in the future.

Hotsos Symposium 2014


After missing last year’s Hotsos Symposium (trying to cut my travel as you know :), I will present at and deliver the full-day Training Day at this year’s Hotsos Symposium! It will be my 10th time to attend (and speak at) this awesome conference. So I guess this means more beer than usual. Or maybe less, as I’m getting old. Let’s make it as usual, then :0)

I have (finally) sent the abstract and the TOC of the Training Day to the Hotsos folks and they’ve been uploaded. So, check out the conference sessions and the training day contents here. I aim to keep my training day very practical – I’ll just be showing how I troubleshoot most issues that I hit, with plenty of examples. It will be suitable both for developers and DBAs. In the last part of the training day I will talk about some Oracle 12c internals and will dive a bit deeper into the lower levels of troubleshooting, so we can have some fun too.

Looks like we’ll be having some good time!


Where does the Exadata storage() predicate come from?


On Exadata (or when setting cell_offload_plan_display = always on non-Exadata) you may see the storage() predicate in addition to the usual access() and filter() predicates in an execution plan:

SQL> SELECT * FROM dual WHERE dummy = 'X';

D
-
X

Check the plan:

SQL> @x
Display execution plan for last statement for this session from library cache...

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID  dtjs9v7q7zj1g, child number 0
-------------------------------------
SELECT * FROM dual WHERE dummy = 'X'

Plan hash value: 272002086

------------------------------------------------------------------------
| Id  | Operation                 | Name | E-Rows |E-Bytes| Cost (%CPU)|
------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |        |       |     2 (100)|
|*  1 |  TABLE ACCESS STORAGE FULL| DUAL |      1 |     2 |     2   (0)|
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("DUMMY"='X')
       filter("DUMMY"='X')

The access() and filter() predicates come from the corresponding ACCESS_PREDICATES and FILTER_PREDICATES columns in V$SQL_PLAN. But there’s no STORAGE_PREDICATES column there!

SQL> @desc v$sql_plan
           Name                            Null?    Type
           ------------------------------- -------- ----------------------------
    1      ADDRESS                                  RAW(4)
    2      HASH_VALUE                               NUMBER
    3      SQL_ID                                   VARCHAR2(13)
  ...
   33      TEMP_SPACE                               NUMBER
   34      ACCESS_PREDICATES                        VARCHAR2(4000)
   35      FILTER_PREDICATES                        VARCHAR2(4000)
   36      PROJECTION                               VARCHAR2(4000)
  ...
   40      OTHER_XML                                CLOB

So where does the storage predicate come from then?

The answer is that there is no storage() predicate column in any V$ views. The storage() predicate actually comes from the ACCESS_PREDICATES column, but the DBMS_XPLAN.DISPLAY functions just have extra logic in them: if the execution plan line (the OPTIONS column in V$SQL_PLAN) contains the STORAGE string, then any access() predicates for that line must be displayed as storage() predicates instead!

SQL> SELECT id, access_predicates,filter_predicates FROM v$sql_plan WHERE sql_id = 'dtjs9v7q7zj1g' AND child_number = 0;

        ID ACCESS_PREDICATES    FILTER_PREDICATES
---------- -------------------- --------------------
         0
         1 "DUMMY"='X'          "DUMMY"='X'

This actually makes sense, as the filter() predicates are the “dumb brute-force” predicates that are not able to pass any information (about what values they are looking for) into the access path row source they are filtering. In other words, a filter() function fetches all the rows from its rowsource and throws away everything that doesn’t match the filter condition.

The access() predicate, on the other hand, is able to pass the value (or range) it’s looking for into its row source. For example, when doing an index unique lookup, the access() predicate can send the value your query is looking for right into the index traversing code, so you only retrieve the rows you want, as opposed to retrieving everything and throwing the unwanted rows away.

So the access() predicate traditionally showed up for index access paths and also hash join row sources, but never for full table scans. Now, with Exadata, even full table scans can work in a smart way (allowing you to pass the values you’re looking for into the storage layer), so some of the full scanning row sources support the access() predicate now too – with the catch that if the OPTIONS column in V$SQL_PLAN contains “STORAGE”, the access() predicates are shown as storage().
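A rough emulation of that display rule straight against V$SQL_PLAN could look like the sketch below (just to illustrate the logic, not how DBMS_XPLAN actually implements it):

SQL> SELECT id, operation, options,
            CASE WHEN options LIKE '%STORAGE%'
                 THEN 'storage(' || access_predicates || ')'
                 ELSE 'access('  || access_predicates || ')'
            END predicate_display,
            filter_predicates
       FROM v$sql_plan
      WHERE sql_id = 'dtjs9v7q7zj1g'
        AND child_number = 0
        AND access_predicates IS NOT NULL;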

Note that the SQL Monitor reports (to my knowledge) still don’t support this display logic, so you would see row sources like TABLE ACCESS STORAGE FULL with filter() and access() predicates on them – the access() on these STORAGE row sources really means storage().

Slides of my previous presentations


Here are the slides of some of my previous presentations (that I haven’t made public yet, other than delivering these at conferences and training sessions):

Scripts and Tools That Make Your Life Easier and Help to Troubleshoot Better:

  • I delivered this presentation at the Hotsos Symposium Training Day in year 2010:

 

Troubleshooting Complex Performance Issues – Part1:

 

Troubleshooting Complex Performance Issues – Part2

 

Oracle Memory Troubleshooting, Part 4: Drilling down into PGA memory usage with V$PROCESS_MEMORY_DETAIL


If you haven’t read them yet – here are the previous articles in the Oracle memory troubleshooting series: Part 1, Part 2, Part 3.

Let’s say you have noticed that one of your Oracle processes is consuming a lot of private memory. The V$PROCESS view has the PGA_USED_MEM / PGA_ALLOC_MEM columns for this. Note that this view will tell you what Oracle thinks it’s using – how many allocated/freed bytes it has kept track of. While this doesn’t always reflect the true memory usage of a process, as other non-Oracle-heap allocation routines and the OS libraries may allocate (and leak) memory of their own, it’s a good starting point and usually enough.
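For example, a quick way to list the top PGA consumers of an instance from V$PROCESS (just a convenience sketch, the MB rounding is mine):

SQL> SELECT spid, pid, program,
            ROUND(pga_used_mem  / 1048576) used_mb,
            ROUND(pga_alloc_mem / 1048576) alloc_mb,
            ROUND(pga_max_mem   / 1048576) max_mb
       FROM v$process
      ORDER BY pga_alloc_mem DESC;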

Then, the V$PROCESS_MEMORY view would allow you to see a basic breakdown of that process’es memory usage – is it for SQL, PL/SQL, Java, unused (Freeable) or for “Other” reasons. You can use either the smem.sql or pmem.sql scripts for this (report v$process_memory for a SID or OS PID):

SQL> @smem 198
Display session 198 memory usage from v$process_memory....

       SID        PID    SERIAL# CATEGORY         ALLOCATED       USED MAX_ALLOCATED
---------- ---------- ---------- --------------- ---------- ---------- -------------
       198         43         17 Freeable           1572864          0
       198         43         17 Other              5481102                  5481102
       198         43         17 PL/SQL                2024        136          2024
       198         43         17 SQL              117805736  117717824     118834536

From the above output we see that this session has allocated over 100MB of private memory for “SQL” reasons. This normally means SQL workareas, so we can break this down further by querying V$SQL_WORKAREA_ACTIVE, which shows us all currently in-use cursor workareas in the instance. I’m using a script wrka.sql for convenience – and listing only my SID’s workareas:

SQL> @wrka sid=198
Show Active workarea memory usage for where sid=198...

   INST_ID        SID  QCINST_ID      QCSID SQL_ID        OPERATION_TYPE                  PLAN_LINE POLICY                   ACTIVE_SEC ACTUAL_MEM_USED MAX_MEM_USED WORK_AREA_SIZE NUMBER_PASSES TEMPSEG_SIZE TABLESPACE
---------- ---------- ---------- ---------- ------------- ------------------------------ ---------- ------------------------ ---------- --------------- ------------ -------------- ------------- ------------ ------------------------------
         1        198                       ff8v9qhv21pm5 SORT (v2)                               1 AUTO                           14.6        64741376    104879104       97623040             0   2253389824 TEMP
         1        198                       ff8v9qhv21pm5 HASH-JOIN                               6 AUTO                           14.8         1370112      1370112        2387968             0
         1        198                       ff8v9qhv21pm5 BUFFER                                 25 AUTO                           14.8        11272192     11272192       11272192             0

The ACTUAL_MEM_USED column above shows the memory currently used by this workarea (which happens to be a SORT (v2) operation in that cursor’s execution plan line #1). It was only about 64MB at the time I got to query this view, but the MAX_MEM_USED shows it was about 100MB at its peak. This can happen due to multipass operations, where the merge phase may use less memory than the sort phase – or because once the sorting completed and the rowsource was ready to start sending sorted rows back, not that much memory was needed anymore for just buffering the blocks read from TEMP (the sort_area_size vs. sort_area_retained_size thing from the past).
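If you don’t have the wrka.sql script at hand, a stripped-down query against V$SQL_WORKAREA_ACTIVE gives you the essentials (a sketch with a trimmed column list):

SQL> SELECT sid, sql_id, operation_type, operation_id,
            actual_mem_used, max_mem_used, work_area_size,
            number_passes, tempseg_size
       FROM v$sql_workarea_active
      WHERE sid = 198;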

For completeness, I also have a script called wrkasum.sql that summarizes the workarea memory usage of all sessions in an instance – so if you’re not interested in a single session, but rather in a summary of which operation types tend to consume the most memory, you can use that:

SQL> @wrkasum
Top allocation reason by PGA memory usage

OPERATION_TYPE      POLICY      ACTUAL_PGA_MB ALLOWED_PGA_MB    TEMP_MB NUM_PASSES     NUM_QC NUM_SESSIONS 
------------------- ----------- ------------- -------------- ---------- ---------- ---------- ------------ 
SORT (v2)           AUTO                   58            100       1525          0          1            1            
BUFFER              AUTO                   11             11                     0          1            1            
HASH-JOIN           AUTO                    1              2                     0          1            1

You may want to modify the script to change the GROUP BY to SQL_ID if you want to list the top workarea-memory-consuming SQL statements across the whole instance (or to any other column of interest – like QC_INST_ID/QCSID), as in the sketch below.
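A minimal version of such a per-SQL_ID summary (assuming the same underlying V$SQL_WORKAREA_ACTIVE view that wrkasum.sql uses):

SQL> SELECT sql_id,
            ROUND(SUM(actual_mem_used) / 1048576) actual_pga_mb,
            ROUND(SUM(work_area_size)  / 1048576) allowed_pga_mb,
            ROUND(SUM(tempseg_size)    / 1048576) temp_mb,
            COUNT(DISTINCT sid)                   num_sessions
       FROM v$sql_workarea_active
      GROUP BY sql_id
      ORDER BY actual_pga_mb DESC;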

But what about the following example:

SQL> @pmem 27199
Display process memory usage for SPID 27199...

       SID SPID                            PID    SERIAL# CATEGORY         ALLOCATED       USED MAX_ALLOCATED     CON_ID
---------- ------------------------ ---------- ---------- --------------- ---------- ---------- ------------- ----------
      1516 27199                           120        198 Freeable            786432          0                        0
      1516 27199                           120        198 Other            842807461                842807461          0
      1516 27199                           120        198 PL/SQL              421064      77296        572344          0
      1516 27199                           120        198 SQL                2203848      50168       2348040          0

Most of the memory (over 800MB) is consumed by category “Other”?! Not that helpful, huh? V$SQL_WORKAREA_ACTIVE didn’t show anything either as it deals only with SQL workareas and not all the other possible reasons why an Oracle process might allocate memory.

So we need a way to drill down into the Other category and see which allocation reasons have taken most of this memory. Historically this was only doable with a PGA/UGA memory heapdump and by aggregating the resulting dumpfile. You have to use oradebug to get the target process to dump its own private memory breakdown, as it is private memory and other processes cannot just read it directly. I have written about it in Part 1 of the Oracle memory troubleshooting series.

Update: an alternative to ORADEBUG is to use ALTER SESSION SET EVENTS ‘immediate trace name pga_detail_get level N’ where N is the Oracle PID of the process. 
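For example, for the target process with Oracle PID 49 (the same PID used in the ORADEBUG example below), that alternative would look like this:

SQL> ALTER SESSION SET EVENTS 'immediate trace name pga_detail_get level 49';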

However, starting from Oracle 10.2 you can get similar detailed breakdown info by querying V$PROCESS_MEMORY_DETAIL – no need for post-processing tracefiles! But when you just query it, the view does not return any rows:

SQL> SELECT * FROM v$process_memory_detail;

no rows selected

Again, this is for the abovementioned reason – your current process cannot just read the contents of some other process’es private memory; the OS ensures that. You will have to ask the target process to populate V$PROCESS_MEMORY_DETAIL with its memory allocation breakdown. You can do this by using the ORADEBUG DUMP PGA_DETAIL_GET command:

SQL> ORADEBUG SETMYPID
Statement processed.
SQL> ORADEBUG DUMP PGA_DETAIL_GET 49
Statement processed.

The number 49 above is the Oracle PID (v$process.pid) of the target process I want to examine. The oradebug PGA_DETAIL_GET command will not immediately make the target process report its usage – it will merely set a flag somewhere, and the target process itself checks it when it is active. In other words, if the target process is idle or sleeping for a long time (due to some lock, for example), then it won’t populate the V$ view with the required data. In my test environment, the V$PROCESS_MEMORY_DETAIL got populated only after I ran another dummy command in the target session. This shouldn’t be an issue if you are examining a process that’s actively doing something (and not idle/sleeping for a long time).

The output below is from another dummy demo session that wasn’t using much of memory:

SQL> SELECT * FROM v$process_memory_detail ORDER BY pid, bytes DESC;

       PID    SERIAL# CATEGORY        NAME                       HEAP_NAME            BYTES ALLOCATION_COUNT HEAP_DES PARENT_H
---------- ---------- --------------- -------------------------- --------------- ---------- ---------------- -------- --------
        49          5 Other           permanent memory           pga heap            162004               19 11B602C0 00
        49          5 SQL             QERHJ Bit vector           QERHJ hash-joi      131168                8 F691EF4C F68F6F7C
        49          5 Other           kxsFrame4kPage             session heap         57736               14 F68E7134 11B64780
        49          5 SQL             free memory                QERHJ hash-joi       54272                5 F691EF4C F68F6F7C
        49          5 Other           free memory                pga heap             41924                8 11B602C0 00
        49          5 Other           miscellaneous                                   39980              123 00       00
        49          5 Other           Fixed Uga                  Fixed UGA heap       36584                1 F6AA44B0 11B602C0
        49          5 Other           permanent memory           top call heap        32804                2 11B64660 00
        49          5 Other           permanent memory           session heap         32224                2 F68E7134 11B64780
        49          5 Other           free memory                top call heap        31692                1 11B64660 00
        49          5 Other           kgh stack                  pga heap             17012                1 11B602C0 00
        49          5 Other           kxsFrame16kPage            session heap         16412                1 F68E7134 11B64780
        49          5 Other           dbgeInitProcessCtx:InvCtx  diag pga             15096                2 F75A8630 11B602C0
...

The BYTES column shows the sum of memory allocated from the private memory heap HEAP_NAME for the reason shown in the NAME column. If you want to know the average allocation (chunk) size in the heap, divide BYTES by ALLOCATION_COUNT.
For example, the top PGA memory user in that process is an allocation called “permanent memory”, 162004 bytes taken straight from the top-level “pga heap”. It probably contains all kinds of low-level runtime allocations that the process needs for its own purposes. It may be possible to drill down into the subheaps inside that allocation with the Oracle memory top-5 subheap dumping I have written about before.

The 2nd biggest memory user is in category SQL – “QERHJ Bit vector” allocation, 131168 bytes allocated in 8 chunks of ~16kB each (on average). QERHJ should mean Query Execution Row-source Hash-Join and the hash join bit vector is a hash join optimization (somewhat like a bloom filter on hash buckets) – Jonathan Lewis has written about this in his CBO book.
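If you want that average chunk size computed for you, a quick convenience query against the same view could look like this (a sketch; the NULLIF just guards against division by zero):

SQL> SELECT category, name, heap_name, bytes, allocation_count,
            ROUND(bytes / NULLIF(allocation_count, 0)) avg_chunk_bytes
       FROM v$process_memory_detail
      WHERE pid = 49
      ORDER BY bytes DESC;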

I do have a couple of scripts which automate running the ORADEBUG command, waiting for a second so that the target process has a chance to publish its data in V$PROCESS_MEMORY_DETAIL, and then querying it. Check out smem_detail.sql and pmem_detail.sql.

Now, let’s look into a real example from a problem case – a stress test environment on Oracle 12c:

SQL> @smem 1516
Display session 1516 memory usage from v$process_memory....

       SID        PID    SERIAL# CATEGORY         ALLOCATED       USED MAX_ALLOCATED     CON_ID
---------- ---------- ---------- --------------- ---------- ---------- ------------- ----------
      1516        120        198 Freeable            786432          0                        0
      1516        120        198 Other            844733773                844733773          0
      1516        120        198 PL/SQL              421064      77296        572344          0
      1516        120        198 SQL                 277536      45904       2348040          0

The Other memory usage of a session has grown to over 800MB!

Let’s drill down deeper. The script warns that it’s experimental and asks you to press enter to continue as it’s using ORADEBUG. I haven’t seen any problems with it, but use it at your own risk (and stay away from critical background processes on production systems)!

SQL> @smem_detail 1516

WARNING! About to run an undocumented ORADEBUG command
for getting heap details.
This script is EXPERIMENTAL, use at your own risk!

Press ENTER to continue, or CTRL+C to cancel

PL/SQL procedure successfully completed.

STATUS
----------
COMPLETE

If the status above is not COMPLETE then you need to wait
for the target process to do some work and re-run the
v$process_memory_detail query in this script manually
(or just take a heapdump level 29 to get heap breakdown
in a tracefile)

       SID CATEGORY        NAME                       HEAP_NAME            BYTES ALLOCATION_COUNT
---------- --------------- -------------------------- --------------- ---------- ----------------
      1516 Other           permanent memory           qmxlu subheap    779697376           203700
      1516 Other           free memory                qmxlu subheap     25960784           202133
      1516 Other           XVM Storage                XVM subheap of     5708032               51
      1516 Other           free memory                session heap       2722944              598
      1516 Other           permanent memory           pga heap            681992               36
      1516 Other           qmushtCreate               qmtmInit            590256                9
      1516 Other           free memory                top uga heap        449024              208
      1516 Other           qmtmltAlloc                qmtmInit            389680             1777
      1516 Other           permanent memory           kolarsCreateCt      316960               15
      1516 Other           free memory                pga heap            306416               17
      1516 Other           miscellaneous                                  297120              105
      1516 Other           permanent memory           qmxtgCreateBuf      279536               73
      1516 Other           free memory                koh dur heap d      239312              134
      1516 Other           kxsFrame4kPage             session heap        232512               56
      1516 Other           permanent memory           qmcxdDecodeIni      228672               21
      1516 Other           permanent memory           qmxtigcp:heap       215936              730
      1516 Other           permanent memory           session heap        189472               28
      1516 Other           free memory                lpxHeap subhea      182760               32
      1516 Other           kfioRqTracer               pga heap            131104                1
      1516 Other           free memory                top call heap       129312                4
      1516 PL/SQL          recursive addr reg file    koh-kghu sessi      110592               10
      1516 Other           free memory                callheap            109856                4
      1516 Other           koh-kghu session heap      session heap         88272               36
      1516 Other           Fixed Uga                  pga heap             72144                1
      1516 PL/SQL          PL/SQL STACK               PLS PGA hp           68256                4
...

Well, there you go – the power of measuring & profiling. Most of that big memory usage comes from something called qmxlu subheap. Now, while this name is cryptic and we don’t know what it means, we are already half-way there – we at least know what to focus on now. We can ignore all the other hundreds of cryptic memory allocations in the output and just try to figure out what “qmxlu subheap” is. A quick MOS search might just tell us, and if there are known bugs related to this memory leak, you might find what’s affecting you right away (as Oracle support analysts may have pasted some symptoms, patch info and workarounds into the bug note):

(Screenshot: My Oracle Support knowledge base search results for the top memory allocation reason names)

Indeed, there are plenty of results in MOS, and when browsing through them to find the one matching our symptoms and environment the closest, I looked into this: ORA-4030 With High Allocation Of “qmxdpls_subheap” (Doc ID 1509914.1). It came up in the search because the support analyst had pasted a recursive subheap dump containing our symptom – “qmxlu subheap” – in there:

Summary of subheaps at depth 2
5277 MB total:
 5277 MB commented, 128 KB permanent
 174 KB free (110 KB in empty extents),
   2803 MB, 1542119496 heaps:   "               "
   1302 MB, 420677 heaps:   "qmxlu subheap  "
    408 MB, 10096248 chunks:  "qmxdplsArrayGetNI1        " 2 KB free held
    385 MB, 10096248 chunks:  "qmxdplsArrayNI0           " 2 KB free held

In this note, the referenced bug had been closed as “not a bug”, hinting that it may be an application issue (an application “object” leak) rather than an internal memory leak that causes this memory usage growth.

Cause:

The cause of this problem has been identified in:
unpublished Bug:8918821 – MEMORY LEAK IN DBMS_XMLPARSER IN QMXDPLS_SUBHEAP
closed as “not a bug”. The problem is caused by the fact that the XML document is created with XMLDOM.CREATEELEMENT, but after creation XMLDOM.FREEDOCUMENT is not called. This causes the XML used heaps to remain allocated. Every new call to XMLDOM.CREATEELEMENT will then allocate a new heap, causing process memory to grow over time, and hence cause the ORA-4030 error to occur in the end.

Solution:

To implement a solution for this issue, use XMLDOM.FREEDOCUMENT to explicitly free any explicitly or implicitly created XML document, so the memory associated with that document can be released for reuse.

And indeed, in our case it turned out that it was an application issue – the application did not free the XMLDOM documents after use, slowly accumulating more and more open document memory structures, using more memory and also more CPU time (as, judging by the ALLOCATION_COUNT figure in the smem_detail output above, the internal array used for managing the open document structures had grown to 203700 entries). Once the application object leak issue was fixed, the performance and memory usage problem went away.

Summary:

V$PROCESS_MEMORY_DETAIL allows you to conveniently dig deeper into process PGA memory usage. The alternative is to use Oracle heapdumps. A few more useful comments about it are in an old Oracle-L post.

Normally my process memory troubleshooting & drilldown sequence goes like that (usually only steps 1-2 are enough, 3-4 are rarely needed):

  1. v$process / v$process_memory / top / ps
  2. v$sql_workarea_active
  3. v$process_memory_detail or heapdump_analyzer
  4. pmap -x at OS level

#1, 2, 3 above can show you “session” level memory usage (assuming that you are using dedicated servers with a 1:1 relationship between a session and a process) and #4 can show you a different view into the real process memory usage from the OS perspective.

Even though you may see cryptic allocation reason names in the output, if reason X causes 95% of your problem, you’ll need to focus on finding out what X means and don’t need to waste time on anything else. If there’s an Oracle bug involved, a MOS search by top memory consumer names would likely point you to the relevant bug right away.

Oracle troubleshooting is fun!

Note that this year’s only Advanced Oracle Troubleshooting class takes place at the end of April/May 2014, so sign up now if you plan to attend this year!

What the heck are the /dev/shm/JOXSHM_EXT_x files on Linux?


There was an interesting question on Oracle-L about the JOXSHM_EXT_* files in the /dev/shm directory on Linux. Basically something like this:

$ ls -l /dev/shm/* | head
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:16 /dev/shm/JOXSHM_EXT_0_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:16 /dev/shm/JOXSHM_EXT_100_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:16 /dev/shm/JOXSHM_EXT_101_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:23 /dev/shm/JOXSHM_EXT_102_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:23 /dev/shm/JOXSHM_EXT_103_LIN112_1409029
-rwxrwx--- 1 oracle dba 36864 Apr 18 10:23 /dev/shm/JOXSHM_EXT_104_LIN112_1409029
...

There are a few interesting MOS articles about these files and how/when to get rid of those (don’t remove any files before reading the notes!), but none of these articles explain why these JOXSHM (and PESHM) files are needed at all:

  • /dev/shm Filled Up With Files In Format JOXSHM_EXT_xxx_SID_xxx (Doc ID 752899.1)
  • Stale Native Code Files Are Being Cached with File Names Such as: JOXSHM_EXT*, PESHM_EXT*, PESLD* or SHMDJOXSHM_EXT* (Doc ID 1120143.1)
  • Ora-7445 [Ioc_pin_shared_executable_object()] (Doc ID 1316906.1)

Here’s an explanation – a bit more elaborate version of what I already posted to Oracle-L:

The JOX files are related to Oracle’s in-database JVM JIT compilation. So, instead of interpreting the JVM bytecode at runtime, Oracle compiles it to architecture-specific native binary code – just like compiling C code with something like gcc would do. The CPUs can then execute that binary code directly, without any interpretation layers in between.

Now the question is: how do we load that binary code into our own process address space, so that the CPUs can execute this stuff directly?

This is why the JOX files exist. When JIT compilation is enabled (it’s on by default in Oracle 11g), the Java code you access in the database will be compiled to machine code and saved into the JOX files. Each Java class or method gets its own file (I haven’t checked which it is exactly). Your Oracle process then maps those files into its address space with an mmap() system call. So, any time this compiled Java code has to be executed, your Oracle process can just jump to the compiled method’s address (and return back later).

Let’s do a little test:

SQL> SHOW PARAMETER jit

PARAMETER_NAME                                               TYPE        VALUE
------------------------------------------------------------ ----------- -----
java_jit_enabled                                             boolean     TRUE

Java just-in-time compilation is enabled. Where the JOX files are put by default is OS-specific, but on Linux they will go to /dev/shm (the in-memory filesystem) unless you specify some other directory with the _ncomp_shared_objects_dir parameter (and you’re not hitting one of the related bugs).
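For example, relocating the files to a regular directory would look something like the sketch below. The path is purely hypothetical, and as it’s an underscore parameter, you should only touch it under Oracle Support’s guidance:

SQL> -- /u01/app/oracle/ncomp is just an example path, not a recommendation
SQL> ALTER SYSTEM SET "_ncomp_shared_objects_dir" = '/u01/app/oracle/ncomp' SCOPE = SPFILE;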

So let’s run some Java code in the Database:

SQL> SELECT DBMS_JAVA.GETVERSION FROM dual;

GETVERSION
--------------------------------------------
11.2.0.4.0

After this execution a bunch of JOX files showed up in /dev/shm:

$ ls -l /dev/shm | head
total 77860
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_0_LIN112_229381
-rwxrwx--- 1 oracle dba        12288 May  9 22:13 JOXSHM_EXT_10_LIN112_229381
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_11_LIN112_229381
-rwxrwx--- 1 oracle dba         8192 May  9 22:13 JOXSHM_EXT_12_LIN112_229381
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_13_LIN112_229381
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_14_LIN112_229381
...

When I check that process’es address space with pmap, I see that some of these JOX files are also mapped into its address space:

oracle@oel6:~$ sudo pmap -x 33390
33390:   oracleLIN112 (LOCAL=NO)
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000008048000  157584   21268       0 r-x--  oracle
0000000011a2c000    1236     372      56 rw---  oracle
0000000011b61000     256     164     164 rw---    [ anon ]
0000000013463000     400     276     276 rw---    [ anon ]
0000000020000000    8192       0       0 rw-s-  SYSV00000000 (deleted)
0000000020800000  413696       0       0 rw-s-  SYSV00000000 (deleted)
0000000039c00000    2048       0       0 rw-s-  SYSV16117e54 (deleted)
00000000420db000     120     104       0 r-x--  ld-2.12.so
00000000420f9000       4       4       4 r----  ld-2.12.so
00000000420fa000       4       4       4 rw---  ld-2.12.so
00000000420fd000    1604     568       0 r-x--  libc-2.12.so
000000004228e000       8       8       8 r----  libc-2.12.so
0000000042290000       4       4       4 rw---  libc-2.12.so
0000000042291000      12      12      12 rw---    [ anon ]
0000000042296000      92      52       0 r-x--  libpthread-2.12.so
00000000422ad000       4       4       4 r----  libpthread-2.12.so
00000000422ae000       4       4       4 rw---  libpthread-2.12.so
00000000422af000       8       4       4 rw---    [ anon ]
00000000422b3000      12       8       0 r-x--  libdl-2.12.so
00000000422b6000       4       4       4 r----  libdl-2.12.so
00000000422b7000       4       4       4 rw---  libdl-2.12.so
00000000f63b9000       4       4       4 rwxs-  JOXSHM_EXT_88_LIN112_229381
00000000f63ba000      16      16      16 rwxs-  JOXSHM_EXT_91_LIN112_229381
00000000f63be000       4       4       4 rwxs-  JOXSHM_EXT_90_LIN112_229381
00000000f63bf000       4       4       4 rwxs-  JOXSHM_EXT_89_LIN112_229381

...

Note the x and s bits (in the rwxs-) for the JOX mapped segments: the x means that the Linux virtual memory manager allows the contents of these mapped files to be directly executed by the CPU, and the s means it’s a shared mapping (other processes can map this binary code into their address spaces as well).

Oracle can also load some of its binary libraries into its address space with the dynamic loader’s dlopen() call, but I verified using strace that the JOX files are “loaded” into the address space with just an mmap() syscall:

33390 open("/dev/shm/JOXSHM_EXT_85_LIN112_229381", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = 8
33390 mmap2(NULL, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 8, 0) = 0xfffffffff63c2000
33390 close(8)                          = 0
33390 open("/dev/shm/JOXSHM_EXT_87_LIN112_229381", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = 8
33390 mmap2(NULL, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 8, 0) = 0xfffffffff63c0000
33390 close(8)                          = 0
33390 open("/dev/shm/JOXSHM_EXT_89_LIN112_229381", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = 8
33390 mmap2(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 8, 0) = 0xfffffffff63bf000
33390 close(8)                          = 0
...

Just to check if these JOX files really are compiled machine-code instructions:

$ file /dev/shm/JOXSHM_EXT_88_LIN112_229381
/dev/shm/JOXSHM_EXT_88_LIN112_229381: data
$

Oops, the file command doesn’t recognize any specific file format from its contents, as it’s a bare slice of machine code and doesn’t contain the normal object module stuff the .o files would have… so let’s try to disassemble this file and see if it contains sensible instructions:

$ objdump -b binary -m i386 -D /dev/shm/JOXSHM_EXT_88_LIN112_229381 | head -30

/dev/shm/JOXSHM_EXT_88_LIN112_229381:     file format binary


Disassembly of section .data:

00000000 :
       0:	f7 15 1e 33 00 00    	notl   0x331e
       6:	00 00                	add    %al,(%eax)
       8:	55                   	push   %ebp
       9:	89 e5                	mov    %esp,%ebp
       b:	83 c4 c8             	add    $0xffffffc8,%esp
       e:	8b 45 10             	mov    0x10(%ebp),%eax
      11:	89 c1                	mov    %eax,%ecx
      13:	83 e1 f8             	and    $0xfffffff8,%ecx
      16:	89 4d e0             	mov    %ecx,-0x20(%ebp)
      19:	8b 4d 08             	mov    0x8(%ebp),%ecx
      1c:	89 45 f0             	mov    %eax,-0x10(%ebp)
      1f:	8b 45 0c             	mov    0xc(%ebp),%eax
      22:	89 5d ec             	mov    %ebx,-0x14(%ebp)
      25:	89 7d e8             	mov    %edi,-0x18(%ebp)
      28:	89 75 e4             	mov    %esi,-0x1c(%ebp)
      2b:	c7 45 f8 02 00 00 40 	movl   $0x40000002,-0x8(%ebp)
      32:	8b 91 1d 02 00 00    	mov    0x21d(%ecx),%edx
      38:	8b 7d 14             	mov    0x14(%ebp),%edi
      3b:	89 45 f4             	mov    %eax,-0xc(%ebp)
      3e:	8d 45 f0             	lea    -0x10(%ebp),%eax
      41:	83 c2 04             	add    $0x4,%edx
      44:	89 91 1d 02 00 00    	mov    %edx,0x21d(%ecx)
      4a:	39 d0                	cmp    %edx,%eax
...

This file contains machine code indeed!

So this is how Oracle approaches native JIT compilation for the in-database JVM. In 11g it’s similar to the PL/SQL native compilation too (you’d see various PESHM_ files in /dev/shm). Before 11g, Oracle actually generated intermediate C code for your PL/SQL and then invoked an OS C compiler on it, but in 11g it’s all self-contained in the database code. Pretty cool!
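By the way, if you’re curious whether your own PL/SQL units are natively compiled (and thus whether you should expect PESHM_ files at all), the plsql_code_type parameter and the *_PLSQL_OBJECT_SETTINGS views tell you – a quick check could look like this:

SQL> SHOW PARAMETER plsql_code_type

SQL> SELECT plsql_code_type, COUNT(*)
       FROM dba_plsql_object_settings
      GROUP BY plsql_code_type;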

 

Enkitec + Accenture = Even More Awesomeness!


Enkitec is the best consulting firm for hands-on implementation, running and troubleshooting of your Oracle-based systems, especially engineered systems like Exadata. We have a truly awesome group of people here; many are the best in their field (just look at the list!!!).

This is why I am here.

This is also why Accenture approached us some time ago – and you may already have seen today’s announcement that Enkitec got bought!

We all are now part of Accenture and this opens up a whole lot of new opportunities. I think this is BIG, and I will explain how I see the future (sorry, no Oracle Database internals in this post ;-)

In my opinion the single most important detail of this transaction is that both Enkitec and the folks at Accenture realize that the reason Enkitec is so awesome is that awesome techies want to work here. And we don’t just want to keep it that way – we must keep it that way!

The Enkitec group will not be dissolved into Accenture. If it were, we would disappear like a drop in the ocean and Accenture would have lost its investment. Instead, we will remain an island in the ocean, continuing to provide expert help for our existing and new customers – and in the long term helping Accenture build additional capability for the massive projects of their customers.

We will not have ten thousand people in our group. Instead, we will continue hiring (and retaining) people exactly the way we have been – organic growth by having only the best, like-minded people. The main difference is that now, with Accenture behind us, we can hire the best people globally, as we’ll have operations in over 50 countries. I understand that we won’t likely even double in size in the next few years – as we plan to stick to hiring only the best.

I think we will have a much, much wider reach now, showing how to do Oracle technology “our way” all around the world. With Accenture behind us, we will be navigating through even larger projects in larger businesses, influencing things earlier and more. And on a more personal note, I’m looking forward to all those 10 rack Exadata and 100TB In-Memory DB Option performance projects ;-)

See you at Enkitec E4 in June!

 
