Oracle – Tanel Poder's Performance & Troubleshooting blog

Combining Bloom Filter Offloading and Storage Indexes on Exadata


Here's a little known feature of Exadata – you can use a Bloom filter computed from the join column of one table to skip disk I/Os against another table it is joined to. This is not the same as the Bloom filtering of datablock contents inside the Exadata storage cells, but rather about avoiding reading some storage regions from the disks completely.

So, you can use storage indexes to skip I/Os against your large fact table, based on a bloom filter calculated from a small dimension table!

This is especially useful for dimensional star schemas, as your SQL statements might not have direct predicates on your large fact tables at all – all results are determined by looking up the relevant dimension records and then performing a hash join to the fact table (whether you should have some direct predicates against the fact tables, for performance reasons, is a separate topic for some other day :-)

Let me show an example using the SwingBench Order Entry schema. The first output is from Oracle 11.2.0.3 BP21 on Cellsrv 12.1.1.1.0:

SQL> ALTER SESSION SET "_serial_direct_read"=ALWAYS;

Session altered.

SQL> SELECT
  2      /*+ LEADING(c)
  3          NO_SWAP_JOIN_INPUTS(o)
  4          INDEX_RS_ASC(c(cust_email))
  5          FULL(o)
  6          MONITOR
  7      */
  8      *
  9  FROM
 10      soe.customers c
 11    , soe.orders o
 12  WHERE
 13      o.customer_id = c.customer_id
 14  AND c.cust_email = 'florencio@ivtboge.com'
 15  /

CUSTOMER_ID CUST_FIRST_NAME                CUST_LAST_NAME  ...
----------- ------------------------------ --------------
  399999199 brooks                         laxton        

Elapsed: 00:00:55.81

You can ignore the hints in the query – I just used these to get the plan I wanted for my demo. I also forced the serial full segment scans to use direct path reads (and thus attempt Smart Scans on Exadata) using the _serial_direct_read parameter.

Note that while I do have a direct filter predicate on the “small” CUSTOMERS table, I don’t have any predicates on the “large” ORDERS table, so there’s no filter predicate to offload to storage layer on the ORDERS table (but the column projection and HCC decompression can still be offloaded on that table too). Anyway, the query ran in over 55 seconds.

Let’s run this now with PARALLEL degree 2 and compare the results:

SQL> ALTER SESSION FORCE PARALLEL QUERY PARALLEL 2;

Session altered.

SQL> SELECT
  2      /*+ LEADING(c)
  3          NO_SWAP_JOIN_INPUTS(o)
  4          INDEX_RS_ASC(c(cust_email))
  5          FULL(o)
  6          MONITOR
  7      */
  8      *
  9  FROM
 10      soe.customers c
 11    , soe.orders o
 12  WHERE
 13      o.customer_id = c.customer_id
 14  AND c.cust_email = 'florencio@ivtboge.com'
 15  /

CUSTOMER_ID CUST_FIRST_NAME                CUST_LAST_NAME       
----------- ------------------------------ ---------------------
  399999199 brooks                         laxton               

Elapsed: 00:00:03.80

Now the query ran in less than 4 seconds. How come the same query ran close to 15 times faster than the serial execution? The parallel degree 2 for this simple query should give me at most 4 slaves doing the work… a 15x speedup indicates that something else has changed as well.

The first thing I would normally suspect is that perhaps the direct path reads were not attempted for the serial query (and Smart Scans did not kick in). That would sure explain the big performance difference… but I did explicitly force the serial direct path reads in my session. Anyway, let’s stop guessing and instead know for sure by measuring these experiments!

SQL Monitoring reports are a good starting point (I added the MONITOR hint to the query so that SQL Monitoring would kick in immediately even for serial queries).
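If you want to pull such a report yourself, here's a minimal sketch using the DBMS_SQLTUNE API (the SQL_ID below is just a placeholder – substitute your own statement's SQL_ID):

SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
           sql_id       => '0abcd1example',   -- placeholder SQL_ID
           type         => 'TEXT',
           report_level => 'ALL') AS report
FROM   dual;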

Here’s the slow serial query:

Global Stats
====================================================================================================
| Elapsed |   Cpu   |    IO    | Application |  Other   | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs  | Bytes | Offload |
====================================================================================================
|      56 |      49 |     6.64 |        0.01 |     0.01 |     2 |     4M | 53927 |  29GB |  14.53% |
====================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2205859845)
==========================================================================================================================================
| Id |               Operation               |     Name      | Execs |   Rows   | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                       |               |       | (Actual) | Bytes | Offload |   (%)    |        (# samples)        |
==========================================================================================================================================
|  0 | SELECT STATEMENT                      |               |     1 |        1 |       |         |          |                           |
|  1 |   HASH JOIN                           |               |     1 |        1 |       |         |    42.86 | Cpu (24)                  |
|  2 |    TABLE ACCESS BY GLOBAL INDEX ROWID | CUSTOMERS     |     1 |        1 |       |         |          |                           |
|  3 |     INDEX RANGE SCAN                  | CUST_EMAIL_IX |     1 |        1 |       |         |          |                           |
|  4 |    PARTITION HASH ALL                 |               |     1 |     458M |       |         |          |                           |
|  5 |     TABLE ACCESS STORAGE FULL         | ORDERS        |    64 |     458M |  29GB |  14.53% |    57.14 | Cpu (29)                  |
|    |                                       |               |       |          |       |         |          | cell smart table scan (3) |
==========================================================================================================================================

So, the serial query issued 29 GB worth of I/O to the Exadata storage cells, but only 14.53% less data was sent back – not that great a reduction in storage interconnect traffic. Also, it looks like all 458 million ORDERS table rows were sent back from the storage cells (as the ORDERS table didn't have any direct SQL predicates against it). Those rows were then fed to the HASH JOIN parent row source (DB CPU usage!), which threw away all of them but one. Talk about inefficiency!

Ok, this is the fast parallel query:

Global Stats
==============================================================================================================
| Elapsed | Queuing |   Cpu   |    IO    | Application |  Other   | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs  | Bytes | Offload |
==============================================================================================================
|    4.88 |    0.00 |    0.84 |     4.03 |        0.00 |     0.01 |     2 |     4M | 30056 |  29GB |  99.99% |
==============================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2121763430)
===============================================================================================================================================
| Id |                 Operation                  |     Name      | Execs |   Rows   | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                            |               |       | (Actual) | Bytes | Offload |   (%)    |        (# samples)        |       
===============================================================================================================================================
|  0 | SELECT STATEMENT                           |               |     3 |        1 |       |         |          |                           |       
|  1 |   PX COORDINATOR                           |               |     3 |        1 |       |         |          |                           |       
|  2 |    PX SEND QC (RANDOM)                     | :TQ10001      |     2 |        1 |       |         |          |                           |       
|  3 |     HASH JOIN                              |               |     2 |        1 |       |         |          |                           |       
|  4 |      BUFFER SORT                           |               |     2 |        2 |       |         |          |                           |       
|  5 |       PX RECEIVE                           |               |     2 |        2 |       |         |          |                           |       
|  6 |        PX SEND BROADCAST                   | :TQ10000      |     1 |        2 |       |         |          |                           |       
|  7 |         TABLE ACCESS BY GLOBAL INDEX ROWID | CUSTOMERS     |     1 |        1 |       |         |          |                           |       
|  8 |          INDEX RANGE SCAN                  | CUST_EMAIL_IX |     1 |        1 |       |         |          |                           |       
|  9 |      PX PARTITION HASH ALL                 |               |     2 |     4488 |       |         |          |                           |       
| 10 |       TABLE ACCESS STORAGE FULL            | ORDERS        |    64 |     4488 |  29GB |  99.99% |   100.00 | cell smart table scan (4) |
===============================================================================================================================================

The fast parallel query still issued 29 GB of I/O, but only 0.01% of that data (compared to the amount of I/O issued) was sent back from the storage cells, so the offload efficiency for that table scan is 99.99%. Also, only a few thousand rows were returned from the full table scan (so the throw-away by the hash join later in the plan is much smaller).

So, where does the big difference in runtime and offload efficiency metrics come from? It's the same data and the same query (looking for the same values), so why do we end up with such different Offload Efficiency percentages?

The problem with looking at a single metric showing some percentage (of what?!) is that it hides a lot of detail. So, let's get systematic and look into some detailed metrics :-)

I'll use the Exadata Snapper for this purpose, although I could just list the V$SESSTAT metrics it uses as its source. ExaSnapper lists the various Exadata I/O metrics of the monitored session all in one "chart", in the same unit (MB) and scale – so it is easy to compare these different cases. If you do not know what the Exadata Snapper is, then check out this article and video.
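For reference, here's a minimal sketch of pulling a few of the underlying raw counters straight from V$SESSTAT for a monitored session (the SID 123 below is just a placeholder):

SELECT sn.name, st.value
FROM   v$sesstat st, v$statname sn
WHERE  st.statistic# = sn.statistic#
AND    st.sid = 123                          -- placeholder SID of the monitored session
AND    sn.name IN ('physical read total bytes',
                   'cell physical IO bytes eligible for predicate offload',
                   'cell physical IO interconnect bytes returned by smart scan',
                   'cell physical IO bytes saved by storage index');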

Here’s the slow serial query (you may need to scroll the output right to see the MB numbers):

SQL> SELECT * FROM TABLE(exasnap.display_snap(:t1,:t2));

NAME
-------------------------------------------------------------------------------------------------------------------------------------------
-- ExaSnapper v0.81 BETA by Tanel Poder @ Enkitec - The Exadata Experts ( http://www.enkitec.com )
-------------------------------------------------------------------------------------------------------------------------------------------
DB_LAYER_IO                    DB_PHYSIO_BYTES               |##################################################|    29291 MB    365 MB/sec
DB_LAYER_IO                    DB_PHYSRD_BYTES               |##################################################|    29291 MB    365 MB/sec
DB_LAYER_IO                    DB_PHYSWR_BYTES               |                                                  |        0 MB      0 MB/sec
AVOID_DISK_IO                  PHYRD_FLASH_RD_BYTES          |################################################# |    29134 MB    363 MB/sec
AVOID_DISK_IO                  PHYRD_STORIDX_SAVED_BYTES     |                                                  |        0 MB      0 MB/sec
REAL_DISK_IO                   SPIN_DISK_IO_BYTES            |                                                  |      157 MB      2 MB/sec
REAL_DISK_IO                   SPIN_DISK_RD_BYTES            |                                                  |      157 MB      2 MB/sec
REAL_DISK_IO                   SPIN_DISK_WR_BYTES            |                                                  |        0 MB      0 MB/sec
REDUCE_INTERCONNECT            PRED_OFFLOADABLE_BYTES        |##################################################|    29291 MB    365 MB/sec
REDUCE_INTERCONNECT            TOTAL_IC_BYTES                |##########################################        |    24993 MB    312 MB/sec
REDUCE_INTERCONNECT            SMART_SCAN_RET_BYTES          |##########################################        |    24993 MB    312 MB/sec
REDUCE_INTERCONNECT            NON_SMART_SCAN_BYTES          |                                                  |        0 MB      0 MB/sec
CELL_PROC_DEPTH                CELL_PROC_DATA_BYTES          |##################################################|    29478 MB    368 MB/sec
CELL_PROC_DEPTH                CELL_PROC_INDEX_BYTES         |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_TO_CLIENT_BYTES           |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_FROM_CLIENT_BYTES         |                                                  |        0 MB      0 MB/sec

The DB_PHYSRD_BYTES line says this session submitted 29291 MB of read I/O as far as the DB layer sees it. PRED_OFFLOADABLE_BYTES says that all of this 29291 MB of I/O was attempted in a smart way (Smart Scan). SMART_SCAN_RET_BYTES tells us that 24993 MB worth of data was sent back from the storage cells as a result of the Smart Scan (about 14.5% less than the amount of I/O issued).

However, PHYRD_STORIDX_SAVED_BYTES shows zero – we couldn't avoid (skip) any I/Os when scanning the ORDERS table. We couldn't use the storage indexes, as we did not have any direct filter predicates on the ORDERS table.

Anyway, the story is different for the fast parallel query:

SQL> SELECT * FROM TABLE(exasnap.display_snap(:t1,:t2));     

NAME
-------------------------------------------------------------------------------------------------------------------------------------------
-- ExaSnapper v0.81 BETA by Tanel Poder @ Enkitec - The Exadata Experts ( http://www.enkitec.com )              
-------------------------------------------------------------------------------------------------------------------------------------------
DB_LAYER_IO                    DB_PHYSIO_BYTES               |##################################################|    29291 MB   2212 MB/sec
DB_LAYER_IO                    DB_PHYSRD_BYTES               |##################################################|    29291 MB   2212 MB/sec
DB_LAYER_IO                    DB_PHYSWR_BYTES               |                                                  |        0 MB      0 MB/sec
AVOID_DISK_IO                  PHYRD_FLASH_RD_BYTES          |#######################                           |    13321 MB   1006 MB/sec
AVOID_DISK_IO                  PHYRD_STORIDX_SAVED_BYTES     |###########################                       |    15700 MB   1186 MB/sec
REAL_DISK_IO                   SPIN_DISK_IO_BYTES            |                                                  |      270 MB     20 MB/sec
REAL_DISK_IO                   SPIN_DISK_RD_BYTES            |                                                  |      270 MB     20 MB/sec
REAL_DISK_IO                   SPIN_DISK_WR_BYTES            |                                                  |        0 MB      0 MB/sec
REDUCE_INTERCONNECT            PRED_OFFLOADABLE_BYTES        |##################################################|    29291 MB   2212 MB/sec
REDUCE_INTERCONNECT            TOTAL_IC_BYTES                |                                                  |        4 MB      0 MB/sec
REDUCE_INTERCONNECT            SMART_SCAN_RET_BYTES          |                                                  |        4 MB      0 MB/sec
REDUCE_INTERCONNECT            NON_SMART_SCAN_BYTES          |                                                  |        0 MB      0 MB/sec
CELL_PROC_DEPTH                CELL_PROC_DATA_BYTES          |#######################                           |    13591 MB   1027 MB/sec
CELL_PROC_DEPTH                CELL_PROC_INDEX_BYTES         |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_TO_CLIENT_BYTES           |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_FROM_CLIENT_BYTES         |                                                  |        0 MB      0 MB/sec

The DB_PHYSIO_BYTES is still the same (29GB) as in the previous case – as this database layer metric knows only about the amount of I/Os it has requested, not what actually gets (or doesn’t get) done inside the storage cells. SMART_SCAN_RET_BYTES is only 4MB (compared to almost 25GB previously) so evidently there must be some early filtering going on somewhere in the lower layers.

And now to the main topic of this article – the PHYRD_STORIDX_SAVED_BYTES metric (which comes from the cell physical IO bytes saved by storage index statistic in V$SESSTAT) shows that we managed to avoid doing 15700 MB worth of I/O completely, thanks to the storage indexes. And this is without having any direct SQL filter predicates on the large ORDERS table! How is this possible? Keep reading :-)

Before going deeper, let’s look into the predicate section of these cursors.

The slow serial query:

--------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | E-Rows | Pstart| Pstop |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               |        |       |       |
|*  1 |  HASH JOIN                          |               |      1 |       |       |
|   2 |   TABLE ACCESS BY GLOBAL INDEX ROWID| CUSTOMERS     |      1 | ROWID | ROWID |
|*  3 |    INDEX RANGE SCAN                 | CUST_EMAIL_IX |      1 |       |       |
|   4 |   PARTITION HASH ALL                |               |    506M|     1 |    64 |
|   5 |    TABLE ACCESS STORAGE FULL        | ORDERS        |    506M|     1 |    64 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   3 - access("C"."CUST_EMAIL"='florencio@ivtboge.com')

The fast parallel query:

-------------------------------------------------------------------------------------------
| Id  | Operation                                | Name          | E-Rows | Pstart| Pstop |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |               |        |       |       |
|   1 |  PX COORDINATOR                          |               |        |       |       |
|   2 |   PX SEND QC (RANDOM)                    | :TQ10001      |      1 |       |       |
|*  3 |    HASH JOIN                             |               |      1 |       |       |
|   4 |     BUFFER SORT                          |               |        |       |       |
|   5 |      PX RECEIVE                          |               |      1 |       |       |
|   6 |       PX SEND BROADCAST                  | :TQ10000      |      1 |       |       |
|   7 |        TABLE ACCESS BY GLOBAL INDEX ROWID| CUSTOMERS     |      1 | ROWID | ROWID |
|*  8 |         INDEX RANGE SCAN                 | CUST_EMAIL_IX |      1 |       |       |
|   9 |     PX PARTITION HASH ALL                |               |    506M|     1 |    64 |
|* 10 |      TABLE ACCESS STORAGE FULL           | ORDERS        |    506M|     1 |    64 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   8 - access("C"."CUST_EMAIL"='florencio@ivtboge.com')
  10 - storage(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))

See how the parallel plan has the storage() predicate with SYS_OP_BLOOM_FILTER in it – the Bloom filter :BF0000 will be compared to the hashed CUSTOMER_ID column values when scanning the ORDERS table in the storage cells. So this shows that your session attempts to push the Bloom filter down to the storage layer.

However, merely seeing the Bloom filter storage predicate in the plan doesn’t tell you whether any IOs were avoided thanks to the storage indexes and Bloom filters. The only way to know would be to run the query and look into the cell physical IO bytes saved by storage index metric (or PHYRD_STORIDX_SAVED_BYTES in Exadata Snapper). Unfortunately it will not be easy to measure this at rowsource level when you have multiple smart scans happening in different locations of your execution plan (more about this in the next blog entry :-)
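A minimal way to check that metric for your own session is to note the counter value right before and after running the query – the delta is what matters (a sketch using V$MYSTAT):

SELECT sn.name, ms.value
FROM   v$mystat ms, v$statname sn
WHERE  ms.statistic# = sn.statistic#
AND    sn.name = 'cell physical IO bytes saved by storage index';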

Anyway, how is Oracle able to avoid reading in blocks of one table based on the output of another table involved in the join?

The answer lies in the parameter _bloom_minmax_enabled. Its description says "enable or disable bloom min max filtering". This parameter is true by default and on Exadata it means that in addition to computing the Bloom filter bitmap (based on hashed values in the join column) we also keep track of the smallest and biggest value retrieved from the driving table of the join.
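If you want to verify the setting in your own environment, the usual hidden-parameter lookup against the X$ tables works (a sketch; run as SYS):

SELECT n.ksppinm  AS parameter
     , v.ksppstvl AS value
FROM   x$ksppi  n
     , x$ksppcv v
WHERE  n.indx = v.indx
AND    n.ksppinm = '_bloom_minmax_enabled';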

In our example, the Bloom filter bitmap of all matching CUSTOMER_IDs retrieved from the CUSTOMERS table (after any direct predicate filtering) would be sent to the storage cells, together with the smallest and biggest CUSTOMER_ID retrieved from the driving table of the join.

In our hand-crafted example, where only a single customer was taken from the driving table, only one bit in the Bloom filter would be set and both the MIN and MAX CUSTOMER_ID of interest (in the joined ORDERS table) were set to 399999199 (I had about 400 million customers generated in my benchmark dataset).

So, knowing that we were looking for CUSTOMER_IDs only in the single-value "range" of 399999199, the Smart Scan could start skipping I/Os thanks to the storage indexes in memory! You can think of this as runtime BETWEEN predicate generation from the driving table of a join, which then gets offloaded to the storage cells and applied when scanning the other table in the join. I/Os can be avoided thanks to knowing the range of values we are looking for, and further row-level filtering can then be done using the usual Bloom filter bitmap comparison.
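To make this concrete, the effect on the ORDERS scan in our example is roughly as if the cursor carried this extra, runtime-generated range predicate (a conceptual illustration only – this is not what the actual cursor looks like):

SELECT *
FROM   soe.orders o
WHERE  o.customer_id BETWEEN 399999199 AND 399999199;   -- MIN/MAX CUSTOMER_ID from the driving table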

 

Note that the above examples are from Oracle 11.2.0.3 – in this version the Bloom filter pushdown should kick in only for parallel queries (although Jonathan Lewis has blogged about a case where this happens for serial queries on 11.2.0.3 too).

And this is the ultimate reason why the serial execution was so much slower and less efficient than the parallel run – in my demos, on 11.2.0.3 the Bloom filter usage kicked in only when running the query in parallel.

However, this has changed in Oracle 11.2.0.4 – now serial queries also frequently compute Bloom filters and push them to the storage cells, both for the usual Bloom filtering and for I/O skipping by comparing the Bloom filter min/max values to the Exadata storage index memory structures.

This is an example from Oracle 11.2.0.4, the plan is slightly different because in this database my test tables are not partitioned (and I didn’t have any indexes on the tables):

------------------------------------------------------------------
| Id  | Operation                   | Name      | E-Rows |E-Bytes|
------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |        |       |
|*  1 |  HASH JOIN                  |           |      1 |   114 |
|   2 |   JOIN FILTER CREATE        | :BF0000   |      1 |    64 |
|*  3 |    TABLE ACCESS STORAGE FULL| CUSTOMERS |      1 |    64 |
|   4 |   JOIN FILTER USE           | :BF0000   |   4581K|   218M|
|*  5 |    TABLE ACCESS STORAGE FULL| ORDERS    |   4581K|   218M|
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   3 - storage("C"."CUST_EMAIL"='florencio@ivtboge.com')
       filter("C"."CUST_EMAIL"='florencio@ivtboge.com')
   5 - storage(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))

The Bloom bitmap (and the MIN/MAX values) would be computed in step 2 and would then be used by the full table scan in step 5 – with the filtering info pushed all the way down to the storage layer (assuming that the Smart Scan did kick in).

If you look into the Outline hints section of this plan, you'll see that a PX_JOIN_FILTER hint has shown up there; it instructs the optimizer to set up and use the Bloom filters (and despite the name, it can now be used for serial queries too):

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('11.2.0.4')
      DB_VERSION('11.2.0.4')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "C"@"SEL$1")
      FULL(@"SEL$1" "O"@"SEL$1")
      LEADING(@"SEL$1" "C"@"SEL$1" "O"@"SEL$1")
      USE_HASH(@"SEL$1" "O"@"SEL$1")
      PX_JOIN_FILTER(@"SEL$1" "O"@"SEL$1")
      END_OUTLINE_DATA
  */

So if the CBO doesn't choose it automatically, you can use this hint to test whether your queries could potentially benefit from this feature (note that it controls the whole join filter propagation logic, not only the Exadata part). You can also use the _bloom_serial_filter parameter to disable this behavior on Exadata (not that you should).
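For example, here's a hedged sketch of hinting the demo query for such a test (using the simple alias-based form of the hint):

SELECT /*+ LEADING(c) USE_HASH(o) PX_JOIN_FILTER(o) */
       *
FROM   soe.customers c
     , soe.orders    o
WHERE  o.customer_id = c.customer_id
AND    c.cust_email  = 'florencio@ivtboge.com';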

Note that I usually treat the storage index IO savings as just an added performance benefit at the system level – something that keeps the disk devices a little less busy and therefore yields higher overall throughput. I do not design my applications' performance to be overly dependent on the storage indexes, as it's not possible to easily control which columns end up in the storage index hashtables, and that may give you unpredictable results.

That’s all – hopefully this shed some light on this cool combination of multiple different features (Smart Scans + Hash Joins + Bloom Filters + Storage Indexes).

 


Our take on the Oracle Database 12c In-Memory Option


Enkitec folks have been beta testing the Oracle Database 12c In-Memory Option over the past months and recently the Oracle guys interviewed Kerry Osborne, Cary Millsap and me to get our opinions. In short, this thing rocks!

We can't talk much about the technical details before Oracle 12.1.0.2 is officially out in July, but here's the recorded interview that got published on the Oracle website as part of the In-Memory launch today:

Alternatively go to Oracle webpage:

Just scroll down to the Overview section that says: Video: Database Industry Experts Discuss Oracle Database In-Memory (11:10)

I might actually be even more excited about the In-Memory Option than I was excited about Exadata years ago. The In-Memory Option is not just a performance feature, it’s a simplifying feature too. So, now it’s ok to kill your performance problem with hardware, as long as you use it in a smart way :-)

About index range scans, disk re-reads and how your new car can go 600 miles per hour!


Despite the title, this is actually a technical post about Oracle, disk I/O and Exadata & Oracle In-Memory Database Option performance. Read on :)

If a car dealer tells you that this fancy new car on display goes 10 times (or 100 or 1000) faster than any of your previous ones, then either the salesman is lying or this new car is doing something radically different from all the old ones. You don’t just get orders of magnitude performance improvements by making small changes.

Perhaps the car bends space around it instead of moving – or perhaps it has a jet engine built on it (like the one below :-) :

Anyway, this blog entry is a prelude to my upcoming Oracle In-Memory Database Option series, and here I'll explain one of the radical differences between the old way of thinking and the modern (In-Memory / Smart Scan) way of thinking that allows such performance improvements.

To set the scope and clarify what I mean by the "old way of thinking": I am talking about reporting, analytics and batch workloads here – and the decades-old mantra "if you want more speed, use more indexes".

I’m actually not going to talk about the In-Memory DB option here – but I am going to walk you through the performance numbers of one index range scan. It’s a deliberately simple and synthetic example executed on my laptop, but it should be enough to demonstrate one important point.

Let’s say we have a report that requires me to visit 20% of rows in an orders table and I’m using an index range scan to retrieve these rows (let’s not discuss whether that’s wise or not just yet). First, I’ll give you some background information about the table and index involved.

My test server’s buffer cache is currently about 650 MB:

SQL> show sga

Total System Global Area 2147483648 bytes
Fixed Size                  2926472 bytes
Variable Size             369100920 bytes
Database Buffers          687865856 bytes
Redo Buffers               13848576 bytes
In-Memory Area           1073741824 bytes

The table I am accessing is a bit less than 800 MB in size, about 100k blocks:

SQL> @seg soe.orders

    SEG_MB OWNER  SEGMENT_NAME   SEGMENT_TYPE    BLOCKS 
---------- ------ -------------  ------------- -------- 
       793 SOE    ORDERS         TABLE           101504 

I have removed some irrelevant lines from the output below; I will be using the ORD_WAREHOUSE_IX index for my demo:

SQL> @ind soe.orders
Display indexes where table or index name matches %soe.orders%...

TABLE_OWNER  TABLE_NAME  INDEX_NAME         POS# COLUMN_NAME     DSC
------------ ----------- ------------------ ---- --------------- ----
SOE          ORDERS      ORDER_PK              1 ORDER_ID
                         ORD_WAREHOUSE_IX      1 WAREHOUSE_ID
                                               2 ORDER_STATUS

INDEX_OWNER  TABLE_NAME  INDEX_NAME        IDXTYPE    UNIQ STATUS   PART TEMP  H  LFBLKS       NDK   NUM_ROWS      CLUF LAST_ANALYZED     DEGREE VISIBILIT
------------ ----------- ----------------- ---------- ---- -------- ---- ---- -- ------- --------- ---------- --------- ----------------- ------ ---------
SOE          ORDERS      ORDER_PK          NORMAL/REV YES  VALID    NO   N     3   15801   7148950    7148950   7148948 20140913 16:17:29 16     VISIBLE
             ORDERS      ORD_WAREHOUSE_IX  NORMAL     NO   VALID    NO   N     3   17860      8685    7148950   7082149 20140913 16:18:03 16     VISIBLE

I am going to do an index range scan on the WAREHOUSE_ID column:

SQL> @descxx soe.orders

Col# Column Name                    Null?      Type                      NUM_DISTINCT        Density  NUM_NULLS HISTOGRAM       NUM_BUCKETS Low Value                        High Value
---- ------------------------------ ---------- ------------------------- ------------ -------------- ---------- --------------- ----------- -------------------------------- --------------------------------
   1 ORDER_ID                       NOT NULL   NUMBER(12,0)                   7148950   .00000013988          0                           1 1                                7148950
...
   9 WAREHOUSE_ID                              NUMBER(6,0)                        999   .00100100100          0                           1 1                                999
...

Also, I enabled SQL trace and event 10298 – “ORA-10298: ksfd i/o tracing”, more about that later:

SQL> ALTER SESSION SET EVENTS '10298 trace name context forever, level 1';

Session altered.

SQL> EXEC SYS.DBMS_MONITOR.SESSION_TRACE_ENABLE(waits=>TRUE);

PL/SQL procedure successfully completed.

SQL> SET AUTOTRACE ON STAT

Ok, now we are ready to run the query! (It’s slightly formatted):

SQL> SELECT /*+ MONITOR INDEX(o, o(warehouse_id)) */ 
         SUM(order_total) 
     FROM 
         soe.orders o 
     WHERE 
         warehouse_id BETWEEN 400 AND 599;

Let’s check the basic autotrace figures:

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
    1423335  consistent gets
     351950  physical reads
          0  redo size
        347  bytes sent via SQL*Net to client
        357  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

What?! We have done 351950 physical reads?! That's 351950 blocks read via physical read operations – about 2.7 GB worth of I/O done just for this query! Our entire table size was under 800 MB and the index size under 150 MB. Shouldn't indexes allow us to visit fewer blocks than the table size?!
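A quick back-of-the-envelope check of that number, assuming the default 8 KB block size:

-- 351950 blocks * 8192 bytes per block, expressed in GB
SELECT ROUND(351950 * 8192 / POWER(1024, 3), 2) AS gb_read FROM dual;   -- returns ~2.69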

Let’s dig deeper – by breaking down this IO number by execution plan line (using a SQL Monitoring report in this case):

Global Stats
================================================================
| Elapsed |   Cpu   |    IO    | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
================================================================
|      48 |      25 |       23 |     1 |     1M | 352K |   3GB |
================================================================

SQL Plan Monitoring Details (Plan Hash Value=16715356)
=============================================================================================================================================
| Id |               Operation                |       Name       | Execs |   Rows   | Read | Read  | Activity |       Activity Detail       |
|    |                                        |                  |       | (Actual) | Reqs | Bytes |   (%)    |         (# samples)         |
=============================================================================================================================================
|  0 | SELECT STATEMENT                       |                  |     1 |        1 |      |       |          |                             |
|  1 |   SORT AGGREGATE                       |                  |     1 |        1 |      |       |          |                             |
|  2 |    TABLE ACCESS BY INDEX ROWID BATCHED | ORDERS           |     1 |       1M | 348K |   3GB |    96.30 | Cpu (1)                     |
|    |                                        |                  |       |          |      |       |          | db file parallel read (25)  |
|  3 |     INDEX RANGE SCAN                   | ORD_WAREHOUSE_IX |     1 |       1M | 3600 |  28MB |     3.70 | db file sequential read (1) |
=============================================================================================================================================

So, most of these IOs come from accessing the table (after fetching the relevant ROWIDs from the index). 96% of this query's response time was also spent in that table access line. We have done about 348,000 IO requests for fetching blocks from this table – over 3x more than the number of blocks in the entire table! So we must be re-reading some blocks from disk again and again for some reason.

Let's confirm whether we are actually having re-reads. This is why I enabled SQL Trace and event 10298 – I can just post-process the tracefile and see if I/O operations with the same file# and block# combination show up.

However, using just SQL Trace isn't enough here: multiblock read wait events don't list every block read (you'd have to infer them from the starting block# and block count), and the "db file parallel read" event doesn't show any block#/file# info at all in SQL Trace (this "vector read" wait event encompasses multiple different block reads under a single wait event).

The classic single block read has the file#/block# info:

WAIT #139789045903344: nam='db file sequential read' ela= 448 file#=2 block#=1182073 blocks=1 obj#=93732 tim=156953721029

The parallel read wait events don’t have individual file#/block# info (just total number of files/blocks involved):

WAIT #139789045903344: nam='db file parallel read' ela= 7558 files=1 blocks=127 requests=127 obj#=93696 tim=156953729450

Anyway, because we had plenty of db file parallel read waits that don’t show all the detail in SQL Trace, I also enabled the event 10298 that gives us following details below (only tiny excerpt below):

...
ksfd_osdrqfil:fob=0xce726160 bufp=0xbd2be000 blkno=1119019 nbyt=8192 flags=0x4
ksfdbio:rq=0x7f232c4edb00 fob=0xce726160 aiopend=126
ksfd_osdrqfil:fob=0xce726160 bufp=0x9e61a000 blkno=1120039 nbyt=8192 flags=0x4
ksfdbio:rq=0x7f232c4edd80 fob=0xce726160 aiopend=127
ksfdwtio:count=127 aioflags=0x500 timeout=2147483647 posted=(nil)
...
ksfdchkio:ksfdrq=0x7f232c4edb00 completed=1
ksfdchkio:ksfdrq=0x7f232c4edd80 completed=0
WAIT #139789045903344: nam='db file parallel read' ela= 6872 files=1 blocks=127 requests=127 obj#=93696 tim=156953739197

So, on Oracle 12.1.0.2 on Linux x86_64 with xfs filesystem with async IO enabled and filesystemio_options = SETALL we get the “ksfd_osdrqfil” trace entries to show us the block# Oracle read from a datafile. It doesn’t show the file# itself, but it shows the accessed file state object address (FOB) in SGA and as it was always the same in the tracefile, I know duplicate block numbers listed in trace would be for the same datafile (and not for a block with the same block# in some other datafile). And the tablespace I used for my test had a single datafile anyway.

Anyway, I wrote a simple script to summarize whether there were any disk re-reads in this tracefile (of a select statement):

$ grep ^ksfd_osdrqfil LIN121_ora_11406.trc | awk '{ print $3 }' | sort | uniq -c | sort -nr | head -20
     10 blkno=348827
     10 blkno=317708
      9 blkno=90493
      9 blkno=90476
      9 blkno=85171
      9 blkno=82023
      9 blkno=81014
      9 blkno=80954
      9 blkno=74703
      9 blkno=65222
      9 blkno=63899
      9 blkno=62977
      9 blkno=62488
      9 blkno=59663
      9 blkno=557215
      9 blkno=556581
      9 blkno=555412
      9 blkno=555357
      9 blkno=554070
      9 blkno=551593
...

Indeed! The “worst” blocks have been read in 10 times – all that for a single query execution.

I only showed the top 20 blocks here, but even when I used "head -10000" and "head -50000" above, I still saw blocks that had been read into the buffer cache 8 and 4 times respectively.

Looking into earlier autotrace metrics, my simple index range scan query did read in over 3x more blocks than the total table and index size combined (~350k blocks read while the table had only 100k blocks)! Some blocks have gotten kicked out from buffer cache and have been re-read back into cache later, multiple times.

Hmm, let’s think further: We are accessing only about 20% of a 800 MB table + 150 MB index, so the “working set” of datablocks used by my query should be well less than my 650 MB buffer cache, right? And as I am the only user in this database, everything should nicely fit and stay in buffer cache, right?

Actually, both of the arguments above are flawed:

  1. Accessing 20% of the rows in a table doesn't automatically mean that we need to visit only 20% of that table's blocks! Maybe all of the table's blocks contain a few of the rows this index range scan needs? In that case we might need to visit all of the table's blocks (or most of them) and extract only a few matching rows from each block. Nevertheless, the "working set" of required blocks for this query would include almost all of the table's blocks, not only 20%, and we must read all of them in at some point during the range scan. So the matching rows are not tightly packed in table blocks in physical correspondence with the index range scan's table access driving order, but are potentially "randomly" scattered all over the table. This means that an index range scan may come back and access the same data block again and again, to get yet another row from it, whenever the ROWID entries in the index leaf blocks point there. This is what I call buffer re-visits. (Now scroll back up and look at that index's clustering factor – or use the dictionary query sketched right after this list :-)

  2. So what, all the buffer re-visits should be really fast as the previously read block is going to be in the buffer cache, right? Well, not really – especially when the working set of blocks read is bigger than the buffer cache. But even if it is smaller, the Oracle buffer cache isn't managed using basic LRU replacement logic (since 8.1.6). New blocks that get read into the buffer cache are put into the middle of the "LRU" list and they work their way up to the "hot" end only if they are touched enough times before someone manages to flush them out. So even if you are the single user of the buffer cache, there's a chance that some just-recently-read blocks get aged out of the buffer cache – by the same query still running – before they get hot enough. And this means that your next buffer re-visit may turn into one of the disk block re-reads we saw in the tracefiles. If you combine this with the reality of production systems, where there are a thousand more users trying to do what you're doing at the same time, it becomes clear that you'll be able to use only a small portion of the total buffer cache for your needs. This is why people sometimes configure KEEP pools – not that the KEEP pool is somehow able to keep more blocks in memory for longer per GB of RAM, but simply for segregating the less important troublemakers from the more important… troublemakers :)
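Here's the dictionary query referenced in point 1 above – comparing the index clustering factor to the table's block and row counts gives a feel for how many buffer (re)visits a wide range scan may cause (a simple sketch against the demo schema):

SELECT i.index_name
     , i.clustering_factor   -- close to NUM_ROWS means rows are scattered relative to index order
     , t.blocks
     , t.num_rows
FROM   dba_indexes i
     , dba_tables  t
WHERE  i.table_owner = t.owner
AND    i.table_name  = t.table_name
AND    t.owner       = 'SOE'
AND    t.table_name  = 'ORDERS';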

 

So what’s my point here – in the context of this blog post’s title?

Let's start from Exadata – over the last years it has given many customers order(s) of magnitude better analytics, reporting and batch performance compared to their old systems – if done right, of course. In other words, instead of indexing even more and performing wide index range scans with millions of random block reads and re-reads, they ditched many indexes and started doing full table scans. Full table scans do not have the "scaling problems" of a wide index range scan (or a "wide" nested loop join driving access to another table). In addition you get all the cool stuff that goes really well with full scans – multiblock reads, deep prefetching, partition-wise hash joins, partition pruning and of course all the throughput and Smart Scan magic on Exadata.

An untuned complex SQL statement on a complex schema with lots of non-ideal indexes may end up causing a lot of "waste IO" (I don't have a better term) and similarly wasted CPU. And often it's not simple to actually fix the query – it may end up needing a significant schema adjustment/redesign that would also require changing the application code in many different places (ain't gonna happen). By defaulting reporting to full table scans, you can actually eliminate a lot of such waste, assuming that you have a high-throughput – and ideally smart – IO subsystem. (Yes, there are always exceptions and special cases.)

We had a customer who had a reporting job that ran almost 2000x faster after moving to Exadata (from 17 hours to 30 seconds or something like that). Their first reaction was: “It didn’t run!” Indeed it did run and it ran correctly. Such radical improvement came from the fact that the new system – compared to the old system – was doing multiple things radically better. It wasn’t just an incremental tweak of adding a hint or a yet another index without daring to do more significant changes.

In this post I demoed just one of the problems plaguing many old-school Oracle DW and reporting systems. While favoring full table scans has always been counterintuitive for most Oracle shops out there, it was Exadata's hardware, software and also the geek excitement surrounding it that allowed customers to take the leap and switch from the old mindset to the new. I expect the same from the Oracle In-Memory Database Option. More about this in a following post.

 

Oracle In-Memory Column Store Internals – Part 1 – Which SIMD extensions are getting used?


This is the first entry in a series of random articles about some useful internals-to-know of the awesome Oracle Database In-Memory column store. I intend to write about Oracle’s IM stuff that’s not already covered somewhere else and also about some general CPU topics (that are well covered elsewhere, but not always so well known in the Oracle DBA/developer world).

Before going into further details, you might want to review the Part 0 of this series and also our recent Oracle Database In-Memory Option in Action presentation with some examples. And then read this doc by Intel if you want more info on how the SIMD registers and instructions get used.

There’s a lot of talk about the use of your CPUs’ SIMD vector processing capabilities in the Oracle inmemory module, let’s start by checking if it’s enabled in your database at all. We’ll look into Linux/Intel examples here.

The first generation of SIMD extensions in the Intel Pentium world was called MMX. It added 8 new 64-bit MMn registers (the 128-bit XMMn registers came later with SSE). Over time the registers got wider, more registers were added and new features appeared. The later extensions were called Streaming SIMD Extensions (SSE, SSE2, SSSE3, SSE4.1, SSE4.2) and Advanced Vector Extensions (AVX and AVX2).

The currently available AVX2 extensions provide 16 x 256-bit YMMn registers, and AVX-512 in the upcoming Knights Landing microarchitecture (year 2015) will provide 32 x 512-bit ZMMn registers for vector processing.

So how do you check which extensions your CPU supports? On Linux, the "flags" column in /proc/cpuinfo easily provides this info.

Let’s check the Exadatas in our research lab:

Exadata V2:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

So the highest SIMD extension support on this Exadata V2 is SSE4.2 (No AVX!)

Exadata X2:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

Exadata X2 also has SSE4.2 but no AVX.

Exadata X3:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
avx
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

The Exadata X3 supports the newer AVX too.

My laptop (Macbook Pro late 2013):
The Exadata X4 has not yet arrived in our lab, so I'm using my laptop as an example of the latest available CPU generation with AVX2:

Update: Jason Arneil commented that the X4 does not have AVX2 capable CPUs (but the X5 will)

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
avx
avx2
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

The Core-i7 generation supports everything up to the current AVX2 extension set.

So, which extensions is Oracle actually using? Let’s check!

As Oracle needs to run different binary code on CPUs with different capabilities, some of the In-Memory Data (kdm) layer code has been duplicated into separate external libraries – and then gets dynamically loaded into Oracle executable address space as needed. You can run pmap on one of your Oracle server processes and grep for libshpk:

$ pmap 21401 | grep libshpk
00007f0368594000   1604K r-x--  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so
00007f0368725000   2044K -----  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so
00007f0368924000     72K rw---  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so

My (educated) guess is that the “shpk” in libshpk above stands for oS dependent High Performance [K]ompression. “s” prefix normally means platform dependent (OSD) code and this low-level SIMD code sure is platform and CPU microarchitecture version dependent stuff.

Anyway, the above output from an Exadata X2 shows that SSE4.2 SIMD HPK libraries are used on this platform (and indeed, X2 CPUs do support SSE4.2, but not AVX).

Let’s list similar files from $ORACLE_HOME/lib:

$ cd $ORACLE_HOME/lib
$ ls -l libshpk*.so
-rw-r--r-- 1 oracle oinstall 1818445 Jul  7 04:16 libshpkavx12.so
-rw-r--r-- 1 oracle oinstall    8813 Jul  7 04:16 libshpkavx212.so
-rw-r--r-- 1 oracle oinstall 1863576 Jul  7 04:16 libshpksse4212.so

So, there are libraries for AVX and AVX2 in the lib directory too (the “12” suffix for all file names just means Oracle version 12). The AVX2 library is almost empty though (and the nm/objdump commands don’t show any Oracle functions in it, unlike in the other files).

Let’s run pmap on a process in my new laptop (which supports AVX and AVX2 ) to see if the AVX2 library gets used:

$ pmap 18969 | grep libshpk     
00007f85741b1000   1560K r-x-- libshpkavx12.so
00007f8574337000   2044K ----- libshpkavx12.so
00007f8574536000     72K rw--- libshpkavx12.so

Despite my new laptop supporting AVX2, only the AVX library is used (the AVX2 library is named libshpkavx212.so). So it looks like the AVX2 extensions are not used yet in this version (it’s the first Oracle 12.1.0.2 GA release without any patches). I’m sure this will be added soon, along with more features and bugfixes.

To be continued …

Public Appearances 2015


Here’s where I’ll hang out in the following months:

11-12 Feb 2015: IOUG Exadata SIG Virtual Conference (free online event)

  • Presentation: Exadata Performance: Latest Improvements and Less Known Features
  • It’s a free online event, so sign up here

18-19 Feb 2015: RMOUG Training Days (in Denver)

  • I won’t speak there this year, but plan to hang out on Wednesday evening and drink beer
  • More info here

1-5 March 2015: Hotsos Symposium 2015

31 May – 2 June 2015: Enkitec E4

  • Even more awesome Exadata (and now also Hadoop) content there!
  • I plan to speak there again, about Exadata performance and/or integrating Oracle databases with Hadoop
  • More info here

Advanced Oracle Troubleshooting v3.0 training

  • One of the reasons why I’ve been so quiet in recent months is that I’ve been rebuilding my entire Advanced Oracle Troubleshooting training material from ground up.
  • This new seminar focuses on systematic Oracle troubleshooting and internals of database versions all the way to Oracle 12c.
  • I will launch the AOT seminar v3.0 in early March – you can already register your interest here!

 

Oracle Exadata Performance: Latest Improvements and Less Known Features


Here are the slides of a presentation I did at the IOUG Virtual Exadata conference in February. I’m explaining the basics of some new Oracle 12c things related to Exadata, plus current latest cellsrv improvements like Columnar Flash Cache and IO skipping for Min/Max retrieval using Storage Indexes:

Note that Christian Antognini and Roger MacNicol have written separate articles about some new features:

Enjoy!

 

Sqlplus is my second home, part 8: Embedding multiple sqlplus arguments into one variable


I’ve updated some of my ASH scripts to use these 4 arguments in a standard way:

  1. What ASH columns to display (and aggregate by)
  2. Which ASH rows to use for the report (filter)
  3. Time range start
  4. Time range end

So this means whenever I run ashtop (or dashtop) for example, I need to type in all 4 parameters. The example below would show top SQL_IDs only for user SOE sessions from last hour of ASH samples:

SQL> @ashtop sql_id username='SOE' sysdate-1/24 sysdate

    Total
  Seconds     AAS %This   SQL_ID        FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
--------- ------- ------- ------------- ------------------- ------------------- -----------------
     2271      .6   21% | 56pwkjspvmg3h 2015-03-29 13:13:16 2015-03-29 13:43:34               145
     2045      .6   19% | gkxxkghxubh1a 2015-03-29 13:13:16 2015-03-29 13:43:14               149
     1224      .3   11% | 29qp10usqkqh0 2015-03-29 13:13:25 2015-03-29 13:43:32               132
      959      .3    9% | c13sma6rkr27c 2015-03-29 13:13:19 2015-03-29 13:43:34               958
      758      .2    7% |               2015-03-29 13:13:16 2015-03-29 13:43:31                 1

When I want more control and specify a fixed time range, I can just use the ANSI TIMESTAMP (or TO_DATE) syntax:

SQL> @ashtop sql_id username='SOE' "TIMESTAMP'2015-03-29 13:00:00'" "TIMESTAMP'2015-03-29 13:15:00'"

    Total
  Seconds     AAS %This   SQL_ID        FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
--------- ------- ------- ------------- ------------------- ------------------- -----------------
      153      .2   22% | 56pwkjspvmg3h 2015-03-29 13:13:29 2015-03-29 13:14:59                 9
      132      .1   19% | gkxxkghxubh1a 2015-03-29 13:13:29 2015-03-29 13:14:59                 8
       95      .1   14% | 29qp10usqkqh0 2015-03-29 13:13:29 2015-03-29 13:14:52                 7
       69      .1   10% | c13sma6rkr27c 2015-03-29 13:13:31 2015-03-29 13:14:58                69
       41      .0    6% |               2015-03-29 13:13:34 2015-03-29 13:14:59                 1

Note that arguments 3 & 4 above are in double quotes, as there's a space within the timestamp value. Without the double quotes, sqlplus would think the script has a total of 6 arguments due to the spaces.

I don’t like to type too much though (every character counts!) so I was happy to see that the following sqlplus hack works. I just defined pairs of arguments as sqlplus DEFINE variables as seen below (also in init.sql now):

  -- geeky shortcuts for producing date ranges for various ASH scripts
  define     min="sysdate-1/24/60 sysdate"
  define  minute="sysdate-1/24/60 sysdate"
  define    5min="sysdate-1/24/12 sysdate"
  define    hour="sysdate-1/24 sysdate"
  define   2hours="sysdate-1/12 sysdate"
  define  24hours="sysdate-1 sysdate"
  define      day="sysdate-1 sysdate"
  define    today="TRUNC(sysdate) sysdate"

And now I can type just 3 arguments instead of 4 when I run some of my scripts and want some predefined behavior like seeing last 5 minutes’ activity:

SQL> @ashtop sql_id username='SOE' &5min

    Total
  Seconds     AAS %This   SQL_ID        FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
--------- ------- ------- ------------- ------------------- ------------------- -----------------
      368     1.2   23% | gkxxkghxubh1a 2015-03-29 13:39:34 2015-03-29 13:44:33                37
      241      .8   15% | 56pwkjspvmg3h 2015-03-29 13:40:05 2015-03-29 13:44:33                25
      185      .6   12% | 29qp10usqkqh0 2015-03-29 13:39:40 2015-03-29 13:44:33                24
      129      .4    8% | c13sma6rkr27c 2015-03-29 13:39:35 2015-03-29 13:44:32               129
      107      .4    7% |               2015-03-29 13:39:34 2015-03-29 13:44:33                 1

That’s it, I hope this hack helps :-)

By the way – if you’re a command line & sqlplus fan, check out the SQLCL command line “new sqlplus” tool from the SQL Developer team! (you can download it from the SQL Dev early adopter page for now).

 

Advanced Oracle Troubleshooting Guide – Part 12: control file reads causing enq: SQ – contention waits?


Vishal Desai systematically troubleshot an interesting case where the initial symptoms of the problem showed a spike of enq: SQ – contention waits, but he dug deeper – and found the root cause to be quite different. He followed the blockers of waiting sessions manually to reach the root cause – and also used my @ash/ash_wait_chains.sql and @ash/event_hist.sql scripts to extract the same information more conveniently (note that he had modified the scripts to take AWR snap_ids as time range parameters instead of the usual date/timestamp arguments):

Definitely worth a read if you’re into troubleshooting non-trivial performance problems :)


Old ventures and new adventures

$
0
0

I have some news, two items actually.

First, today (it’s still 18th June in California) is my blog’s 8th anniversary!

I wrote my first blog post, about Advanced Oracle Troubleshooting, exactly 8 years ago, on 18th June 2007 and have written 229 blog posts since. I had started writing and accumulating my TPT script collection a couple of years earlier and now it has over 1000 files in it! And no, I don’t remember what all of them do and even why I had written them. Also I haven’t yet created an index/documentation for all of them (maybe on the 10th anniversary? ;)

Thanks everyone for your support, reading, commenting and the ideas we’ve exchanged over all these years, it’s been awesome to learn something new every single day!

You may have noticed that I haven’t been too active in online forums nor blogging much in the last couple of years, which brings me to the second news item(s):

I’ve been heavily focusing on Hadoop. It is the future. It will win, for the same reasons Linux won. I moved to the US over a year ago and am currently in San Francisco. The big data hype is biggest here – except it’s not hype anymore, and Hadoop is getting enterprise-ready.

I am working on a new startup. I am the CEO who still occasionally troubleshoots stuff (must learn something new every day!). We officially incorporated some months ago, but our first developers in Dallas and London have been busy in the background for over a year. By now we are beta testing with our most progressive customers ;-) We are going to be close partners with old and new friends in modern data management space and especially the awesome folks in Accenture Enkitec Group.

The name is Gluent. We glue together the old and new worlds in enterprise IT. Relational databases vs. Hadoop. Legacy ETL vs. Spark. SAN storage vs. the cloud. Jungles of data feeds vs. a data lake. I’m not going to tell you any more as we are still in stealth mode ;-)

Now, where does this leave Oracle technology? Well, I think it still kicks ass and it ain’t going away! In fact we are betting on it. Hadoop is here to stay, but your existing systems aren’t going away any time soon.

I wouldn’t want to run my critical ERP or complex transactional systems on anything other than Oracle. Want real-time in-memory reporting on your existing Oracle OLTP system – with immediate consistency, not a multi-second lag? Oracle. Oracle is the king of complex OLTP and I don’t see that changing soon.

So, thanks for reading all the way to the end – and expect to hear much more about Gluent in the future! You can follow @GluentInc Twitter handle to be the first to hear any further news :-)

 

The Hybrid World is Coming

$
0
0

Here’s the video of E4 keynote we delivered together with Kerry Osborne a few weeks ago.

It explains what we see is coming, at a high level, from long time Oracle database professionals’ viewpoint and using database terminology (as the E4 audience is all Oracle users like us).

However, this change is not really about Oracle database world, it’s about a much wider shift in enterprise computing: modern Hadoop data lakes and clouds are here to stay. They are already taking over many workloads traditionally executed on in-house RDBMS systems on SAN storage arrays – especially all kinds of reporting and analytics. Oracle is just one of the many vendors affected by all this and they’ve also jumped onto the Hadoop bandwagon.

However, it would be naive to think that Hadoop would somehow replace all your transactional or ERP systems or existing application code with thousands of complex SQL reports. Many of the traditional systems aren’t going away any time soon.

But the hybrid world is coming. It’s been a very good idea for Oracle DBAs to additionally learn Linux over the last 5-10 years; now is pretty much the right time to start learning Hadoop too. More about this in a future article ;-)

Check out the keynote video here:

Enjoy :-)

RAM is the new disk – and how to measure its performance – Part 1 – Introduction

$
0
0

RAM is the new disk, at least in the In-Memory computing world.

No, I am not talking about Flash here, but Random Access Memory – RAM as in SDRAM. I’m far from the first one to say it. Jim Gray wrote this in 2006: “Tape is dead, disk is tape, flash is disk, RAM locality is king” (presentation)

Also, I’m not going to talk about how RAM is faster than disk (everybody knows that), but in fact how RAM is the slow component of an in-memory processing engine.

I will use Oracle’s In-Memory column store and the hardware performance counters in modern CPUs for drilling down into the low-level hardware performance metrics about CPU efficiency and memory access.

But let’s first get started by looking a few years into past into the old-school disk IO and index based SQL performance bottlenecks :)

Have you ever optimized a SQL statement by adding all the columns it needs into a single index and then letting the database do a fast full scan on the “skinny” index as opposed to a full table scan on the “fat” table? The entire purpose of this optimization was to reduce disk IO and SAN interconnect traffic for your critical query (where the amount of data read would have made index range scans inefficient).

This special-purpose approach would have benefitted your full scan in two ways:

  1. In data warehouses, a fact table may contain hundreds of columns, so an index with “only” 10 columns would be much smaller. Full “table” scanning the entire skinny index would still generate much less IO traffic than the table scan, so it became a viable alternative to wide index range scans and some full table scans (and bitmap indexes with star transformations indeed benefitted from the “skinniness” of these indexes too).
  2. As the 10-column index segment is potentially 50x smaller than the 500-column fact table, it might even fit entirely into buffer cache, should you decide so.

This is all thanks to physically changing the on-disk data structure, to store a copy of only the data I need in one place (column pre-projection?) and store these elements close to each other (locality).

Note that I am not advocating the use of this as a tuning technique here, but just explaining what was sometimes used to make a handful critical queries fast at the expense of the disk space, DML, redo and buffer cache usage overhead of having another index – and why it worked.
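
To make this more concrete, here’s a minimal sketch of that old-school technique. The SALES_FACT table, its columns and the index name below are hypothetical placeholders, not objects from my test schema:

-- a "skinny" covering index holding only the columns the critical query needs
-- (assuming order_date is NOT NULL, so the index contains all relevant rows)
CREATE INDEX sales_fact_skinny_ix ON sales_fact
  (order_date, cust_id, prod_id, quantity_sold, amount_sold);

-- the query can now be answered with a fast full scan of the skinny index
-- instead of a full table scan of the "fat" 500-column fact table
SELECT /*+ INDEX_FFS(f sales_fact_skinny_ix) */
       SUM(f.amount_sold)
FROM   sales_fact f
WHERE  f.order_date >= DATE '2015-01-01';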

Now, why would I worry about this at all in a properly warmed up in-memory database, where disk IO is not on the critical path of data retrieval at all? Well, now that we have removed the disk IO bottleneck, we inevitably hit the next slowest component as a bottleneck and this is … RAM.

Sequentially scanning RAM is slow. Randomly accessing RAM lines is even slower! Of course this slowness is all relative to the modern CPUs that are capable of processing billions of instructions per core every second.

Back to Oracle’s In-Memory column store example: Despite all the marketing talk about loop vectorization with CPU SIMD processing extensions, the most fundamental change required for “extreme performance” is simply about reducing the data traffic between RAM and CPUs.

This is why I said “SIMD would be useless if you waited on main memory all the time” at the Oracle Database In-Memory in Action presentation at Oracle OpenWorld (Oct 2014):

Oracle In-Memory in Action presentation

The “secret sauce” of Oracle’s in-memory scanning engine is the columnar storage of data, the ability to (de)compress it cheaply and accessing only the filtered columns’ memory first, before even touching any of the other projected columns required by the query. This greatly reduces the slow RAM traffic, just like building that skinny index reduced disk I/O traffic back in the on-disk database days. The SIMD instruction set extensions are just icing on the columnar cake.

So far this is just my opinion, but in the next part I will show you some numbers too!

 

RAM is the new disk – and how to measure its performance – Part 2 – Tools

$
0
0

In the previous article I explained that the main requirement for high-speed in-memory data scanning is column-oriented storage format for in-memory data. SIMD instruction processing is just icing on the cake. Let’s dig deeper. This is a long post, you’ve been warned.

Test Environment

I will cover full test results in the next article in this series. First, let’s look into the test setup, environment and what tools I used for peeking inside CPU hardware.

I was running the tests on a relatively old machine with 2 CPU sockets, with 6-core CPUs in each socket (2s12c24t):

$ egrep "MHz|^model name" /proc/cpuinfo | sort | uniq -c
     24 cpu MHz		: 2926.171
     24 model name	: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz

The CPUs support SSE4.2 SIMD extensions (but not the newer AVX stuff):

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

Even though the /proc/cpuinfo above shows the CPU clock frequency as 2.93GHz, these CPUs have the Intel Turbo Boost feature that allows some cores to run at up to 3.33GHz frequency when not all cores are fully busy and the CPUs aren’t too hot.

Indeed, the turbostat command below shows that the CPU core executing my Oracle process was running at 3.19GHz frequency:

# turbostat -p sleep 1
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             6.43 3.02 2.93   0  93.57   0.00   0.00   59   0.00   0.00
 0   0   0   4.49 3.19 2.93   0  95.51   0.00   0.00   46   0.00   0.00
 0   1   1  10.05 3.19 2.93   0  89.95   0.00   0.00   50
 0   2   2   2.48 3.19 2.93   0  97.52   0.00   0.00   45
 0   8   3   2.05 3.19 2.93   0  97.95   0.00   0.00   44
 0   9   4   0.50 3.20 2.93   0  99.50   0.00   0.00   50
 0  10   5 100.00 3.19 2.93   0   0.00   0.00   0.00   59
 1   0   6   6.25 2.23 2.93   0  93.75   0.00   0.00   44   0.00   0.00
 1   1   7   3.93 2.04 2.93   0  96.07   0.00   0.00   43
 1   2   8   0.82 2.15 2.93   0  99.18   0.00   0.00   44
 1   8   9   0.41 2.48 2.93   0  99.59   0.00   0.00   41
 1   9  10   0.99 2.35 2.93   0  99.01   0.00   0.00   43
 1  10  11   0.76 2.36 2.93   0  99.24   0.00   0.00   44

I will come back to this CPU frequency turbo-boosting later when explaining some performance metrics.

I ran the experiments in Oct/Nov 2014, so I used a relatively early Oracle 12.1.0.2.1 version with a bundle patch (19189240) for in-memory stuff.

The test was deliberately very simple as I was researching raw in-memory scanning and filtering speed and was not looking into join/aggregation performance. I was running the query below with different hints and parameters to change access path options:

SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0

I used the CUSTOMERS table of the Swingbench Sales History schema. I deliberately didn’t use COUNT(*), but COUNT(col) on an actual column “cust_valid” that was nullable, so values in that column had to be accessed for correct counting.

Also, I picked the last column in the table, as accessing columns at the physical “end” of a row (in row-oriented storage format) would cause more memory/cache accesses and CPU execution branch jumps due to the run-length encoded structure of a row in a datablock. Of course this depends on the number of columns and the width of the row too, plus hardware characteristics like the cache line size (64 bytes on my machine).

Anyway, querying the last column helps to better illustrate what kind of overhead you may be suffering from when filtering that 500-column fact table using columns at the end of it.

SQL> @desc ssh.customers_nopart
           Name                            Null?    Type
           ------------------------------- -------- ----------------------------
    1      CUST_ID                         NOT NULL NUMBER
    2      CUST_FIRST_NAME                 NOT NULL VARCHAR2(20)
    3      CUST_LAST_NAME                  NOT NULL VARCHAR2(40)
...
   22      CUST_EFF_TO                              DATE
   23      CUST_VALID                               VARCHAR2(1)

The table has 69,642,625 rows in it and its segment size is 1613824 blocks / 12608 MB on disk (actual used space in it was slightly lower due to some unused blocks in the segment). I set the table PCTFREE to zero to use all space in the blocks. I also created an HCC-compressed copy of the same table for comparison reasons.
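
For reference, the HCC copy could have been created with something like the statement below. This is just a sketch, not necessarily the exact command I ran (and QUERY LOW compression requires Exadata or other supported Oracle storage):

-- create a Hybrid Columnar Compression copy of the test table
CREATE TABLE customers_nopart_hcc_ql
  COMPRESS FOR QUERY LOW
AS
  SELECT * FROM customers_nopart;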

SQL> @seg tanel.customers_nopart

  SEG_MB OWNER  SEGMENT_NAME               SEGMENT_TYPE      BLOCKS
-------- ------ -------------------------  ------------- ----------
   12608 TANEL  CUSTOMERS_NOPART           TABLE            1613824
    6416 TANEL  CUSTOMERS_NOPART_HCC_QL    TABLE             821248

I made sure that the test tables were completely cached in Oracle buffer cache to eliminate any physical IO component from tests and also enabled in-memory columnar caching for the CUSTOMERS_NOPART table.

SQL> @imseg %.%

    SEG_MB   INMEM_MB  %POP IMSEG_OWNER   IMSEG_SEGMENT_NAME  SEGMENT_TYPE  POP_ST
---------- ---------- ----- ------------- ------------------- ------------- ------
     12608       5913  100% TANEL         CUSTOMERS_NOPART    TABLE         COMPLE
---------- ----------
     12608       5913
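
For reference, enabling and populating the column store for this table could look roughly like the following. Treat the MEMCOMPRESS and PRIORITY settings as placeholder assumptions, not necessarily the exact options I used:

-- enable the In-Memory attribute on the table (Oracle 12.1.0.2+)
ALTER TABLE tanel.customers_nopart INMEMORY MEMCOMPRESS FOR QUERY LOW PRIORITY NONE;

-- with PRIORITY NONE, population is triggered by the first scan touching the segment
SELECT /*+ FULL(c) */ COUNT(*) FROM tanel.customers_nopart c;

-- check population progress
SELECT segment_name, populate_status, bytes_not_populated FROM v$im_segments;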

CPU Activity Measurement Tools

In addition to the usual suspects (Oracle SQL Monitoring reports and Snapper), I used the awesome Linux tool called perf, but not in the typical way you might have used it in the past.

On Linux, perf can be used for profiling code executing on CPUs by sampling the instruction pointer and stack backtraces (perf top), but also for taking snapshots of internal CPU performance counters (perf stat). These CPU performance counters (CPC) tell us what happened inside the CPU during my experiments.

This way we can go way deeper than the high level tools like top utility or getrusage() syscall would ever allow us to go. We’ll be able to measure what physically happened inside the CPU. For example: for how many cycles the CPU core was actually doing useful work pushing instructions further in the execution pipeline vs. was stalled, waiting for requested memory to arrive or for some other internal condition to come true. Also, we can estimate the amount of traffic between the CPU and main memory, plus CPU cache hits/misses at multiple cache levels.

Perf can also do CPC snapshotting and accounting at the OS process level. This means you can measure the internal CPU/memory activity of a single OS process under examination, which was great for my experiment.

Note that these kinds of tools are nothing new, they’ve been around with CPU vendor code profilers ever since CPUs were instrumented with performance counters (but undocumented in the early days). Perf stat just makes this stuff easily accessible on Linux. For example, since Solaris 8, you could use cputrack for extracting similar process-level CPU counter “usage”, other platforms have their own tools.
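
For example, a plain profiling run against a single process (the PID here is just a placeholder) would look like this – in contrast to the counter-snapshotting mode I actually used below:

perf top -p 92106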

I used the following command (-p specifies the target PID) for measuring internal CPU activity when running my queries:

perf stat -e task-clock,cycles,instructions,branches,branch-misses \
          -e stalled-cycles-frontend,stalled-cycles-backend \
          -e cache-references,cache-misses \
          -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses \
          -p 92106 sleep 30

On RHEL6 equivalents (and later) you can use the perf stat -d option to get similar detailed output without specifying all the counters separately – but I was on OEL5.8. Also, different CPU versions support different performance counters. Read the manuals and start from the simpler stuff.
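
So on a newer kernel, a roughly equivalent invocation would simply be something like this (a sketch – I couldn’t use it on my OEL5.8 box):

perf stat -d -p 92106 sleep 30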

Below is an example output from one test run – where I ran a full table scan against the last column of a regular row-oriented table (all cached in buffer cache) and took a perf stat snapshot of the entire SQL execution. Note that even though the table was cached in Oracle’s in-memory column store, I had disabled its use with the NO_INMEMORY hint, so this full table scan was done entirely via traditional Oracle buffer cache (no physical IOs!):

 Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz                     [33.33%]
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn [40.00%]
     7,386,220,210 branches                  #  269.828 M/sec                   [39.99%]
        22,056,397 branch-misses             #    0.30% of all branches         [40.00%]
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle    [40.00%]
    58,627,393,395 stalled-cycles-backend    #   67.83% backend  cycles idle    [40.00%]
       256,440,384 cache-references          #    9.368 M/sec                   [26.67%]
       222,036,981 cache-misses              #   86.584 % of all cache refs     [26.66%]
       234,361,189 LLC-loads                 #    8.562 M/sec                   [26.66%]
       218,570,294 LLC-load-misses           #   93.26% of all LL-cache hits    [ 6.67%]
        18,493,582 LLC-stores                #    0.676 M/sec                   [ 6.67%]
         3,233,231 LLC-store-misses          #    0.118 M/sec                   [ 6.67%]
     7,324,946,042 L1-dcache-loads           #  267.589 M/sec                   [13.33%]
       305,276,341 L1-dcache-load-misses     #    4.17% of all L1-dcache hits   [20.00%]
        36,890,302 L1-dcache-prefetches      #    1.348 M/sec                   [26.66%]

      30.000601214 seconds time elapsed

I ran perf for 30 seconds for the above experiment, kicked it running just before executing the Oracle SQL and it finished right after the SQL had completed.

Let’s go through some of the above metrics – top down. I’m explaining these metrics at a fairly high level and in the context of my experiment – fully measuring a single SQL execution in a single Oracle process:

Basic CPU Performance Counter Reference

  1. task-clock (~27373 milliseconds)
    – This is a software event and shows how much the target Linux task (my Oracle process) spent running on CPU during the SQL execution, as far as the OS scheduler knows (roughly 27 seconds on CPU).
    – So while perf took a 30 second snapshot of my process, my test SQL ran in a couple of seconds shorter time (so it didn’t run on CPU for all 30 seconds). That should explain the “0.912 CPUs utilized” derived metric above.
    .
  2. cycles – (86B cycles)
    – This hardware metric shows how many CPU cycles did my process (running a SQL statement) consume during perf runtime.
    – Dividing 86B CPU cycles by 27 CPU seconds shows that the CPU core must have operated at around 3.1 GHz (on average) during my SQL run.
    – Remember, earlier in this article I used turbostat to show how these 2.93 GHz CPU cores happened to be running at 3.19 GHz frequency thanks to turbo-boost!
    .
  3. instructions – (32B instructions)
    – This hardware metric shows how many instructions the CPU managed to successfully execute (and retire). This is where things get interesting:
    – It’s worth mentioning that modern CPUs are superscalar and pipelined. They have multiple internal execution units and can have multiple instructions (decoded to µops) in flight in their pipelines, with memory loads & stores happening concurrently and possibly out-of-order – instruction level parallelism, data level parallelism.
    – When dividing 32B executed instructions by 86B CPU cycles we see that we managed to execute only ~0.37 instructions per CPU cycle (IPC) on average!
    – When inverting this number we get 86B/32B=2.69 Cycles Per Instruction (CPI). So, on average, every CPU instruction took ~2.69 CPU cycles to execute! We’ll get to the “why” part later.
    .
  4. branches – (7.3B branches)
    – This hardware metric shows how many branches the executed code took.
    – A branch is basically a jump (an unconditional JMP instruction or a conditional jump like JZ, JNZ and many more – this is how basic IF/THEN/ELSE, CASE and various LOOP statements work at the CPU level).
    – The more decision-points in your code, the more branches it takes.
    – Branches are like speedbumps in a CPU execution pipeline, obstructing the execution flow and prefetching due to the uncertainty of which branch will be taken.
    – That’s why features like branch prediction with speculative execution are built into modern CPUs to alleviate this problem.
    – Knowing that we scanned through roughly 70M rows in this table, this is over 100 branches taken per row scanned (and counted)!
    – Oracle’s traditional block format rows are stored in run-length encoded format, where you know where the following column starts only after reading (and testing) the previous column’s length byte(s). The more columns you need to traverse, the more branches you’ll take per row scanned.
    .
  5. branch-misses – (22M, 0.3% of all branches)
    – This hardware metric shows how many times the CPU branch predictor (described above) mispredicted which branch would be taken, causing a pipeline stall.
    – Correctly predicting branches (where will the code execution jump next?) is good, as this allows the CPU to speculatively execute upcoming instructions and prefetch data required by them.
    – However, the branch predictor doesn’t always predict the future correctly, despite various advancements in modern branch prediction, like branch history tables and branch target buffers etc.
    – In a branch misprediction case, the mispredicted branch state has to be discarded, the pipeline flushed and the correct branch’s instructions fetched & put into the start of the execution pipeline (in short: mispredictions waste CPU cycles).
    .
  6. stalled-cycles-frontend – (~76.7B cycles, 88.7% of all cycles)
    – This hardware metric shows for how many cycles the front-end of the CPU was stalled, not producing new µops into the pipeline for backend execution.
    – The front-end of an Intel pipelined CPU is basically the unit that fetches the good old x86/x86_64 instructions from L1 instruction cache (or RAM if needed), decodes these to RISC-like µops (newer CPUs also cache those µops) and puts these into backend instruction queue for execution.
    – The front-end also deals with remembering taken branches and branch prediction (decoding and sending the predicted branch’s instructions into the backend).
    – The frontend can stall due to various reasons, like instruction cache misses (waiting for memory lines containing instructions to arrive from RAM or a lower level cache), branch mispredictions or simply because the backend cannot accept more instructions into its pipeline due to some bottleneck there.
    .
  7. stalled-cycles-backend – (~58.6B cycles, 67.8% of all cycles)
    – This hardware metric shows how many cycles in the back-end of the CPU were spent stalled instead of advancing the pipeline.
    – The backend of the CPU is where actual computation on data happens – any computation referencing main memory (not only registers) will have to wait until the referenced memory locations have arrived in CPU L1 cache
    – A common reason for backend stalls is due to waiting for a cache line to arrive from RAM (or a lower level cache), although there are many other reasons.
    – To reduce memory-access related stalls, the program should be optimized to do fewer memory accesses or switch to more compact data structures to avoid loading data it doesn’t need.
    – Simpler, more predictable data structures also help as the CPU hardware prefetcher may detect an “array scan” and start prefetching required memory lines in advance.
    – In the context of this blog series – sequentially scanning and filtering a column of a table’s data is good for reducing memory access related CPU stalls. Walking through random pointers of linked lists (cache buffers chains) and skipping through row pointers in blocks, plus many columns’ length bytes before getting to your single column of interest causes memory access related stalls.
    .
  8. cache-references – (256M references)
    – Now we get into a series of CPU cache traffic related metrics, some of these overlap
    – This metric shows how many Last Level Cache accesses (both read and write) were done.
    – The memory location that CPU tried to access was not in a higher level (L1/L2) cache, thus the lowest cache, Last Level Cache, was checked.
    – Last Level Cache, also called LLC, Lower Level Cache or Longest Latency Cache, is usually the L3 cache on modern CPUs (although there are some hints that some perf versions still report the L2 cache as LLC). I need to read some more perf source code to figure this out, but for this experiment’s purposes it doesn’t matter much whether it’s L2 or L3. If I scan through a multi-GB table, it won’t fit into either cache level anyway.
    .
  9. cache-misses – (222M misses)
    – This metric shows how many times the cache reference could not be satisfied by the Last Level Cache and therefore RAM access was needed.
    .
  10. LLC-loads – (234M loads)
    – The following 4 metrics just break down the above two in more detail.
    – This metric shows how many times a cache line was requested from LLC as it wasn’t available (or valid) in a higher level cache.
    .
  11. LLC-load-misses – (218M misses)
    – This metric shows how many LLC loads could not be satisfied from the LLC and therefore RAM access was needed.
    .
  12. LLC-stores – (18M stores)
    – This metric shows how many times a cache line was written into a LLC.
    .
  13. LLC-store-misses – (3M misses)
    – This metric shows how many times we had to first read the cache line into LLC before completing the LLC write.
    – This may happen due to partial writes (for example: cache line size is 64 bytes and not currently present in LLC and the CPU tries to write into first 8 bytes of the line).
    – This metric may get incremented due to other cache coherency related reasons where the store fails as other CPU(s) currently own the memory line and have locked and modified it since it was loaded into current CPU cache.
    .
  14. L1-dcache-loads – (7300M loads)
    – The following 3 metrics are similar to the above, but for the small (but fast) L1 cache.
    – This metric shows how many times the CPU attempted to load a cache line from L1-data cache into a register.
    – The dcache in the metric name means data accesses from memory (icache means instruction cache – memory lines fetched from L1I cache with instructions for execution).
    – Note how the L1D cache loads metric is way higher than LLC-loads (7300M vs 234M) as many of the repeated tight loops over small internal memory structures can be satisfied from L1 cache.
    .
  15. L1-dcache-load-misses – (305M misses)
    – This metric shows how many data cache loads from L1D cache couldn’t be satisfied from that cache and therefore a next (lower) level cache was needed.
    – If you are wondering how come the number of L1D cache load misses is much larger than LLC-loads (305M vs 234M – shouldn’t they be equal?), one explanation is that as there’s an L2 cache between L1 & L3, some of the memory accesses got satisfied in the L2 cache (and some more explanations illustrating the complexity of CPU cache metrics are here).
    .
  16. L1-dcache-prefetches – (37M prefetches)
    – This metric shows how many cache lines the CPU prefetched into the L1D cache (the DCU prefetcher).
    – Usually this simple prefetcher just fetches the next cache line to the “currently” accessed one.
    – It would be interesting to know if this prefetcher is smart enough to prefetch previous cache lines as regular row-formatted Oracle datablocks are filled from bottom up (this does not apply to the column-oriented stuff).
    – If the full table scan code walks the block’s row directory so that it jumps to the bottom of the block first and works its way upwards, this means that some memory accesses will look like scanning backwards – and may affect prefetching.

 

I hope that this is a useful reference when measuring what’s going on inside a CPU. This is actually pretty basic stuff in the modern CPU world; there’s much more that you can measure in CPUs (via raw performance counters for example) and also different tools that you can use, like Intel VTune. It’s not trivial though, as at such a low level even different CPU models by the same vendor may have different meanings (and numbering & flags) for their performance counters.

I won’t pretend to be a CPU & cache coherency expert, however these basic metrics and my understanding looks to be correct enough for comparing different Oracle access paths and storage formats (more about this in next parts of the series).
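
As a quick sanity check, the derived percentages that perf printed can be recomputed from the raw counts. Here’s a small sketch using the numbers from the example output above:

# recompute some of perf's derived ratios from the raw counter values
awk 'BEGIN {
  cycles    = 86428653040   # cycles
  fe_stall  = 76697049420   # stalled-cycles-frontend
  be_stall  = 58627393395   # stalled-cycles-backend
  llc_loads = 234361189     # LLC-loads
  llc_lmiss = 218570294     # LLC-load-misses
  printf "frontend cycles idle: %.2f%%\n", 100 * fe_stall / cycles
  printf "backend cycles idle : %.2f%%\n", 100 * be_stall / cycles
  printf "LLC load miss ratio : %.2f%%\n", 100 * llc_lmiss / llc_loads
}'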

One word of warning: it’s best to run these experiments on a bare-metal server, not in a virtual machine. This is a low-level measurement exercise and in a VM you could suffer from all kinds of additional noise, plus some of the hardware counters would not be available to perf. Some hypervisors do not allow the guest OS to access hardware performance counters by default. One interesting article (by Frits Hoogland) about running perf in VMs is here.

Ok, enough writing for today! I actually started this post more than a month ago and it got way longer than planned. In the next part of this series I will interpret this post’s full-table scan SQL CPU metrics using the above reference (and explain where the bottleneck/inefficiency is). And in Part 4 I’ll show you all the metrics from a series of experiments – testing memory access efficiency of different Oracle data access paths (indexes vs full table scan vs HCC vs in-memory column store).

Update: I have corrected a couple of typos, we had 86B cpu cycles and 32B instructions instead of 86M/32M as I had mistakenly typed in before. The instructions-per-cycle ratio calculation and the point remains the same though. Thanks to Mark Farnham for letting me know.

Advanced Oracle Troubleshooting v2.5 (with 12c stuff too)

$
0
0

It took a while (1.5 years since my last class – I’ve been busy!), but I am ready with my Advanced Oracle Troubleshooting training (version 2.5) that has plenty of updates, including some more modern DB kernel tracing & ASH stuff and of course Oracle 12c topics!

The online training will take place on 16-20 November & 14-18 December 2015 (Part 1 and Part 2).

The latest TOC is below:

Seminar registration details:

A notable improvement of AOT v2.5: now attendees will get downloadable video recordings after the sessions for personal use! So, no crappy streaming with a 14-day expiry date – you can download the video MP4 files straight to your computer or tablet and keep them forever!

I won’t be doing any other classes this year, but there will be some more (pleasant) surprises coming next year ;-)

See you soon!

NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

My Oracle OpenWorld presentations

$
0
0

Oracle OpenWorld is just around the corner – I will have one presentation at OOW this year and another at the independent OTW event:

Connecting Oracle with Hadoop

Real-Time SQL Monitoring in Oracle Database 12c

  • Conference: OpenWorld
  • Time: Wednesday, 28 Oct, 3:00pm
  • Location: Moscone South 103
  • Abstract: Click here (sign up to guarantee a seat!)

I plan to hang out at the OTW venue on Monday and Tuesday, so see you there!

 

NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

Connecting Hadoop and Oracle

$
0
0

Here are the slides of my OakTableWorld presentation from yesterday. They also include a few hints about what our hot new venture Gluent is doing (although bigger announcements come later this year).

[direct link]

Also, if you are at Oracle OpenWorld right now, my other presentation about SQL Monitoring in 12c is tomorrow at 3pm in Moscone South 103. See you there!

 

NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!


SQL Monitoring in Oracle Database 12c

$
0
0

Here’s my latest OOW presentation – SQL Monitoring in Oracle Database 12c:

[direct link]

You can download all my scripts from http://blog.tanelpoder.com/files/

 

 

NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

Troubleshooting Another Complex Performance Issue – Oracle direct path inserts and SEG$ contention

$
0
0

Here’s an updated presentation I first delivered at Hotsos Symposium 2015.

It’s about lots of concurrent PX direct path insert and CTAS statements that, when clashing with another bug/problem, caused various gc buffer busy waits and enq: TX – allocate ITL entry contention. This got amplified thanks to running this concurrent workload on 4 RAC nodes:

When reviewing these slides, I see there’s quite a lot that needs to be said in addition to what’s on slides, so this might just mean a (Powerpoint) hacking session some day!

NB! After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 16-20 November & 14-18 December 2015, so sign up now if you plan to attend this year!

My New Youtube Channel

$
0
0

I have created a new youtube channel – and have uploaded some videos there already! Bookmark & Subscribe here:

More stuff is coming over the next weeks & months :-)

 

NB! Dates updated: After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 14-18 December 2015 and 11-15 January 2016 (I had to reschedule the start from November to December). So sign up now if you want to learn new cool stuff!

RAM is the new disk – and how to measure its performance – Part 3 – CPU Instructions & Cycles

$
0
0

If you haven’t read the previous parts of this series yet, here are the links: [ Part 1 | Part 2 ].

A Refresher

In the first part of this series I said that RAM access is the slow component of a modern in-memory database engine and for performance you’d want to reduce RAM access as much as possible. Reduced memory traffic thanks to the new columnar data formats is the most important enabler for the awesome In-Memory processing performance and SIMD is just icing on the cake.

In the second part I also showed how to measure the CPU efficiency of your (Oracle) process using the Linux perf stat command. How well your applications actually utilize your CPU execution units depends on many factors. The biggest factor is your process’s cache efficiency, which depends on the CPU cache size and your application’s memory access patterns. Regardless of what OS CPU accounting tools like top or vmstat may show you, your “100% busy” CPUs may actually spend a significant amount of their cycles internally idle, with a stalled pipeline, waiting for some event (like a memory line arrival from RAM) to happen.

Luckily there are plenty of tools for measuring what’s actually going on inside the CPUs, thanks to modern processors having CPU Performance Counters (CPC) built in to them.

A key derived metric for understanding CPU efficiency is IPC (instructions per cycle). Years ago people actually talked about the inverse metric, CPI (cycles per instruction), as on average it took more than one CPU cycle to complete an instruction’s execution (again, due to the abovementioned reasons like memory stalls). However, thanks to today’s superscalar processors with out-of-order execution on a modern CPU’s multiple execution units – and with large CPU caches – a well-optimized application can execute multiple instructions per CPU cycle, thus it’s more natural to use the IPC (instructions-per-cycle) metric. With IPC, higher is better.

Here’s a trimmed snippet from the previous article, a process that was doing a fully cached full table scan of an Oracle table (stored in plain old row-oriented format):

Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz                     [33.33%]
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn [40.00%]
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle    [40.00%]
    58,627,393,395 stalled-cycles-backend    #   67.83% backend  cycles idle    [40.00%]
       256,440,384 cache-references          #    9.368 M/sec                   [26.67%]
       222,036,981 cache-misses              #   86.584 % of all cache refs     [26.66%]

      30.000601214 seconds time elapsed

The IPC of the above task is pretty bad – the CPU managed to complete only 0.37 instructions per CPU cycle. On average every instruction execution was stalled in the execution pipeline for 2.39 CPU cycles.
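
These derived numbers are easy to verify from the raw counters in the output above – a quick sanity check:

# recompute perf's derived ratios from the raw counts of this run
awk 'BEGIN {
  insns  = 32115412877
  cycles = 86428653040
  fe     = 76697049420
  printf "insns per cycle         : %.2f\n", insns / cycles   # ~0.37
  printf "stalled cycles per insn : %.2f\n", fe / insns       # ~2.39
}'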

Note: Various additional metrics can be used for drilling down into why the CPUs spent so much time stalling (like cache misses & RAM access). I covered the typical perf stat metrics in the part 2 of this series so won’t go in more detail here.

Test Scenarios

The goal of my experiments was to measure the CPU efficiency of different data scanning approaches in Oracle – on different data storage formats. I focused only on data scanning and filtering, not joins or aggregations. I ensured that everything would be cached in Oracle’s buffer cache or in-memory column store for all test runs – so disk IO was not a factor here (again, read more about my test environment setup in part 2 of this series).

The queries I ran were mostly variations of this:

SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0

Although I was after testing the full table scanning speeds, I also added two examples of scanning through the entire table’s rows via index range scans. This allows me to show how inefficient index range scans can be when accessing a large part of a table’s rows even when all is cached in memory. Even though you see different WHERE clauses in some of the tests, they all are designed so that they go through all rows of the table (just using different access patterns and code paths).

The descriptions of test runs should be self-explanatory:

1. INDEX RANGE SCAN BAD CLUSTERING FACTOR

SELECT /*+ MONITOR INDEX(c(cust_postal_code)) */ COUNT(cust_valid)
FROM customers_nopart c WHERE cust_postal_code > '0';

2. INDEX RANGE SCAN GOOD CLUSTERING FACTOR

SELECT /*+ MONITOR INDEX(c(cust_id)) */ COUNT(cust_valid)
FROM customers_nopart c WHERE cust_id > 0;

3. FULL TABLE SCAN BUFFER CACHE (NO INMEMORY)

SELECT /*+ MONITOR FULL(c) NO_INMEMORY */ COUNT(cust_valid) 
FROM customers_nopart c WHERE cust_id > 0;

4. FULL TABLE SCAN IN MEMORY WITH WHERE cust_id > 0

SELECT /*+ MONITOR FULL(c) INMEMORY */ COUNT(cust_valid) 
FROM customers_nopart c WHERE cust_id > 0;

5. FULL TABLE SCAN IN MEMORY WITHOUT WHERE CLAUSE

SELECT /*+ MONITOR FULL(c) INMEMORY */ COUNT(cust_valid) 
FROM customers_nopart c;

6. FULL TABLE SCAN VIA BUFFER CACHE OF HCC QUERY LOW COLUMNAR-COMPRESSED TABLE

SELECT /*+ MONITOR */ COUNT(cust_valid) 
FROM customers_nopart_hcc_ql WHERE cust_id > 0

Note how all experiments except the last one are scanning the same physical table just with different options (like index scan or in-memory access path) enabled. The last experiment is against a copy of the same table (same columns, same rows), but just physically formatted in the HCC format (and fully cached in buffer cache).

Test Results: Raw Numbers

It is not enough to just look into the CPU performance counters of different experiments, they are too low level. For the full picture, we also want to know how much work (like logical IOs etc) the application was doing and how many rows were eventually processed in each case. Also I verified that I did get the exact desired execution plans, access paths and that no physical IOs or other wait events happened using the usual Oracle metrics (see the log below).

Here’s the experiment log file with full performance numbers from SQL Monitoring reports, Snapper and perf stat:

I also put all these numbers (plus some derived values) into a spreadsheet. I’ve pasted a screenshot of the data below for convenience, but you can access the entire spreadsheet with its raw data and charts here (note that the spreadsheet has multiple tabs and configurable pivot charts in it):

Raw perf stat data from the experiments:

oracle scan test results.png

Now let’s plot some charts!

Test Results: CPU Instructions

Let’s start from something simple and gradually work our way deeper. I will start by listing the task-clock-ms metric that shows the CPU time usage of the Oracle process in milliseconds for each of my test table scans. This metric comes from the OS level, not from within the CPU:

task-clock-ms.png
CPU time used for scanning the dataset (in milliseconds)

As I mentioned earlier, I added two index (full) range scan based approaches for comparison. Looks like the index-based “full table scans” seen in the first and second columns use the most CPU time as the OS sees it (~120 and close to 40 seconds of CPU respectively).

Now let’s see how many CPU instructions (how much work “requested” from CPU) the Oracle process executed for scanning the same dataset using different access paths and storage formats:

oracle table scan instructions clean.png
CPU instructions executed for scanning the dataset

Wow, the index-based approaches seem to be issuing multiple times more CPU instructions per query execution than any of the full table scans. Whatever loops the Oracle process is executing for processing the index-based query, it runs more of them. Or whatever functions it calls within those loops, the functions are “fatter”. Or both.

Let’s look into an Oracle-level metric session logical reads to see how many buffer gets it is doing:

oracle buffer gets clean.png
Buffer gets done for a table scan

 

Wow, using the index with the bad clustering factor (1st bar) causes Oracle to do over 60M logical IOs, while the table scans do around 1.6M logical IOs each. Retrieving all rows of a table via an index range scan is super-inefficient, given that the underlying table size is only 1613824 blocks.

This inefficiency is due to index range scans having to re-visit the same datablocks multiple times (up to one visit per row, depending on the clustering factor of the index used). Each buffer re-visit would cause another logical IO and use more CPU cycles, except in cases where Oracle has managed to keep the buffer pinned since the last visit. The index range scan with a good clustering factor needs to do far fewer logical IOs, as given the more “local” clustered table access pattern, the re-visited buffers are much more likely to be found already looked up and pinned (shown as the buffer is pinned count metric in V$SESSTAT).
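
If you want to see this effect yourself, the relevant session counters can be compared before and after a test run. A minimal sketch, run from the session executing the scan:

SELECT sn.name, st.value
FROM   v$sesstat st, v$statname sn
WHERE  sn.statistic# = st.statistic#
AND    st.sid = SYS_CONTEXT('USERENV', 'SID')
AND    sn.name IN ('session logical reads', 'buffer is pinned count');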

Knowing that my test table has 69,642,625 rows in it, I can also derive an average CPU instructions per row processed metric from the total instruction amounts:

instructions per row.png

The same numbers in tabular form:

Screen Shot 2015-11-30 at 00.38.12

Indeed there seem to be radical code path differences (that come from underlying data and cache structure differences) that make an index-based lookup use thousands of instructions per row processed, while an in-memory scan with a single predicate used only 102 instructions per row processed on average. The in-memory counting without any predicates didn’t need to execute any data comparison logic at all, so it could do its data access and counting with only 43 instructions per row on average.
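
As a rough back-of-the-envelope comparison, the regular buffer cache full table scan’s per-row cost can be derived the same way from the Part 2 perf snapshot (~32.1B instructions for the whole statement). Note that this includes all of the statement’s work, not just the scan itself, and won’t exactly match the spreadsheet figure, which came from a separate run:

$ echo "32115412877 / 69642625" | bc     # instructions per row for the buffered row-format scan
461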

So far I’ve shown you some basic stuff. As this article is about studying the full table scan efficiency, I will omit the index-access metrics from further charts. The raw metrics are all available in the raw text file and spreadsheet mentioned above.

Here are again the buffer gets of only the four different full table scan test cases:

oracle buffer gets table scan only.png
Buffer gets done for full table scans

All test cases except the HCC-compressed table scan cause the same amount of buffer gets (~1.6M), as this is the original table’s size in blocks. The HCC table is only slightly smaller – I didn’t get great compression with the query low setting.

Now let’s check the number of CPU instructions executed by these test runs:

oracle table scan only instructions.png
CPU instructions executed for full table scans

Wow, despite the table sizes and number of logical IOs being relatively similar, the amount of machine code the Oracle process executes is wildly different! Remember, all that my query is doing is scanning and filtering the data, followed by a basic COUNT(column) operation – no additional sorting or joining is done. The in-memory access paths (columns 3 & 4) get away with executing far fewer CPU instructions than the regular buffered tables in row format and HCC format (columns 1 & 2 in the chart).

All the above shows that not all logical IOs are equal. Depending on your workload and execution plans (how many block visits, how many rows extracted per block visit) and underlying storage formats (regular row format, HCC in the buffer cache or compressed columns in the In-Memory column store), you may end up doing a different amount of CPU work per row retrieved by your query.

This was true before the In-Memory option and even more noticeable with the In-Memory option. But more about this in a future article.

Test Results: CPU Cycles

Let’s go deeper. We already looked into how many buffer gets and CPU instructions the process executed for the different test cases. Now let’s look into how much actual CPU time (in the form of CPU cycles) these tests consumed. I added the CPU cycles metric next to instructions for that:

instructions and cycles.png
CPU instructions and cycles used for full table scans

Hey, what? How come the regular row-oriented block format table scan (TABLE BUFCACHE) takes over twice as many CPU cycles as the instructions it executed?

Also, how come all the other table access methods use noticeably fewer CPU cycles than the number of instructions they’ve executed?

If you paid attention to this article (and previous ones) you’ll already know why. In the 1st example (TABLE BUFCACHE) the CPU must have been “waiting” for something a lot, with instructions spending multiple cycles “idle”, stalled in the pipeline, waiting for some event or necessary condition to happen (like a memory line arriving from RAM).

For example, if you are constantly waiting for the “random” RAM lines you want to access due to inefficient memory structures for scanning (like Oracle’s row-oriented datablocks), the CPU will be bottlenecked by RAM access. The CPU’s internal execution units, other than the load-store units, would be idle most of the time. The OS top command would still show you 100% utilization of a CPU by your process, but in reality you could squeeze much more out of your CPU if it didn’t have to wait for RAM so much.

In the other 3 examples above (columns 2-4), apparently there is no serious RAM (or other pipeline-stalling) bottleneck as in all cases we are able to use the multiple execution units of modern superscalar CPUs to complete more than one instruction per CPU cycle. Of course more improvements might be possible, but more about this in a following post.

For now I’ll conclude this (lengthy) post with one more chart with the fundamental derived metric instructions per cycle (IPC):

instructions per cycle.png

The IPC metric is derived from the previously shown instructions and CPU cycles metrics by a simple division. Higher IPC is better as it means that your CPU execution units are better utilized and get more done. However, as IPC is a ratio, you should never look at the IPC value alone – always look at it together with the instructions and cycles metrics. It’s better to execute 1 million instructions with an IPC of 0.5 than 1 billion instructions with an IPC of 3 – but looking at IPC in isolation doesn’t tell you how much work was actually done. Additionally, you’d want to use application-level metrics that give you an indication of how much application work got done (I used Oracle’s buffer gets and rows processed metrics for this).

Looks like there are at least 2 more parts left in this series (advanced metrics and a summary), but let’s see how it goes. Sorry for any typos, it’s getting quite late and I’ll fix ’em some other day :)

 

NB! Dates updated: After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 14-18 December 2015 and 11-15 January 2016 (I had to reschedule the start from November to December). So sign up now if you want to learn new cool stuff!

Gluent launch! New production release, new HQ, new website!

$
0
0

I’m happy to announce that the last couple of years of hard work is paying off and the Gluent Offload Engine is in production now! After beta testing with our early customers, we are now out of complete stealth mode and are ready to talk more about what exactly we are doing :-)

Check out our new website and product & use case info here!

Follow us on Twitter:

We are hiring! Need to fill that new Dallas World HQ ;-) Our distributed teams around the US and in London need more helping hands (and brains!) too.

You’ll be hearing more of us soon :-)

Paul & Tanel just moved in to Gluent World HQ

NB! Dates updated: After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 14-18 December 2015 and 11-15 January 2016 (I had to reschedule the start from November to December). So sign up now if you want to learn new cool stuff!
