Postgres Table Sizes

Sometimes it’s handy to know the tables in your DB that are taking up the most space. Here’s a quick query that returns the table names and pages used in descending order:


SELECT table_name, relpages
FROM information_schema.TABLES
INNER JOIN pg_class
ON relname=table_name
WHERE table_schema='public'
ORDER BY relpages DESC

 

The reason for the join on the information schema is to limit the results to just the tables you created. If you just use pg_class, then you also get the sizes of indexes and views including the postgres owned objects.

Leave a Comment

Time Limiting With SQL

When writing SQL queries it’s important to keep in mind the cost of the functions and operators you’re using. For example, using

date_trunc('day', timestamp_column) = '$day'
 
is much slower than using
timestamp_column BETWEEN '$day' AND '$day' + interval '1 day'
 

It may seem obvious but sometimes it’s good to experiment, even doing stupid things just to know what the impacts are.

I was writing a query today to get log data limited to just a certain day. The table in question has a “timestamp without timezone” type column called create_time. I wanted to write a query to give me all the rows that matched a certain day given as a string e.g. ’2011-05-06’. Usually I would use the BETWEEN operator on the create_time column like this:


SELECT blah FROM log
WHERE create_time
    BETWEEN '$day' AND timestamp '$day' + interval '1 day'

 

But I’ve been doing a bit of date munging using date_trunc so the first attempt at hacking up this query I had written:


SELECT blah FROM log
WHERE date_trunc('day', create_time) = '$day'

 

While the second version looks a bit neater, it actually has very crappy performance because it involves doing a conversion before each comparison where as the first version using BETWEEN only has to do a comparison (It does a couple of conversions once at the start of processing to turn the strings into timestamps).

The time difference between these two queries is <1ms vs 700ms BTW so the first version is definitely the right choice.

The next query has got me a little more stumped though: I need to get a list of the days for which data exists. The query I’m using is:


SELECT cast(create_time AS date) AS day FROM log
GROUP BY day
ORDER BY day

 

Which takes over a second to run. I don’t currently have the option of adding an index to this table on this particular server so I just have to put up with the very bad performance of this query. Or is there some trick that will get me a faster result?

Leave a Comment

CodeIgniter Database Memory Use In Long Running Scripts

PHP and associated frameworks aren’t optimised for running long scripts so when your app needs to do a lot of data crunching, there are often memory problems. In this case, I discovered CodeIgniter was keeping all my queries tucked away in its internal structures. Apparently it does this so you can access profiling information when the script finishes.

I was working on my hacky app for analysing a log file today when I loaded in a biggish log file and hit the dreaded memory limit exceeded problem:

PHP Fatal error: Allowed memory size of 33554432 bytes exhausted

I googled and stackoverflowed for PHP memory profiling solutions and pretty much saw only two options: xdebug and the built-in memory_get_usage(). I considered using xdebug but in this case it would mean installing it on a production server and potentially breaking things or else copying some big files to the office and trying to reproduce the problem on our development server which would also be a big hassle. I was sure I could find the issue more easily than that. Using memory_get_usage in a log statement, I was able to get some clues as to what was going on but it’s hard to use this as well because while I could see where the memory was growing, I couldn’t tell if the large arrays were being garbage collected at a later stage (which I’m pretty sure they were).

The way the log file processing works is like this. I iterate through the file a line at a time and grep for markers that let me see what type of log message it is. Depending on the grep result, I call into the appropriate method for parsing the log message and storing the result into the matching array for that log message type. Once any of these arrays gets to a certain number of elements, I call into the method which creates a bulk insert to the DB and clears out the array.

As best as I could tell, the big arrays would have been marked for garbage collection at this point in my method:


$this->tagset_events = array();
/* the former contents of tagset_events should
now be free for garbage collection */


 

After a bit of reading I came across this StackOverflow post which mentioned the database library and that got me thinking about whether CodeIgniter was keeping my queries hanging around and using up all my memory.

As a first experiment, I changed my bulk insert calls to use simple_query() instead of query() and ran my script again. The results were immediately a success! The script finished running and the memory use reported by the profiler was only 3MB as opposed to 32MB. From this I am concluding that CodeIgniter does keep your SQL queries tucked away somewhere by default which can use up memory if you’re processing a lot of queries (in this case it was in the order of 3000 queries each seeming to use about 8K of memory)

Leave a Comment

Optimising and Profiling a CodeIgniter App

CodeIgniter has a handy profiling class which automatically times queries and can be used to time blocks of code. There is no need to live with the pain of a slow app when often you can get improvements in the order of 100 times faster just by adding an index.

I have a hacked together CodeIgniter (aka CI) based web app here at work that I use to analyse some logfiles. The basic workflow of this app is to grep through the log files looking for certain events and create database rows for these events that I can then run queries against and produce a graphical representation of what’s going on (I use SVG for the graphics but I’m not going to talk about that in this post).

Since I wrote this app, I’ve had abysmally bad performance which has gradually worsened as I’ve added more log data to the system. Today it got bad enough that I decided to bite the bullet and sort the mess out.

My first attempt to locate the issue was to place log comments with time stamps at various points in the code. By looking at the timestamps on these messages in the CodeIgniter log, I was able to locate the problem methods in my app. It turned out I had two methods that were causing the issues. I picked the first of these methods and put some debug messages in the inner for loop to see if I could break things down further but unfortunately, the granularity of the log timestamps is only one second so I couldn’t get any further using this method unless I started rolling my own timing methods.

Rather than rolling my own at this point, I decided to investigate the CodeIgnitor profiling options, assuming it had some built in help. It turned out that yes indeed CI has a cool and useful profiling class built in. Not only does it automatically print the timings of your SQL queries, and overall page load time, it lets you mark points in the code and print out the execution times of sections of code (these are called benchmark points in CI). I broke my problem loop into three timed sections and enabled the profiling as per the CodeIgniter User Manual

At this point I saw that two of the sections in my loop had ~300ms execution time. I also noticed in the SQL query section that the queries launched by these methods also had ~300ms execution time so I knew that these queries were the problem as they were being called inside a for loop so the 300ms was quickly multiplying out to tens of seconds in my total page load time.

The query in question looked like this:


SELECT value FROM niotag_events
WHERE logfile_id = 14 AND uri='our://funny/url/scheme/id/thing'
AND evtime < '2011-04-29 11:02:34.345' AND resource='M123B'
ORDER BY evtime DESC LIMIT 1

 

As It turns out, I hadn’t put any indexes on my tables so it made sense that the queries were running slow. I’m not a database optimisation expert so I just decided to add an index that covered the columns in my where clause:


CREATE INDEX niotag_events_idx_1
  ON niotag_events
  USING btree
  (logfile_id, resource, uri, evtime DESC);

 

Just this index reduced the query to less than a millisecond and I immediately felt the page load speed halve. I then tackled three other queries on different tables using the same method with similar improvements.

The upshot of this exercise was that I saw my page load time go from 25 seconds to about 250ms, that’s a factor of a 100 times speed increase. Think about the minutes I was spending waiting for results of my log crunching app while I could have been clicking through to the next page to find what I was looking for. Think of the amount of boredom that was causing me and even worse, the windows for distraction as I clicked over to twitter or elsewhere just waiting for pages to load. This exercise was well worth the effort and my only question I have now is why didn’t I fix this sooner?

Leave a Comment

Counting Multiple Statistics in One Query

When constructing reports of grouped time series data from SQL tables, you can use inner joins and sub queries to make up the results tables you need and avoid collating the result arrays in PHP.

In a previous post, I mentioned that I was not entirely happy with the way I had collated the results of a trend history view for an error reporting app I was hacking together. The problem was that I needed to have a count of error and count of total throughput grouped by time so I could show the error percentage and raw error counts changing over time. I couldn’t find a way to tell SQL to count total throughput and just the errors in one query. I was thinking about this in the shower (which is where the best ideas always form) and I realised I could use sub queries and inner joins to get the data and avoid the messy array operations that I was doing in PHP. Here’s my query now:


SELECT
            a.tdlday, a.total, b.count
            FROM  (
                SELECT date_trunc('day', create_time) AS tdlday,
                    count(*) AS total
                FROM error_log
                WHERE create_time BETWEEN NOW() - interval '14 days' AND NOW()
                GROUP BY tdlday
                ) AS a
            INNER JOIN (
                SELECT date_trunc('day', create_time) AS tdlday2,
                    count(*) AS count
                FROM error_log
                WHERE create_time BETWEEN NOW() - interval '14 days' AND NOW()
                    AND $error_criteria
                GROUP BY tdlday2
                ) AS b ON b.tdlday2=a.tdlday
            ORDER BY tdlday ASC

 

As you can see, the query is two sub queries with an inner join. The first sub query gives me the totals and the second gives me the error count. The join is on the date group. This allowed me to remove the second query from my PHP, five lines of expensive array rejiggering and the expensive lookups in the view when calculating the percentages.

Leave a Comment

Quick and dirty bar graphs using PHP and CSS

You don’t have to load in bitmap / SVG / javascript based graphing libraries to add visual queues to your reports. You can get a long way using CSS.

I’ve recently been delving into some data analysis for a customer to get trends of error codes from a field device. The customer was concerned that the error rate is increasing. These things are sometimes hard to visualise because there is plenty of variation from day to day. For the sake of keeping my blog active I thought I’d note a few things and if anyone out there comes across this, maybe you could give some thoughts in the comments.

I decided to make a quick and dirty web app to visualise the data. I chose CodeIgniter because it seems to be pretty popular and I’ve worked with it before. CodeIgniter uses an MVC approach but in this case I haven’t much use for the ‘M’ straight up so I start with a controller that will give me the error percentages for each of the error codes over the last 24 hours. Here’s how I do the query:


SELECT error_code, count(*)
          FROM error_log
          WHERE create_time BETWEEN NOW() - interval '1 day' AND NOW()
          GROUP BY error_code
          ORDER BY error_code

 

(I’m using postgres BTW and haven’t paid too much attention to portability because this is a hack)

In the view, I print a table of the error codes and the counts for the last 24 hour period. The last column of the table is a basic bar graph which helps to immediately pick out the red flags.

I write the text for the table using the rows returned by my query which is simple enough, open a table tag, the do a foreach loop over the result rows generating and tags as approriate and close the table after the loop. The percentages are being calculated against the total throughput of items which I obtained in a separate query.

To draw the last column, I did something a little special. First for the sake of scale, I iterated through the array earlier to obtain the maximum percent value (I also incidentally used this loop to get the value for the total row):


$error_total = 0;
$bar_max = 0;
foreach ($rows as $row) {
    $error_total += $row['count'];
    $bar_max = max($row['count'], $bar_max);
}

 

Inside the main foreach loop, I make use of a little helper I wrote:


<td><?php echo bar($row['count']/$bar_max); ?></td>

 

The helper looks like this:


<?php                                 
function bar($fraction)
{       
   $res = '<div class="outerbar"><div class="bar" style="width:';
   $res .= round($fraction*100);
   $res .= '%;" >&nbsp;</div></div>';
   return $res;                       
}           
function vbar($fraction)
{
   $res = '<div class="outervbar"><div class="vbarspace" style="height:';
   $res .= round((1-$fraction)*100);
   $res .= '%;" >&nbsp;</div><div class="vbar" style="height:';
   $res .= round($fraction*100);
   $res .= '%;">&nbsp;</div></div>';
   return $res;
}       
?>     

 

(Don’t forget you have to load helpers with a $this->load->helper(‘bar_helper’);)

The ‘bar’ function creates a couple of nested divs and uses the style element to set the width of the bar. In the CSS you just need to style the outer bar to have the width you want and the inner bar to be the colour you want. Like this:


.outerbar {
    width: 200px;
}
.bar {
    background-color: #f00;
}

 

Some readers may notice that my table also links to trends and that the bar_helper file also had a vbar function. The trends were slightly more difficult. They look like this:

I had to do a bit more work to get the trends right. First the query:


SELECT
            date_trunc('day', create_time) AS tdlday, count(*)
            FROM error_log
            WHERE create_time BETWEEN NOW() - interval '14 days' AND NOW()
                AND error_code = '$trend_variable'
            GROUP BY tdlday
            ORDER BY tdlday ASC

 

This query gives me an array with a column of 14 days and a column of error counts. I did a second query to get the total throughputs for the day. Things got a bit weird because on weekends we had a little bit of throughput but no errors so the two arrays had different numbers of rows. I got around this in a fairly expensive way by turning the second query result into a keyed array. This seems pretty expensive and if I knew a way to do all this in one SQL query or simpler PHP it would be worth doing:


$totals_keys = array_map(
            create_function('$v', 'return $v["tdlday"];'), $totals);
        $totals_values = array_map(
            create_function('$v', 'return $v["count"];'), $totals);
        $data['totals'] = array_combine($totals_keys, $totals_values);

 

You can see I did two array_maps to split out the columns and then combined the arrays back into a keyed array. Now in the view, to access the throughput total for a day I need to do $totals[$errors[$ii][‘tdldate’]].

Another complexity of the view is that with the data in columns but the page view in rows, I had to iterate through the array once for each row. I came up with a dodgy way to save myself writing the loop four times by using an array lambdas, once for each row:


$row_lambdas = array(
    array('Bar',
        create_function('$ii,$errors,$totals,$max_bar',
            '$pc = $errors[$ii][\'count\']
                / $totals[$errors[$ii][\'tdlday\']] * 100;
            return vbar($pc/$max_bar);'
)
    ),

    array('Dates',
        create_function('$ii,$errors,$totals,$max_bar',
            'return substr($errors[$ii][\'tdlday\'],5,5);')
    ),

    array('Percent',
        create_function('$ii,$errors,$totals,$max_bar',
            '$pc = $errors[$ii][\'count\']
                / $totals[$errors[$ii][\'tdlday\']] * 100;
            return sprintf("%0.1f", $pc);'
)
    ),
    array('Throughput Totals',
        create_function('$ii,$errors,$totals,$max_bar',
            'return $totals[$errors[$ii][\'tdlday\']];')
    ),
    );
?>
<?php
foreach ($row_lambdas as $row_lambda)
{
    ?>
    <tr><td><?php echo $row_lambda[0];?></td>
     <?php
    for ($ii = 0; $ii < count($errors); ++$ii) {
        ?>
        <td><?php echo
           call_user_func($row_lambda[1],$ii,$errors,$totals,$max_bar); ?></td>
        <?php
    }
    ?>
</tr>
    <?php
}
?>

 

I know newer versions of PHP support closures instead of create_function but this is an oldish server.

The final trick to this little application was that when I did the vertical bar graph, the bars wanted to grow from the top of the cells down rather than growing up from the bottom. If you look at the vbar function above, you’ll see I use two inner divs to make the bars so that I fill the whole outer bar. The first bar is actually the space above the top of the bar and then the second bar is the actual coloured part of the bar.

Comments (1)