About Picalo

Welcome to the world of Picalo, a collaborative, open-source effort to produce a data analysis application suitable for auditors, fraud examiners, data miners, and other data analysts.
 
Example uses of Picalo:
  - Analyzing financial data, employee records, and purchasing systems for errors and fraud
  - Importing Excel, XML, EBCDIC, CSV, and TSV files into databases
  - Interactively analyzing network events, web server logs, and system login records
  - Importing email into relational or text-based databases
  - Embedding controls and fraud testing routines into production systems

 

Crash Project

    Thursday, May 26th, 2011 (7 comments)
    Categorized Under: Uncategorized

    I’m about 80 percent done with Picalo 5, and I need to take a break for a few months because of a crash project that was thrown on me this week. Hopefully the project will be done in a few weeks, but realistically, it might be a few months. Then I’ll get back to tying the new Picalo GUI to the back end (which is already available on PyPI).

    Download from PyPI

      Wednesday, March 16th, 2011 (4 comments)

      For anyone using the text-based version of Picalo (i.e. the libraries within Python programs), the next version has been released on PyPI under the name “picalo”.  In other words, you can simply run “easy_install picalo” or download it from http://pypi.python.org/pypi/picalo/.

      I’m still working on integrating the libraries into the GUI.  It’s a big change, so please be patient.  It’s going slower than I had hoped, but I’m still hopeful for an April or at least May release.

      GUI Progress

        Thursday, March 3rd, 2011 (2 comments)

        I’ve modified the GUI to work with the new data structures.  I’m now working on the object tree (the left-side list on the GUI) — it needs a lot of changes to support the new Project objects.

        Projects are where Picalo will now save all tables, database connection information, and queries.  The .pco, .pcd, and .pcq formats are not supported anymore.  It’s a big change, but the advantages of putting things into the Project (.pcp) file outweigh the transition problems we’ll have.  I think you’ll really be impressed with the performance of Picalo on huge datasets.

        Moving to the GUI

          Wednesday, February 16th, 2011 (1 comment)

          I finished programming and testing the data structures this week.  Picalo’s Tables, Records, and Columns are ready to go with the new ZODB structure.  It works great and seems to be bug-free, but I’m sure we’ll find a few more bugs when more people start using it.  I added a lot of new tests to the TestAll.py file, which tests the entire back end.

          I’m going to spend a couple days readying a console-only version of Picalo with these back end structures.  Picalo needs to see more use in the general Python community — there’s really nothing else like it out there for database analysis.

          Once I finish with the text-based release, I’ll move to the GUI.  The Picalo GUI application needs several changes to support the new structures, especially the top-level Project object.

          Things keep moving along…

          Initial speed/memory tests

            Monday, January 31st, 2011 (1 comment)

            I just finished my initial speed tests with the ZODB back end.  On my computer, speed is about one-half to one-third of what it is with everything in memory (i.e. Picalo tables using Python lists).  So there is definitely a cost to the new structure.

            Memory usage is a huge win, which is what this whole change is about.  When I’m only running the Picalo libraries (no GUI, etc.), memory tops out at about 60-70 MB.  This occurs regardless of the number of records I put into the table: 10K, 100K, even a million or more.  ZODB is doing what it claims: it’s managing memory for me.  It automatically swaps things in and out of memory.

            With previous versions of Picalo, where everything was kept in memory, resource usage grew with the size of the open tables.  If a 500 MB file was opened, Picalo took 500 MB (actually quite a bit more because of the overhead of each record in memory).  With the new ZODB back end, memory usage tops out and resources are used efficiently.
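The per-record overhead mentioned above can be seen with a rough sketch like the following. It assumes CPython and uses `sys.getsizeof`, which counts only each object's own bytes, so the numbers are approximate. The sample record and the `record_footprint` helper are made up for illustration and are not part of Picalo.

```python
import sys


def record_footprint(values):
    """Approximate bytes used by one record held as a Python list.

    Adds the size of the list container to the size of each value object.
    """
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


# one record as it would sit in a CSV file (hypothetical sample data)
line = "1001,Ann Smith,1523.75"
fields = line.split(",")

# the same record once parsed into Python objects
record = [int(fields[0]), fields[1], float(fields[2])]

on_disk = len(line.encode("utf-8"))
in_memory = record_footprint(record)
print("on disk: %d bytes, in memory: about %d bytes" % (on_disk, in_memory))
```

On CPython the in-memory figure comes out several times larger than the text on disk, which is why a table held entirely in lists can need much more memory than the size of the file it came from.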

            Another big win is that I was able to keep the old-style tables available.  In other words, if you want to use in-memory tables like Picalo had before, you are welcome to.  ZODB memory management only applies when you add a table to a “Project” object.  Tables inside a project object are memory-managed.  Tables outside projects (created on the fly, loaded from CSV, etc.) are kept in memory.  Switching between the two is easy: simply add a table to a project to use ZODB, or remove it to stop using it.

            Finally, I’ve taken the opportunity to create virtual column mapping within the Table structure.  When columns are added, deleted, moved, or rearranged in any other way, only a virtual mapping is switched.  It’s essentially instantaneous.  These operations could take quite a while in previous versions of Picalo.  In this new version, table-level operations are fast.
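Here is a minimal sketch of the virtual-mapping idea. It is not Picalo's actual implementation: the `MappedTable` class and its method names are hypothetical, and the only point is to show why column operations can avoid touching the stored rows.

```python
class MappedTable(object):
    """Hypothetical table whose visible columns are indexes into stored rows."""

    def __init__(self, names):
        self.names = list(names)                     # visible column names, in order
        self.mapping = list(range(len(self.names)))  # visible position -> physical slot
        self.width = len(self.names)                 # physical slots allocated so far
        self.rows = []                               # physical rows, never rearranged

    def append(self, *values):
        if len(values) != len(self.names):
            raise ValueError("expected %d values" % len(self.names))
        row = [None] * self.width
        for slot, value in zip(self.mapping, values):
            row[slot] = value
        self.rows.append(row)

    def record(self, index):
        row = self.rows[index]
        # rows stored before a column was added are shorter than its slot number
        return [row[slot] if slot < len(row) else None for slot in self.mapping]

    def move_column(self, old_pos, new_pos):
        # only the two small lists change; no row is touched
        self.names.insert(new_pos, self.names.pop(old_pos))
        self.mapping.insert(new_pos, self.mapping.pop(old_pos))

    def delete_column(self, pos):
        # the physical slot is simply no longer referenced
        del self.names[pos]
        del self.mapping[pos]

    def add_column(self, name):
        self.names.append(name)
        self.mapping.append(self.width)
        self.width += 1


table = MappedTable(["id", "name", "amount"])
table.append(1, "Ann", 10.5)
table.append(2, "Bob", 7.25)
table.move_column(2, 0)      # show "amount" first
table.delete_column(1)       # drop "id", now at position 1
table.add_column("note")     # old rows read as None for the new column
table.append(3.0, "Cy", "late")
print(table.names, table.record(0), table.record(2))
```

In this sketch the cost of moving, deleting, or adding a column depends on the number of columns, not the number of records, which matches the "essentially instantaneous" behavior described above.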

            I’m currently working through the TestAll.py file, which tests all the internal functions in Picalo.  I’m fixing any bugs these changes have caused.  When I’m done, I’ll work on the GUI layer.  It needs some modification to deal with the new Project objects.  Then I’ll release a beta version for people to react to.

            Picalo 5

              Wednesday, January 12th, 2011 (7 comments)

              I took the last half of 2010 off for personal reasons, but I’m back into Picalo development.  This month I’m pushing towards Picalo 5.00, which will feature a revamped data repository.  For those who were waiting for PyTables, the bad news is that it didn’t work.  It is geared towards more numerical data than many Picalo users are keeping, and it made changes to data (such as adding static and calculated columns) too difficult.  The good news is that I’m pushing down another path: ZODB.  I checked into dozens of different projects that enable persistence, and much to my surprise, ZODB fit the best.

              I know that Zope can be a polarizing subject: some Python programmers love it and others hate it.  I’m not integrating into the full Zope structure, just into the Zope Database (ZODB).  ZODB has a long history of keeping all sorts of data, and it brings automatic disk persistence, transactions (commit/abort), and global undo on tables.  My initial integrations of ZODB into Picalo have been successful.  Speed is good so far, but most importantly, this new underbelly will support efficient memory management in huge tables.

              Be prepared for some API changes in Picalo 5.  The integration of ZODB requires that I make changes to the way tables are saved in directories.  The changes will be centered around the Record and Column objects.  The file format will most certainly change, so you’ll need to upgrade all your tables.  I’m not sure how many (if any?) are keeping their data in .pco format.  Please post comments if you are keeping data in .pco format.  I would assume most people are keeping data in delimited text files or databases, not in Picalo format.

              I’ll keep you posted on the progress.

              PyTables

                Wednesday, July 28th, 2010 (10 comments)

                I have not finished the work on integrating PyTables into Picalo.  We moved this summer and have been homeless for a few months while we build a home.  It’s been a difficult summer to work on my OSS projects.  I am still excited about the prospects of what PyTables can bring to Picalo, so please stay tuned!

                Goal seeking for a total value

                  Tuesday, February 16th, 2010 (no comments)

                  I’ve now been asked by two different people if Picalo can add (or subtract) a group of numbers together to see if any combinations equal a goal value.  This is useful when a person comes up with a missing value during an audit.  For example, if $7.07 is left after verifying some numbers, it must be the result of one or more of the values in the table.

                  The problem is going through the hundreds of thousands of potential combinations by hand.  Picalo itself doesn’t do this (yet), but here’s a script that makes quick work of it.  Note that the algorithm is *exponential*: each number can be added, subtracted, or left out, so n numbers give 3^n - 1 combinations.  It takes about 1 minute for my laptop to go through the 9 numbers in the script (which results in almost 20,000 possible combinations).  If you have hundreds of numbers, prepare to wait a while.

                  The script has dummy numbers in it and seeks a value of $7.07.  It comes up with 5 potential matches.  Readers should easily be able to use this script with their own data.  I’ll include it as a detectlet in a future version of Picalo.

                  [codesyntax lang="python"]

                  # written by Conan Albrecht
                  # to be included in a future version of Picalo
                  # released under the LGPL

                  # import the Picalo libraries
                  from picalo import *
                  # import commonly-needed built-in libraries
                  import string, sys, re, random, os, os.path, urllib
                  import itertools

                  # the goal we want to find
                  goal_value = 7.07

                  # create our table of numbers
                  # this would normally be done by importing the data
                  # in other words, delete this part here.  i'm only creating
                  # it to make the script self-sufficient.
                  nums = Table([("Num", number, "#.00")])
                  nums.append(15.50)
                  nums.append(42.33)
                  nums.append(-33.11)
                  nums.append(5.67)
                  nums.append(22.31)
                  nums.append(-15.23)
                  nums.append(15.55)
                  nums.append(14.88)
                  nums.append(-7.33)

                  # create a table to store the results in
                  results = Table([
                    ('Formula', str),
                    ('Answer', number),
                    ('DiffFromGoal', number),
                  ])

                  try:
                    comb = {}  # dict keys remove duplicate combinations efficiently
                    # each number may enter the total as a positive or a negative value
                    pos_neg = [ [ abs(n), -1*abs(n) ] for n in nums.column('Num') ]
                    prod = list(itertools.product(*pos_neg))
                    for pi, p in enumerate(prod):
                      show_progress('Finding all possible combinations...', float(pi) / float(len(prod)))
                      for i in range(1, len(p)+1):
                        for c in itertools.combinations(p, i):
                          if c not in comb:
                            comb[c] = None

                    # go through and fill out the results table
                    for i, c in enumerate(comb.keys()):
                      show_progress('Checking combinations...', float(i) / float(len(comb)))
                      formula = ' + '.join([ '(%s)' % num for num in c])
                      total = sum(c)
                      results.append(
                        formula,
                        total,
                        abs(goal_value - total)
                      )

                    # sort the results
                    show_progress('Sorting...', .99)
                    Simple.sort(results, True, 'DiffFromGoal')

                  finally:
                    clear_progress()

                  [/codesyntax]
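For readers without Picalo installed, the same search can be sketched in plain Python. This is an illustration under a few assumptions, not the detectlet itself: the `goal_seek` and `count_candidates` names are hypothetical, amounts are converted to whole cents to avoid floating-point rounding, and the function returns only the combinations within a tolerance of the goal instead of a sorted table of every combination. Giving each number a sign of +1, -1, or 0 (left out) covers the same candidates as the script above.

```python
import itertools


def count_candidates(n):
    """Number of non-empty signed combinations of n numbers: 3**n - 1."""
    return 3 ** n - 1


def goal_seek(values, goal, tolerance_cents=0):
    """Return lists of signed terms whose sum is within tolerance of goal."""
    cents = [int(round(abs(v) * 100)) for v in values]
    goal_cents = int(round(goal * 100))
    matches = []
    # a sign of 0 leaves the number out, 1 adds it, -1 subtracts it
    for signs in itertools.product((0, 1, -1), repeat=len(cents)):
        if not any(signs):
            continue  # skip the empty combination
        total = sum(s * c for s, c in zip(signs, cents))
        if abs(total - goal_cents) <= tolerance_cents:
            matches.append([s * c / 100.0 for s, c in zip(signs, cents) if s])
    return matches


# small made-up example: which signed combinations of these amounts give 1.50?
for terms in goal_seek([1.00, 2.50, 4.00], 1.50):
    print(" + ".join("(%s)" % t for t in terms))

print(count_candidates(9))  # size of the search for nine numbers
```

Because the count grows as 3^n, this brute-force approach is only practical for a few dozen numbers at most, which is the same limitation noted for the script above.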

                  Version 4.39

                    Friday, January 22nd, 2010 (3 comments)

                    I posted version 4.39 today.  It’s got a lot of important changes in it.  Most importantly, I’m finally happy with the way field formatting is done.  Date and number formats are a mess inherently — so many formats, so many ways to represent things.  They affect how Picalo imports data from CSV and other formats, and they affect how values come from databases.  I’ve introduced a format specification into Picalo that should clear up the mess without breaking existing tables.

                    There are lots of other changes as well.  Those interested can read the README file.  Enjoy.

                    The Workbook is Here!

                      Wednesday, December 2nd, 2009 (3 comments)

                      After many hours and sore wrists, I’ve finished the Picalo Workbook.  It’s a hands-on introduction to Picalo.  It’s got exercises, tasks, and sample datasets.  I’m hoping it will be a great introduction to Picalo for new users.

                      The workbook is the result of some training I did for a client two weeks ago.  The client wanted exercises for the group to go through.  After the training, I took the exercises and sample data sets and formally put them into the workbook.  Note that the datasets were entirely generated and have no relation to the client.

                      The workbook is still missing a few items.  I also found some bugs in Picalo while working through it.  I also need to edit the text to ensure grammar is correct.  But for now, it’s 95 percent there and is up on the web site in the documentation section.

                      P.S. For those users waiting for PyTables integration, it’s still coming.  I want to crank through these bugs I found, then I’ll get back to work on it.  I’m hoping the Christmas break will give me some time to get it pushed forward quite a bit.