MyISAM or InnoDB? (an ongoing question)

There seems to be posts like this that pop up weekly. Choosing the “right” storage engine is a basic first step in setting up the experimental workflow environment to operate in. Each has pros and cons, but it always comes back to what you want/expect to be able to do with your data.

What should I use MYISAM or INNODB?

Posted in DBMS. Tags: . Leave a Comment »

Drizzle

WTF? This may be applicable once it gets off the ground. A stripped down version of MySQL (that will also have the latest InnoDB architecture). Leaner and meaner options are always tempting.

Drizzle: Lightweight SQL Database for Cloud and Web

Posted in DBMS. Tags: . Leave a Comment »

Inserting Data Into Tables

After having created and set up tables for an experiment they’re ready to have data loaded into them. In this example, I will load ascii files that have already been formatted into a structure that matches the tables to which I’m loading. I’ve again embedded the procedure in a python script, so I can load multiple ascii files (pertaining, say, to multiple experimental conditions and runs) pretty easily.

#!/usr/bin/python

import commands
import os
import sys
import time
import MySQLdb

subjs = ['6']
conditions = ['classaction','classobject','veraction','verobject']
hemis = ['lh','rh']
runs = ['1','2','3','4','5','6']


try:
    connection = MySQLdb.connect(read_default_file="~/.my.cnf",db="yourdb" )
except MySQLdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit (1)

cursor = connection.cursor()

def insertdata_to_db():
    print "que hora es?\n"+time.ctime()
    try:
        file = "'/disks/gpfs/fmri/results/s0"+ss+"/SUMA/surface.data/s0"+ss+"."+h+"."+cc+"BT.run"+rr+".readyfordb.txt'"
        print "File loading: "+file
        insert_statement = "load data local infile "+file+" into table vertexdata fields terminated by ' ';"
        print "Insert statement: "+insert_statement
        cursor.execute(insert_statement)
        print "Number of rows inserted: %d" % cursor.rowcount
    except MySQLdb.Error, e:
        print "Error during s0"+ss+"_"+cc+"_"+h+" run "+rr+" %d -->> %s" % (e.args[0], e.args[1])
        sys.exit (1)


for ss in subjs:
    for cc in conditions:
        for rr in runs:
            for h in hemis:
                insertdata_to_db()

The script again uses the MySQLdb module. I’ve created a function called insertdata_to_db that uses this sql as an insert statement
load data local infile filename into table vertexdata fields terminated by ' ';
because I’m loading a file from the local fileshare, and the columns in this ascii file are separated by a space, denoted in the sql with fields terminated by ' '. The function also looks for exceptions and boots out if there is an error while loading. I then run this function within nested loops to iterate over subjs, conditions, runs, and hemis.

Creating Tables for your Experiment

[This assumes you've already set up yourself with a relational DBMS (MySQL and PostgreSQL are usually winners) and have access to create tables in your database. FYI: I'm using MySQL.]

Setting up a DBMS requires quite a bit of thought and foresight. The flexibility available is very broad, so the real constraints are more with the user rather than the infrastructure. Some questions to consider:

  • At what stage of processing do I want to put the data into tables (e.g. raw time series, motion-corrected time series, “cleaned” time series, just the HRFs for conditions)?
  • What information do I want to index (e.g. ROI identifiers, voxels, subject identifiers)?
  • What meta data do I want to hold?
  • Do I want primary key and foreign key usage?
  • What sort of relationships do I want between tables?
  • For first steps, it might be best to start simple and let the complexities emerge and be handled as more experience with the infrastructural tools is gained.

    For actually creating the tables in my database, I’ve found MySQL Administrator (part of the MySQL GUI Tools) to be extremely straightforward and helpful. Of course, procedures for creating tables can also be scripted and make use of simple SQL statements such as:

    create table afni_db_test (x INT, INDEX(x), y INT, INDEX(y), z INT, INDEX(z), consistent float, filler float, inconsistent float);
    

    In this line, the table created is called afni_db_test, columns are x, y, z, filler, and inconsistent, with x, y, and z being indices.
    And here is that statment embedded in a python script (using the MySQLdb module):
    #!/usr/bin/python
    
    import os
    import sys
    import commands
    import MySQLdb
    
    _user='you'
    _host='yourhost'
    _passwd='yourpassword'
    _db='test'
    
    try:
        connection = MySQLdb.connect(host=_host,user=_user,db=_db,passwd=_passwd)
    except MySQLdb.Error, e:
        print "Error %d: %s" % (e.args[0], e.args[1])
        sys.exit (1)
    
    cursor = connection.cursor()
    
    create_table_statement = 'create table afni_db_test (x INT, INDEX(x), y INT, INDEX(y), z INT, INDEX(z), consistent float, filler float, inconsistent float);'
    
    cursor.execute(create_table_statement)
    

    Choosing the type of storage engine you would like to use is also something to consider and dependent on what you’d like to do with your data. By default, MySQL uses MyISAM, but InnoDB is also popular. The InnoDB engine supports primary and foreign key constraints, while the MyISAM engine does not. Changing the type of storage engine is very easy in the MySQL Administrator GUI. Just go to the ‘Catalogs’ tab, select your Schemata and you get a list of the tables in your database. Then, double-click the table you want and its schema will open in a new window. There, just choose the ‘Table Options’ tab and in the drop-down menu you can select a new storage engine (click for bigger image):

    More about storage engines available with MySQL can be found here.

    Simplified Data Visualization #1

    This is a simple Python script that uses the MySQLdb module and AFNI’s DriveSuma program for querying a database and taking jpeg snapshots to get an immediate representation of surface data. This particular code iterates over a set of subjects – established here via an environment variable called “LV_SS_LIST” that is a list of subject number identifiers for a particular group of subjects that were given the same experimental stimuli. The query:
    select emblemfemlh.voxel, round(speechT,4) from emblemfemlh where speechT > 1.796 and subject = "+ss+";"
    selects the voxel and a T value for the speech condition “speechT” with the parameters that those T values are greater than a threshold (1.796) for each subject.

    #!/usr/bin/python
    # this is to get query output onto SUMA brains
    
    import commands
    import os
    import sys
    import MySQLdb
    
    
    subjs = os.getenv("LV_SS_LIST").split()
    
    os.system("suma -niml &" )
    os.system("DriveSuma -com show_surf -label mamba -i_fs $SUBJECTS_DIR/HORRY/SUMA/HORRY.lh.mesh140_std.smoothwm.asc -com viewer_cont -load_view $gpfsemb/sweetview.niml.vvs -key ctrl+right -key R" )
    
    #--------------------------- connection mechanism, query and creation of tmp file ---------------
    try:
            connection = MySQLdb.connect(db = "yourdb",read_default_file="~/.my.cnf" )
    except MySQLdb.Error, e:
            print "Error %d: %s" % (e.args[0], e.args[1])
            sys.exit (1)
    
    for ss in subjs:
            id = commands.getoutput("echo $$" )
            outfile = "/tmp/outfile."+id+".threshd"+ss+".1D"
            query = "select emblemfemlh.voxel, round(speechT,4) from emblemfemlh where speechT > 1.796 and subject = "+ss+";"
            cursor = connection.cursor()
            cursor.execute(query)
            result = list(cursor.fetchall())
            bully = ''
            for rr in list(result):
                    bully += (str(rr[0])+" "+str(rr[1]))+"\n"
    
            file = open(outfile,"w" )
            file.write(bully)
            file.close()
            os.system("DriveSuma -com surf_cont -load_dset "+outfile+" -com surf_cont -1_only y -I_sb 1 -I_range 10 -switch_cmap afni_p11 -com recorder_cont -save_as "+outfile+".jpg" )
    
    #---------------------------- kill suma below ----------------
    os.system("DriveSuma -com kill_suma" )
    

    MySQL for Python

    MySQLdb is the Python DB API-2.0 interface. This module has proven very useful for coding db queries into procedures in Python (my preferred coding language). Below is a simple procedure using this module that performs a query and writes the result to a file, where it can be used for analysis in R:

    #!/usr/bin/python
    
    import commands
    import os
    import sys
    import MySQLdb
    
    
    try:
        connection = MySQLdb.connect(db="yourdb",read_default_file="~/.my.cnf" )
    except MySQLdb.Error, e:
        print "Error %d: %s" % (e.args[0], e.args[1])
        sys.exist (1)
    
    query = "select subject, vertex, speech_lag, emblem_lag, embspeech_lag from ccf_phase2_lh where seed_region = 'IDEAL' and vertex between 1 and 5;"
    
    outfile = os.getcwd()+"/test_aov.lh.txt"
    cursor = connection.cursor()
    cursor.execute(query)
    result = list(cursor.fetchall())
    bully = ''
    for rr in list(result):
        bully += str(rr[0])+" "+str(rr[1])+" "+str(rr[2])+" "+str(rr[3])+" "+str(rr[4])+"\n"
    
    
    file = open(outfile,'w' )
    file.write(bully)
    file.close()
    

    MySQLdb can be found here.

    Posted in DBMS, Python. Tags: , . 1 Comment »

    Permutation distribution of Friedman Test data

    Because of the need to correct for multiple comparisons, I explored building a permutation distribution for the data analyzed using the Friedman test. This paper describes a procedure for doing so; however, it’s intricacies are pretty deep and tangled (for me, at least). The procedures it describes though involve switching condition labels, so that principle is something applied in the approach I took for my data. Basically, for half of the participants (selected via random sampling ) the condition labels are switched in random order. To maintain the proper shuffling of labels (for condition labels applied to specific subjects) a configuration file was first generated using this R code:

    cc <- as.vector(c(rep(0,12),rep(1,12)))
    con <- matrix(nrow=4000,ncol=27)
    mat_row <- 0;
    
    for (i in 1:nrow(con)){
        mat_row <- mat_row + 1
        con[mat_row,] <- c(sample(cc),sample(1:3))
    }
    write.table(con,file="perm_config_file.txt",row.names=FALSE,col.names=FALSE,quote=FALSE)
    

    This creates a 4000 X 27 configuration matrix, where each row is twenty-four long combination of 1′s and 0′s (12 and 12 for distinguishing which 12 of the 24 total subjects that will have labels switched) and 1, 2, 3 sampled in random order (for the ordering of the three conditions). One line then configures one permutation. There are 4000 rows for 4000 permutations of the whole-brain data to be run.

    This is the R code for running these permutations via a Swift workflow:

    #---- doing friedman permutations
    #---- coded Wednesday; July 2, 2008
    
    #---- this is the function used in the aggregate:
    FriedmanPerm <- function(x){
        mm <- matrix(nrow=24,ncol=3)
        mm[,1] = x[1:24]
        mm[,2] = x[25:48]
        mm[,3] = x[49:72]
        a <- mm[subj_resamp,cond_resamp]
        b <- mm[subj_same,]
        mm_shuffle <- rbind(a,b)
        return(friedman.test(mm_shuffle)[[1]][[1]])
    }
    
    #---- Swift housekeeping:
    allinputs <- Sys.getenv("R_SWIFT_ARGS");
    print(Sys.getenv("R_SWIFT_ARGS")); 
    outname <- noquote(strsplit(allinputs," ")[[1]][1])
    configLine <- as.integer(noquote(strsplit(allinputs," ")[[1]][2]))
    print(configLine)
    
    inputfile <- Sys.getenv("R_INPUT");
    print(inputfile)
    Query_out <- as.matrix(read.table(inputfile))
    config <- as.matrix(read.table("perm_config_file.txt"))
    cond_resamp <- as.vector(config[configLine,25:27])
    subj_resamp <- which(as.vector(config[configLine,1:24])==1)
    subj_same <- which(as.vector(config[configLine,1:24])==0)
    
    #---- this is where the fun begins:
    data_stack <- stack(data.frame(Query_out[,3:5]))[1]
    vertices <- Query_out[,2]
    nn <- data.frame(cbind(vertices,data_stack))
    m <- matrix(nrow=length(as.integer(levels(as.factor(vertices)))),ncol=2)
    m[,1] <- as.integer(levels(as.factor(vertices)))
    m[,2] <- aggregate(nn$values, list(nn$vertices),FriedmanPerm)[,2]
    write.table(round(m,5), file=paste(configLine,"PermFriedman",outname,".txt",sep=""), row.names=FALSE, col.names=FALSE, quote=F)
    

    The Swift code sends this R script as well as the configuration matrix file to the grid site, where it does the query and analysis. Besides sending the configuration file, it performs the same db query and is otherwise much the same code in the previous post for executing the Friedman test.

    type file{}
    
    (file qout, file rout) run_query (string allcatargs, file config, file r_script, file r_config_table){
        app{
            Mediator allcatargs stdout=@filename(qout) @filename(r_script) @filename(r_config_table);
        }
    }
    
    string user = @arg("user");
    string db = "yourdb";
    string host = "yourhost";
    
    file r_script<single_file_mapper; file="permFriedman.R">;
    file r_config_table<single_file_mapper; file="perm_config_file.txt">;
    file config<single_file_mapper; file="user.config">;
    
    loop_query(int bvox, string user, string db, string host, string query_outline, file r_script, file config, file r_config_table, string id){
        int evox = bvox+13999;
        string baseid = "PermFriedman";
        string r_swift_args = @strcat(id);
        string theoutprefix = @strcat(id,baseid,bvox,"_",evox);
        string med_args = @strcat("--user ","andric",
            " --conf ", "user.config",
            " --db ", db,
            " --host ", host,
            " --query ", query_outline,
            " --r_script ", @filename(r_script),
            " --begin_vox ", bvox,
            " --end_vox ", evox,
            " --outprefix ", theoutprefix,
            " --batchstep ", "14000",
            " --r_swift_args ", r_swift_args,
            " --subject ", id);
        file q_result <single_file_mapper; file=@strcat("query_results/",theoutprefix,".qresult")>;
        file r_result <single_file_mapper; file=@strcat(theoutprefix,".txt")>;
        (q_result, r_result) = run_query(med_args, r_script, config, r_config_table);
    }
    
    int permbrains = [1:1000:1];
    foreach perm in permbrains {
        int mybatches = [1:196000:14000];
        foreach batch in mybatches {
            string id = @strcat(perm);
            string query_outline = @strcat("select subject, vertex, speech_lag, emblem_lag, embspeech_lag from ccf_phase2_lh where vertex between BEGIN_BATCH and END_BATCH");
            loop_query(batch, user, db, host, query_outline, r_script, config, r_config_table, id);
        }
    }
    

    Follow

    Get every new post delivered to your Inbox.