#3 new
jtprince

to cast or not to cast?

Reported by jtprince | February 12th, 2009 @ 05:26 PM

With almost every file format and object, the decision has to be made (and remade) as to whether the data values should be cast, and if so, when?

For example:

The Query objects in the Ms::Mascot::Dat hierarchy have several different attributes, and most of these are 'really' floats and ints. A spectrum is given as a comma/colon separated string, but at the end of the day, it is really a Spectrum.

In the past, I've let the most common usage be my guide. If a person will most likely just reformat the data, then I will leave it uncast (i.e., as strings) so as to avoid the cast-uncast overhead. However, if the data will actually be used for what it is, then I am prone to cast it up-front and store the cast data.

So, here are the possibilities with some pros and cons:

  1. No casting
     pros: easiest to write, no overhead if just reformatting data
     cons: user must cast the data himself if needed, could get ugly with lots of attributes that would be cleaner in an object.

  2. Casting before
     pros: user can access an object's data over and over again with no real overhead, typically more memory efficient
     cons: maybe casting more data than will be used

  3. Casting after (i.e., the read method for an attribute does the cast)
     pros: only cast what the user needs (initial read is fast)
     cons: takes more memory, repeated access is slow.
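As a rough illustration, the three options might look like this for a single numeric attribute. The class names and the `mass` attribute are hypothetical and are not taken from the Ms::Mascot::Dat code.

```ruby
# Hypothetical classes illustrating the three strategies for one attribute.

# 1. No casting: store and return the raw string.
class UncastQuery
  attr_reader :mass

  def initialize(raw)
    @mass = raw
  end
end

# 2. Casting before: convert once at construction and store the result.
class EagerQuery
  attr_reader :mass

  def initialize(raw)
    @mass = Float(raw)
  end
end

# 3. Casting after: keep the string and convert on every read.
class LazyQuery
  def initialize(raw)
    @raw = raw
  end

  def mass
    Float(@raw)
  end
end

UncastQuery.new("498.27").mass  # => "498.27"
EagerQuery.new("498.27").mass   # => 498.27
LazyQuery.new("498.27").mass    # => 498.27
```

One side effect of the choice: with option 2 a malformed value raises when the object is built, while with option 3 it raises only when the attribute is read.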

There are lots of potential ways to do it. Which is preferred?

Looking at it one way, no casting states that we are modeling the actual file itself. Casting after the fact is along the same lines: the real data we hold is what we have in the file, and we simply provide handy accessors. Casting before suggests that we are modeling what the data truly represents.

Some approaches: One is to leave the data uncast in the objects that model the text file, and provide to_* methods that return the more generic objects.


  query = Dat::Query.new("from string")
  query.ions1   # => "1221.11:3221.2,..."  (A string)
  query.to_spectrum.mzs  # => [1221.11, ...] (an array)
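
A minimal sketch of how such a `to_spectrum` method could work. It assumes the ions string holds comma-separated `mz:intensity` pairs; the `Spectrum` struct and this `Query` class are hypothetical stand-ins, not the real Ms::Mascot::Dat classes.

```ruby
# Hypothetical Spectrum holding parallel arrays of m/z values and intensities.
Spectrum = Struct.new(:mzs, :intensities)

# Hypothetical stand-in for a query object that keeps its data uncast.
class Query
  attr_reader :ions1

  def initialize(ions1)
    @ions1 = ions1
  end

  # Assumes "mz:intensity" pairs separated by commas.
  def to_spectrum
    pairs = ions1.split(',').map { |pair| pair.split(':').map { |v| Float(v) } }
    mzs, intensities = pairs.transpose
    # An empty string yields no pairs, so fall back to empty arrays.
    Spectrum.new(mzs || [], intensities || [])
  end
end

query = Query.new("1221.11:3221.2,1300.5:12.0")
query.ions1                    # => "1221.11:3221.2,1300.5:12.0"
query.to_spectrum.mzs          # => [1221.11, 1300.5]
query.to_spectrum.intensities  # => [3221.2, 12.0]
```

Note that each call to `to_spectrum` in this sketch reparses the string; the caller decides whether to keep the result.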

Another would be to reserve hash-like access for the string data and cast on the method call (after the fact):


  query["ions1"]   # => "1221.11:3221.2,..."  (A string)
  query.ions1      # => A Spectrum object
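
This split between `[]` and reader methods might be sketched as below. The class name, the `mass` key, and the `mz:intensity` pair format are assumptions for illustration, and plain nested arrays stand in for a Spectrum object.

```ruby
# Hypothetical query: [] returns raw strings, reader methods return cast values.
class HashLikeQuery
  def initialize(data)
    @data = data
  end

  # Hash-like access: the string exactly as it was read from the file.
  def [](key)
    @data[key]
  end

  # Method access: cast on each call (no caching in this sketch).
  def mass
    Float(@data["mass"])
  end

  # Returns [[mz, intensity], ...]; assumes comma-separated "mz:intensity" pairs.
  def ions1
    @data["ions1"].split(',').map { |pair| pair.split(':').map { |v| Float(v) } }
  end
end

query = HashLikeQuery.new({ "ions1" => "1221.11:3221.2", "mass" => "498.27" })
query["mass"]   # => "498.27"
query.mass      # => 498.27
query["ions1"]  # => "1221.11:3221.2"
query.ions1     # => [[1221.11, 3221.2]]
```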

Thoughts?

Comments and changes to this ticket

  • bahuvrihi

    bahuvrihi February 13th, 2009 @ 12:04 AM

    It definitely is something to think about.

    My first reaction was 'not to cast' mainly coming from a performance standpoint. Looking at a .dat file, many sections including queries have a lot of numbers that would have to be cast.

    On further reflection though, I think casting is appropriate as long as it's lazy. I think the intention of these classes is to model the data, not the file.

    Maybe a memoize approach? (I think I've read this as the correct term) It's basically your second approach... first you parse and save the string data with as little processing as possible, and then do this:

    
      def parse
        @data['ions'] = "string of ions, unsplit"
        # ...
      end
    
      def ions
        # split.cast is shorthand for "split the string and convert the pieces to numbers"
        @ions ||= @data.delete('ions').split.cast
      end
    

    That way you don't store the info twice, you can access it as a string if you like, and when you declare your intent to use the data as numbers (via a call to ions) you get back an array of numbers. In this system I think it's ok to lose the string data... I think it's probably quite rare to need the string and the numeric data at the same time.
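
A runnable sketch of that memoized reader, under the assumption that the ions string holds comma-separated `mz:intensity` pairs. The `MemoQuery` class is hypothetical and only illustrates the pattern.

```ruby
# Hypothetical query that stores the raw string until the numbers are requested.
class MemoQuery
  def initialize(raw_ions)
    @data = { 'ions' => raw_ions }
  end

  # Raw string access; returns nil once #ions has consumed the string.
  def [](key)
    @data[key]
  end

  # Cast on first call, drop the string, and reuse the cached array afterwards.
  def ions
    @ions ||= @data.delete('ions').split(',').map do |pair|
      pair.split(':').map { |v| Float(v) }
    end
  end
end

query = MemoQuery.new("1221.11:3221.2,1300.5:12.0")
query['ions']  # => "1221.11:3221.2,1300.5:12.0"
query.ions     # => [[1221.11, 3221.2], [1300.5, 12.0]]
query['ions']  # => nil (the string was deleted by the first call to ions)
```

The `||=` idiom only re-evaluates when the cached value is nil or false; here the result is always an array, so the string is parsed at most once.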


Mass Spectrometry Proteomics in Ruby
