Monday, February 1, 2010

Taxonomy twitches

There has been a very interesting thread on the Taxacom list recently. Entitled "Do rogue taxonomists need rogue publishers?," it is an attempt to address what is seen as poor, or even malicious, taxonomy. The initial complaint was that one of these rogue taxonomists has published an inappropriate description in an inappropriate forum. A lengthy discussion ensued.

Everyone agreed that providing names for organisms is inherently very important. Most believe the Codes (the ICZN, here) can, or at least are intended to, provide "standards, sense and stability for animal names in science." Some propose that taxon names should only be validly published by some predefined "whitelist." Others propose that we should just all get along, and describe some heroic transcontinental fish-naming event of grand proportion. Some think anyone should be able to publish anything, and if it fits with variously selected parts of The Code (there were no guidelines for selecting one's own subset of the Code to follow), the rest of us are just stuck with it. Others think it's up to the rest of us to ignore "bad" names, or to publish revisions. Someone tries to divert attention to Apple's latest press conference. Someone suggests a complicated computerized registration process, followed by electronic censorship. All studiously ignore the 800-pound gorilla lurking in the center of the room.

Let's briefly pretend that we're scientists of another discipline - chemists, perhaps. Say you've set some mist nets in your back yard (I confess to not being a chemist, but I assume this is how new chemicals are collected), and something you don't recognize has shown up. You collect it, do some tests, describe its chemical properties, and a species is named. This name is self-evident; another chemist might have acquired a member of the same species anywhere on the planet any time in history and, upon comparison, you would both instantly and unambiguously recognize that you're talking about the same species. The compound's own makeup would have dictated what you both call it, universally and regardless of local names or minor variations. There exists a taxonomy based upon the physical properties of the compound that allows this to happen. Nifty, eh?

That, boys and girls, is science. A chemist might not know what "chumvi" or "salt" is, but you can bet they'll know NaCl. Salt is always and unambiguously salt - no chemist will ever try to convince you that it's properly a subspecies of table sugar, or that kosher salt deserves a promotion while table salt and sea salt are indistinguishable. Any such claims would be testable, and all competent testers would reach the same conclusions. That isn't to say that EVERYONE would reach the same conclusions. Speakers of Swahili will continue to ask you to pass the chumvi, and cooks will continue to promote their favorite brand of sea salt. Individual variation is expected, and people who can appreciate it have ways to deal with it. However, all of these people, with proper training, are capable of recognizing and naming the basic substance as NaCl, regardless of contaminants or structural variants.

At least one effort has attempted to bring this level of quantification to biology, and was met with considerable scorn and resistance. The arguments are endless (and mostly self-centered) as to why this is a fatal mistake, but I now firmly believe it to be the only plausible future for taxonomy, should that discipline wish to provide something truly useful to managers and conservationists. The locals will always talk about "deer mice," and biologists may continue to talk about variations within individuals, populations, and species, but if we are to elevate naming to the level of science we must find testable ways to pigeonhole individuals. The most obvious involve genetic sequencing, which may be determined from any individual and compared to a standard. There are problems to overcome, and the final solution may not be simple, but I do not believe simplicity is a consideration. Perhaps several dependent genetic tests are required - Test 1 tells you this is an animal and dictates Test 2a, which tells you this is a mammal and dictates Test 3m, which tells you this is a deer mouse. I am not suggesting any particular test or set of tests, nor am I advocating any categorical breaks in applying the results. I am merely advocating a quantifiable method by which we can measure similarity and differences between individuals.

Applying these measurements to wild populations, specimens in museums, or collected individuals is another matter entirely. However that happens, it should have a quantifiable basis. Saying "I consider this is a member of that population because the X sequence varies by less than y%" is a very different matter than saying "I consider this is a member of that population because I've arbitrarily chosen to accept one long-dead author's opinion over another" or "some Code requires me to use this arbitrary string, even though nobody else knows what I'm talking about," or "this thing just looks like it's a member of the arbitrary group that I consider, for deeply personal reasons, to be that species."
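To make the "varies by less than y%" statement concrete, here is a minimal sketch of such a measurement. It is only an illustration: the sequences and the thresholds are invented, the function names are hypothetical, the sequences are assumed to be already aligned to equal length, and nothing here is a proposal for any particular marker or cutoff.

```python
def percent_difference(seq_a: str, seq_b: str) -> float:
    """Percent of positions that differ between two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Compare position by position, ignoring letter case.
    mismatches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
    return 100.0 * mismatches / len(seq_a)


def same_group(sample: str, reference: str, threshold_pct: float) -> bool:
    """True when the sample differs from the reference by less than the threshold."""
    return percent_difference(sample, reference) < threshold_pct


# Invented example data: one mismatch in ten positions.
reference = "ACGTACGTAC"
sample = "ACGTACGTAT"

print(percent_difference(sample, reference))               # 10.0
print(same_group(sample, reference, threshold_pct=15.0))   # True
print(same_group(sample, reference, threshold_pct=5.0))    # False
```

Whatever the real measure turns out to be, the point is that two people running it on the same pair of individuals get the same number, and can then argue about the threshold rather than about each other's opinions.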

What shall it be, taxonomists? Continuation of an obviously flawed system (which happens to pay your salary whilst keeping you firmly in your comfort zone), or a serious discussion about progressing the art of naming organisms to a quantitative science?

Friday, October 2, 2009


Complaint: Arctos isn't customizable enough. The screens are mostly fixed, and collections can't change labels at will.

Response: I don't think you've thought the problem all the way through! Two things are at issue here: A concept called normalization, and the issue of biodiversity standards.

Any database that requires extensive customization to support a diversity of collections simply isn't doing one or both of those things.

Normalization allows us to use the data itself as a label. Arctos's Attributes screen, for example:

While no existing standards tell us what to label each field in a database, common terminology is something worth striving for. When there's nothing to follow, we in the Arctos community strive to lead.

Tuesday, May 12, 2009

Arctos v. Specify: A comparison

Feature | Arctos | Specify
Description | Enterprise software, hardware, backups (one in Fairbanks, one in Austin, one in San Diego), professional sysadmin. | Software. User responsible for hardware setup, sysadmin, backups, etc.
Cost | Software freely available to noncommercial enterprises. Hosting, development, and administrative costs are shared and negotiable. | Free software, requires hardware, someone to maintain it, a defensible backup strategy, and network access.
Development Model | Release early, release often. Let the users intimately guide every aspect of progress. Formalized issue tracking, Steering Committee, Advisory Committee. | Release infrequently. May consider user input from the Specify Forum.
Front End | ColdFusion (system works under PHP, Java, et al.) | Java (including business rules)
Data Model | Highly normalized; easily “pluggable” and expandable. 83 tables, 836 columns | Denormalized. 143 tables, 2400 columns (as of 1 May 2009)
Back End | Oracle – an enterprise-class RDBMS known for its concurrency management and stability. | MySQL – a lightweight open-source RDBMS designed for fast query access. Not designed for archival usage. Limited concurrency management.
Business Rules | In DB, where they’re always enforced. | In application layer, where they may be bypassed (by DB updates, add-on applications, or Application bugs)
Permissions | In DB, where they’re always enforced. May be used to define Virtual Private Databases. | In application layer.
Security | Independent layers in application and DB. Professionally managed and audited. | In application layer, determined by system administrator.
Bulk Import | No practical record limit. | 2000-row limit.
Interfaces | Intuitive customizable web applications. | “Roll your own” queries against tables.
Taxonomy | Formal separation of taxonomy and determinations. Accommodates composite taxonomy (hybrids, multiple taxa in one object) through identification formulae. | Determinations treated as taxonomy.
Object Tracking | Individual Specimen Parts are tracked and loaned. | Cataloged Items are tracked and loaned.
Online Access | Integral | Coming soon? Limited to query only - limited data available?
Batch edits | Most data; many access points | None
DiGIR/Tapir | Integral and automatic. Live data served. | Coming soon? Manually maintained cache.
Media | Relate to any “node.” Stored anywhere on Internet, or uploaded to server. (+100K images, 3.8 TB at TACC) | Stored on local filesystem or MorphBank.
System Requirements | Reasonably modern browser and Internet access | “Lowest common denominator”
Publications/Citations | Inherent | Unclear
Living Collections | No apparent obstacles, but untested. | Possible future development if the community wants to develop a separate schema.
Business Model | Short-term NSF and institutional (MVZ, UAM, MCZ, & MSB) support. | Short-term NSF and institutional support.
Data Quality | Defined and enforced by the Arctos community. | Left to individual operators.
Customizations | User-customizable search and results. Collection-specific appearance and CSS. | Operator customizable search and results.
Mapping | BerkeleyMapper, Google Maps, Google Earth, download KML. Uncertainty represented as error circles. | Point mapping via downloadable KML
Saved queries | Save, name, email dynamic queries | Save static results sets. Email to agents with email addresses.
Taxon-specific attributes | User-definable, infinitely expandable determinations. Allows adding any biological collection. | Predefined assertions.
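The Business Rules row is perhaps the easiest one to demonstrate. The sketch below uses Python's built-in sqlite3 module as a stand-in (not Oracle, and not Arctos's actual DDL); the table, the columns, and the rule itself are hypothetical. It shows only the general idea: a rule declared in the database is applied to every client, including one that skips the application entirely and sends raw SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical rule: a part must represent at least one item.
# The rule lives in the table definition, not in any application code.
conn.execute("""
    CREATE TABLE specimen_part (
        part_id   INTEGER PRIMARY KEY,
        part_name TEXT NOT NULL,
        lot_count INTEGER NOT NULL CHECK (lot_count >= 1)
    )
""")

# A valid row is accepted.
conn.execute("INSERT INTO specimen_part (part_name, lot_count) VALUES ('skull', 1)")

# A "direct DB update" that bypasses any front end still hits the rule.
try:
    conn.execute("INSERT INTO specimen_part (part_name, lot_count) VALUES ('skin', 0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM specimen_part").fetchone()[0]
print(rejected, row_count)  # True 1
```

Had the same rule lived only in application code, the second insert would have succeeded for any tool that talks to the database directly.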

Tuesday, January 13, 2009

Specify: Competition?

I expect you've seen that Specify is now cross-platform:

I've been browsing their screenshots & they've got some nice additions. I expect this is our main competitor, and given that it's 'free' to users and so much advertising etc behind it, I expect it to become even more entrenched than it is now.

I don't really consider Specify competition - they support a completely different paradigm.

They are not very dynamic, offer little in the way of user customizations, and aren't very responsive to user needs. They provide software, but you need to maintain your own system - which typically means dicey consumer hardware and craptacular backup strategies. They don't change anything quickly, and seem to have an effective pre-release cycle, so they do release stable software.

Arctos is (potentially and increasingly) centralized, meaning you get Enterprise-caliber hardware and software, system-wide tech support from those of us who wrote it, and planned and tested backup strategies, all for a fraction of the actual cost (and perhaps even much more for much less sometime soon). We'll listen to your ideas, implement them if they're not too wacky, and even get you set up to write code if you want. We follow a release early/release often strategy (which mostly means we're too poor to pay for actual testers, but also means that you're likely to see requested changes very quickly).

Specify is software. Arctos is a system (from which you can use the software if you so desire). To implement Specify and get Arctos-like reliability, you'd need a pair of servers ($5000 each), on-site backup ($2000 + tapes [$50 per day - as many as you can afford, but at least 30]), off-site backup (Amazon's S3 would be one option - something less than $500/month), a firewall, and ideally security folks, systems administration folks (software and hardware support), and database folks. Then you're still left running MySQL as a backend. That's a fabulous little database for things like retrieval speed, but I would NEVER trust it to maintain archival or sensitive data, especially in an online environment. It was not designed for that, doesn't support the tools necessary to do so securely, and will never be comparable to Enterprise-class software like Oracle (they could have at least picked PostgreSQL!). Specify's ability to run on much less does not mean that doing so is a good idea. I think fulfilling our public trust obligations demands more (and, increasingly, so does NSF). Free does not necessarily imply inexpensive when system costs are considered.

Arctos's to-user costs are significantly less than Specify's. We're currently asking for $5000 per participant per year - the cost of one decent Dell server or the ColdFusion license. We've reached a size where we can ask for additional support to further reduce that cost while simultaneously increasing resources. There is a proposal in the pipeline that would do just that, in a fairly dramatic fashion, while also bringing a major museum (and some very smart developers and users there) closer to the core Arctos development team.

We also differ in strategy. Specify maintains a fairly traditional schema, and their tables are increasingly wide (non-normalized). Arctos tends towards leaner normalized tables and more sophisticated coding to make that work. The payoff for us is extensibility - Gordon just spent 3 days in Chicago discussing with various DB folks something we implemented in around 20 minutes. The cost is programming - it's more difficult to write code to normalized structures (one of the original model architects once told me it was impossible - while staring at Arctos, interestingly enough).

Specify's new stratigraphy extension is a pretty good example of the differing development philosophies. They've added 3 wide tables which allows you to record values for a finite and pre-determined set of geological attributes. We've added one very narrow table which allows you to define any number of terms and values, along with who, when, and why. Our model is infinitely flexible, even in the absence of a programmer, and records determinations or opinions. Theirs is simpler to implement, requires code changes to support new types of data, and records assertions.
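For the curious, the "one very narrow table" idea looks roughly like the sketch below. This is not the actual Arctos table: the table name, column names, and every value are invented for illustration, and Python's sqlite3 module stands in for Oracle. The shape is what matters: each row is one determination of one term, with who and when attached.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One narrow table: any term, any value, plus who/when/why.
conn.execute("""
    CREATE TABLE geology_attribute (
        locality_id     INTEGER NOT NULL,
        attribute_type  TEXT NOT NULL,   -- the term, e.g. 'formation'
        attribute_value TEXT NOT NULL,
        determiner      TEXT NOT NULL,   -- who
        determined_date TEXT NOT NULL,   -- when (ISO date)
        remark          TEXT             -- why
    )
""")

rows = [
    (1, "formation", "Example Formation", "A. Curator", "2009-01-05", "field notes"),
    (1, "period", "Cretaceous", "A. Curator", "2009-01-05", None),
    # A second, later opinion on the same term sits beside the first one.
    (1, "period", "Late Cretaceous", "B. Geologist", "2009-01-12", "revised opinion"),
    # A brand-new kind of data needs a new row, not a new column or new code.
    (1, "lithology", "mudstone", "B. Geologist", "2009-01-12", None),
]
conn.executemany("INSERT INTO geology_attribute VALUES (?, ?, ?, ?, ?, ?)", rows)

types = [r[0] for r in conn.execute(
    "SELECT DISTINCT attribute_type FROM geology_attribute ORDER BY attribute_type")]
print(types)  # ['formation', 'lithology', 'period']

period_opinions = conn.execute(
    "SELECT attribute_value, determiner FROM geology_attribute "
    "WHERE attribute_type = 'period' ORDER BY determined_date").fetchall()
print(period_opinions)  # [('Cretaceous', 'A. Curator'), ('Late Cretaceous', 'B. Geologist')]
```

The price, as noted above, is in the querying: turning these rows back into a familiar one-row-per-locality view takes more code than reading a wide table does.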

Some Museum folks aren't much interested in collaborating with peers, tracking non-traditional data (usage and such), participating in community initiatives (MaNIS et al.), having broad exposure for their data, or generally being on the frontlines of information accessibility. Those people will probably never be happy in Arctos, and, assuming they can provide long-term data and hardware security, Specify is probably where they should be. The half-dozen recovering Specify users I personally know, admittedly all very much proactive front-lines folks, have few nice things to say about Specify. In fairness, I probably wouldn't know the people who remain happy with Specify.

It's worth mentioning that Specify and Arctos share a basic model, and Jim Beach was involved in the early development efforts of that model. There's been much divergence, but the core tables still share names and the occasional field.

All that said, I'm not above blatantly stealing good ideas from anyone, and our development strategy nicely equips us to do so. Let me know if you see something you think we need in their screenshot gallery.

Thursday, November 6, 2008

System Requirements

I heard Arctos will only run on X. I don't have that! I have Y, and I LIKE it!!

Alternatively: I don't want to be in a shared environment! I want to build my own system, and Arctos doesn't work like that/the developers will egg my house/the license doesn't allow this.

You've heard lots of things, haven't you? They're mostly wrong. Here's the real deal.

You'll need a database. Oracle will let you re-use all of our DDL code, but we've run Arctos on other things. Oracle will run on Windows, Linux, Solaris, and probably some other stuff.

You'll need a CFML interpreter. ColdFusion is probably required to enjoy all of Arctos' functionality, but the basics should work under several interpreters. CF will run on Linux or Windows, and has been unofficially ported to a few other things.

That is all. The developers aren't likely to talk to you unless you're running the core code under Oracle and ColdFusion in a reasonable environment, but they won't accept your money either.

There are other niceties - Apache is more stable than ColdFusion's built-in webserver, for example. You'll want a reasonable amount of RAM - as in all things, more is better. You'll want some disk space, especially if you'll host Media locally. A decent network connection is nice.

Anything else is simply not required, but do not expect blazing performance or fabulous stability in an untuned or poorly-hosted environment. Your knockoff implementation might not work like you think it should, but that won't be the fault of Arctos or its developers.

Basic specimen data requirements

ZOMG! I've just got a dead rat! I don't know or care what a Project or Loan is! This is too complicated!

Arctos has very minimalistic requirements for entering new data, all of which can be some form of "unknown." You must have an Identification ("unidentifiable" is OK), a Collector ("unknown" is OK), Geography ("no higher geography recorded" is OK), a Locality ("No specific locality recorded" is OK), and an Event ("between January first, 4.5 billion BC to the present" is OK). Everything else is optional. We would encourage Curators and Directors to have higher standards, but we will not encode more extensive requirements.
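As a rough illustration of how little that is, the sketch below fills in the five required pieces for a record that arrives with almost nothing. The field names and the function are hypothetical and this is not Arctos code; the placeholder values are the ones quoted above.

```python
# Placeholder values for the five required pieces of a new record.
REQUIRED_DEFAULTS = {
    "identification": "unidentifiable",
    "collector": "unknown",
    "geography": "no higher geography recorded",
    "locality": "No specific locality recorded",
    "event": "between January first, 4.5 billion BC to the present",
}


def complete_record(partial: dict) -> dict:
    """Return a record with every required field present.

    Values supplied by the user win; blank or missing required fields
    fall back to the "unknown"-style placeholder. Extra fields pass through.
    """
    record = dict(REQUIRED_DEFAULTS)
    for field, value in partial.items():
        if value is not None and str(value).strip():
            record[field] = value
    return record


# The dead rat: all we know is what it is.
rat = complete_record({"identification": "Rattus", "collector": ""})
print(rat["identification"])  # Rattus
print(rat["collector"])       # unknown
print(sorted(rat))            # ['collector', 'event', 'geography', 'identification', 'locality']
```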

Taxonomy and friends

Arctos somehow owns my data! I need to use a particular format for loan numbers/taxon concept/scientific name/font/screen brightness/interpretation of the number two, and Arctos won't let me! This is hijacking! Help!!!

Our place in the community

Arctos is an attempt to empower Curators. Arctos, as a system, concept, or implementation, has neither the capacity nor inclination to change how its users do business. We have our own electrons and are not interested in stealing yours. We understand that different collections and institutions will have differing ideas about how things should be done, and we embrace that. We do attempt to provide the tools to build upon the efforts of others and to share your data and results with others. We believe we are better at supporting the freedom to accurately record your data while supporting data standardization efforts than any other natural history management system.

Arctos is open-source. You are free to take the code, under the terms of our license, and do with it what you will. You are also free, under certain restrictions (you cannot interfere with how others do business, nor severely and negatively impact system performance), to submit your code for incorporation into the Arctos project.

All Data Definition Language code is freely available.

As an Arctos participant, you are entitled to download data or receive a regular copy of the Oracle backup files. We think this is unnecessary, but we will accommodate your needs as we can. Arctos' host, AlasConnect, also hosts Golden Valley Electric Company and most of Fairbanks' electronic medical data. The Department of Homeland Security conducts regular audits. Specimen data is at least as secure as your electric bill, the software that regulates your power, and your latest medical images.

Media stored in the iRODS system is archived at two supercomputer centers.

We firmly believe that your data are as secure as electronic data can be.


Arctos continually attempts to provide access to external taxon authorities while allowing curatorial users - you know, those folks who create taxonomy - the tools and flexibility they need to do their jobs. So, while we may suggest taxon concepts from IPNI, uBio, ITIS, or other resources, we will never limit your choices to just those things.


Identifiers - loan and accession numbers, field numbers, and soon catalog numbers - may be entered in almost any format. We encourage standardization, and can provide additional tools (such as incrementers) for standardized data, but there are very few actual requirements.