gawkinet: PROTBASE

 
 3.10 PROTBASE: Searching Through A Protein Database
 ===================================================
 
      Hoare's Law of Large Problems: Inside every large problem is a
      small problem struggling to get out.
 
    Yahoo's database of stock market data is just one among the many
 large databases on the Internet.  Another one is located at NCBI
 (National Center for Biotechnology Information).  Established in 1988 as
 a national resource for molecular biology information, NCBI creates
 public databases, conducts research in computational biology, develops
 software tools for analyzing genome data, and disseminates biomedical
 information.  In this section, we look at one of NCBI's public services,
 which is called BLAST (Basic Local Alignment Search Tool).
 
    You probably know that the information necessary for reproducing
 living cells is encoded in the genetic material of the cells.  The
 genetic material is a very long chain of four base nucleotides.  It is
 the order of appearance (the sequence) of nucleotides which contains the
 information about the substance to be produced.  Scientists in
 biotechnology often find a specific fragment, determine the nucleotide
 sequence, and need to know where the sequence at hand comes from.  This
 is where the large databases enter the game.  At NCBI, databases store
 the knowledge about which sequences have ever been found and where they
 have been found.  When the scientist sends his sequence to the BLAST
 service, the server looks for regions of genetic material in its
 database which look the most similar to the delivered nucleotide
 sequence.  After a search time of some seconds or minutes the server
 sends an answer to the scientist.  In order to make access simple, NCBI
 chose to offer their database service through popular Internet
 protocols.  There are four basic ways to use the so-called BLAST
 services:
 
    * The easiest way to use BLAST is through the web.  Users may simply
      point their browsers at the NCBI home page and link to the BLAST
      pages.  NCBI provides a stable URL that may be used to perform
      BLAST searches without interactive use of a web browser.  This is
      what we will do later in this section.  A demonstration client and
      a 'README' file demonstrate how to access this URL.
 
    * Currently, 'blastcl3' is the standard network BLAST client.  You
      can download 'blastcl3' from the anonymous FTP location.
 
    * BLAST 2.0 can be run locally as a full executable and can be used
      to run BLAST searches against private local databases, or
      downloaded copies of the NCBI databases.  BLAST 2.0 executables may
      be found on the NCBI anonymous FTP server.
 
    * The NCBI BLAST Email server is the best option for people without
      convenient access to the web.  A similarity search can be performed
      by sending a properly formatted mail message containing the
      nucleotide or protein query sequence to <blast@ncbi.nlm.nih.gov>.
      The query sequence is compared against the specified database using
      the BLAST algorithm and the results are returned in an email
      message.  For more information on formulating email BLAST searches,
      you can send a message consisting of the word "HELP" to the same
      address, <blast@ncbi.nlm.nih.gov>.
 
    Our starting point is the demonstration client mentioned in the first
 option.  The 'README' file that comes along with the client explains the
 whole process in a nutshell.  In the rest of this section, we first show
 what such requests look like.  Then we show how to use 'gawk' to
 implement a client in about 10 lines of code.  Finally, we show how to
 interpret the result returned from the service.
 
    Sequences are expected to be represented in the standard IUB/IUPAC
 amino acid and nucleic acid codes, with these exceptions: lower-case
 letters are accepted and are mapped into upper-case; a single hyphen or
 dash can be used to represent a gap of indeterminate length; and in
 amino acid sequences, 'U' and '*' are acceptable letters (see below).
 Before submitting a request, any numerical digits in the query sequence
 should either be removed or replaced by appropriate letter codes (e.g.,
 'N' for unknown nucleic acid residue or 'X' for unknown amino acid
 residue).  The nucleic acid codes supported are:
 
      A --> adenosine               M --> A C (amino)
      C --> cytidine                S --> G C (strong)
      G --> guanine                 W --> A T (weak)
      T --> thymidine               B --> G T C
      U --> uridine                 D --> G A T
      R --> G A (purine)            H --> A C T
      Y --> T C (pyrimidine)        V --> G C A
      K --> G T (keto)              N --> A G C T (any)
                                    -  gap of indeterminate length
 
    Now you know the alphabet of nucleotide sequences.  The last two
 lines of the following example query show you such a sequence, which is
 obviously made up only of elements of the alphabet just described.
 Store this example query into a file named 'protbase.request'.  You are
 now ready to send it to the server with the demonstration client.
 
      PROGRAM blastn
      DATALIB month
      EXPECT 0.75
      BEGIN
      >GAWK310 the gawking gene GNU AWK
      tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
      caccaccatggacagcaaa
 
    The actual search request begins with the mandatory parameter
 'PROGRAM' in the first column followed by the value 'blastn' (the name
 of the program) for searching nucleic acids.  The next line contains the
 mandatory search parameter 'DATALIB' with the value 'month' for the
 newest nucleic acid sequences.  The third line contains an optional
 'EXPECT' parameter and the value desired for it.  The fourth line
 contains the mandatory 'BEGIN' directive, followed by the query sequence
 in FASTA/Pearson format.  Each line of information must be less than 80
 characters in length.
 
    The "month" database contains all new or revised sequences released
 in the last 30 days and is useful for searching against new sequences.
 There are five different blast programs, 'blastn' being the one that
 compares a nucleotide query sequence against a nucleotide sequence
 database.
 
    The last server directive that must appear in every request is the
 'BEGIN' directive.  The query sequence should immediately follow the
 'BEGIN' directive and must appear in FASTA/Pearson format.  A sequence
 in FASTA/Pearson format begins with a single-line description.  The
 description line, which is required, is distinguished from the lines of
 sequence data that follow it by having a greater-than ('>') symbol in
 the first column.  For the purposes of the BLAST server, the text of the
 description is arbitrary.
 
    If you prefer to use a client written in 'gawk', just store the
 following 10 lines of code into a file named 'protbase.awk' and use this
 client instead.  Invoke it with 'gawk -f protbase.awk protbase.request'.
 Then wait a minute and watch the result coming in.  In order to
 replicate the demonstration client's behavior as closely as possible,
 this client does not use a proxy server.  We could also have extended
 the client program in SeeRetrieving Web Pages GETURL, to implement
 the client request from 'protbase.awk' as a special case.
 
      { request = request "\n" $0 }
 
      END {
        BLASTService     = "/inet/tcp/0/www.ncbi.nlm.nih.gov/80"
        printf "POST /cgi-bin/BLAST/nph-blast_report HTTP/1.0\n" |& BLASTService
        printf "Content-Length: " length(request) "\n\n"         |& BLASTService
        printf request                                           |& BLASTService
        while ((BLASTService |& getline) > 0)
            print $0
        close(BLASTService)
      }
 
    The demonstration client from NCBI is 214 lines long (written in C)
 and it is not immediately obvious what it does.  Our client is so short
 that it _is_ obvious what it does.  First it loops over all lines of the
 query and stores the whole query into a variable.  Then the script
 establishes an Internet connection to the NCBI server and transmits the
 query by framing it with a proper HTTP request.  Finally it receives and
 prints the complete result coming from the server.
 
    Now, let us look at the result.  It begins with an HTTP header, which
 you can ignore.  Then there are some comments about the query having
 been filtered to avoid spuriously high scores.  After this, there is a
 reference to the paper that describes the software being used for
 searching the data base.  After a repetition of the original query's
 description we find the list of significant alignments:
 
      Sequences producing significant alignments:                        (bits)  Value
 
      gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733...    38  0.20
      gb|AC021056.12|AC021056 Homo sapiens chromosome 3 clone RP11-115...    38  0.20
      emb|AL160278.10|AL160278 Homo sapiens chromosome 9 clone RP11-57...    38  0.20
      emb|AL391139.11|AL391139 Homo sapiens chromosome X clone RP11-35...    38  0.20
      emb|AL365192.6|AL365192 Homo sapiens chromosome 6 clone RP3-421H...    38  0.20
      emb|AL138812.9|AL138812 Homo sapiens chromosome 11 clone RP1-276...    38  0.20
      gb|AC073881.3|AC073881 Homo sapiens chromosome 15 clone CTD-2169...    38  0.20
 
    This means that the query sequence was found in seven human
 chromosomes.  But the value 0.20 (20%) means that the probability of an
 accidental match is rather high (20%) in all cases and should be taken
 into account.  You may wonder what the first column means.  It is a key
 to the specific database in which this occurrence was found.  The unique
 sequence identifiers reported in the search results can be used as
 sequence retrieval keys via the NCBI server.  The syntax of sequence
 header lines used by the NCBI BLAST server depends on the database from
 which each sequence was obtained.  The table below lists the identifiers
 for the databases from which the sequences were derived.
 
      Database Name                     Identifier Syntax
      ============================      ========================
      GenBank                           gb|accession|locus
      EMBL Data Library                 emb|accession|locus
      DDBJ, DNA Database of Japan       dbj|accession|locus
      NBRF PIR                          pir||entry
      Protein Research Foundation       prf||name
      SWISS-PROT                        sp|accession|entry name
      Brookhaven Protein Data Bank      pdb|entry|chain
      Kabat's Sequences of Immuno...    gnl|kabat|identifier
      Patents                           pat|country|number
      GenInfo Backbone Id               bbs|number
 
    For example, an identifier might be 'gb|AC021182.14|AC021182', where
 the 'gb' tag indicates that the identifier refers to a GenBank sequence,
 'AC021182.14' is its GenBank ACCESSION, and 'AC021182' is the GenBank
 LOCUS. The identifier contains no spaces, so that a space indicates the
 end of the identifier.
 
    Let us continue in the result listing.  Each of the seven alignments
 mentioned above is subsequently described in detail.  We will have a
 closer look at the first of them.
 
      >gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733N23, WORKING DRAFT SEQUENCE, 4
                   unordered pieces
                Length = 176383
 
       Score = 38.2 bits (19), Expect = 0.20
       Identities = 19/19 (100%)
       Strand = Plus / Plus
 
      Query: 35    tggtgaagtgtgtttcttg 53
                   |||||||||||||||||||
      Sbjct: 69786 tggtgaagtgtgtttcttg 69804
 
    This alignment was located on the human chromosome 7.  The fragment
 on which part of the query was found had a total length of 176383.  Only
 19 of the nucleotides matched and the matching sequence ran from
 character 35 to 53 in the query sequence and from 69786 to 69804 in the
 fragment on chromosome 7.  If you are still reading at this point, you
 are probably interested in finding out more about Computational Biology
 and you might appreciate the following hints.
 
   1. There is a book called 'Introduction to Computational Biology' by
      Michael S. Waterman, which is worth reading if you are seriously
      interested.  You can find a good book review on the Internet.
 
   2. While Waterman's book can explain to you the algorithms employed
      internally in the database search engines, most practitioners
      prefer to approach the subject differently.  The applied side of
      Computational Biology is called Bioinformatics, and emphasizes the
      tools available for day-to-day work as well as how to actually
      _use_ them.  One of the very few affordable books on Bioinformatics
      is 'Developing Bioinformatics Computer Skills'.
 
   3. The sequences _gawk_ and _gnuawk_ are in widespread use in the
      genetic material of virtually every earthly living being.  Let us
      take this as a clear indication that the divine creator has
      intended 'gawk' to prevail over other scripting languages such as
      'perl', 'tcl', or 'python' which are not even proper sequences.
      (:-)