gawkinet: GETURL

 
 3.2 GETURL: Retrieving Web Pages
 ================================
 
 GETURL is a versatile building block for shell scripts that need to
 retrieve files from the Internet.  It takes a web address as a
 command-line parameter and tries to retrieve the contents of this
 address.  The contents are printed to standard output, while the header
 is printed to '/dev/stderr'.  A surrounding shell script could analyze
 the contents and extract the text or the links.  An ASCII browser could
 be written around GETURL.  More interestingly, web robots are
 straightforward to write on top of it.  Several programs of the same
 name that do the same job can be found on the Internet; they are usually
 much more complex internally and at least 10 times longer.
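 
    As a rough illustration of such a surrounding script, the following
 shell function reads a page from standard input and prints the value of
 every double-quoted 'href' attribute, one per line.  The name
 'extract_links' is made up for this sketch, and the matching is
 deliberately naive: it is not an HTML parser, and it ignores
 single-quoted and unquoted attributes.
 
```shell
# extract_links: hypothetical helper, not part of GETURL.
# Reads HTML on standard input and prints every double-quoted
# href value, one per line.
extract_links() {
  awk '
  {
    line = $0
    # Repeatedly find the next href="..." in the rest of the line.
    while (match(line, /[Hh][Rr][Ee][Ff]="[^"]*"/)) {
      attr = substr(line, RSTART, RLENGTH)
      # Drop the leading href=" (6 characters) and the closing quote.
      print substr(attr, 7, length(attr) - 7)
      line = substr(line, RSTART + RLENGTH)
    }
  }'
}
```
 
    Assuming that GETURL is stored in a file named 'geturl.awk' (a
 hypothetical name) and that 'MyProxy' is a reachable proxy, the function
 might be used like this:
 
      gawk -v Proxy=MyProxy -f geturl.awk http://www.example.com/ \
        2>/dev/null | extract_links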
 
    First, GETURL checks whether it was called with exactly one web
 address.  Then it checks whether the user chose a proxy server, whose
 name is handed over in the variable 'Proxy' (and whose port may be given
 in 'ProxyPort').  By default, the local machine is assumed to serve as
 proxy on port 80.  GETURL uses the 'GET' method by default to access the
 web page.  Setting the variable 'Method' to the name of a different
 method (such as 'HEAD') selects a different behavior.  With the 'HEAD'
 method, the user does not receive the body of the page, only the
 header:
 
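      BEGIN {
        # Print a usage message unless exactly one URL was given
        if (ARGC != 2) {
          print "GETURL - retrieve Web page via HTTP 1.0"
          print "IN:\n    the URL as a command-line parameter"
          print "PARAM(S):\n    -v Proxy=MyProxy"
          print "OUT:\n    the page content on stdout"
          print "    the page header on stderr"
          print "JK 16.05.1997"
          print "ADR 13.08.2000"
          exit
        }
        # Take the URL and clear ARGV[1] so it is not treated as a file
        URL = ARGV[1]; ARGV[1] = ""
        # Defaults; each can be overridden on the command line with -v
        if (Proxy     == "")  Proxy     = "127.0.0.1"
        if (ProxyPort ==  0)  ProxyPort = 80
        if (Method    == "")  Method    = "GET"
        # Special file name for a TCP connection to the proxy
        HttpService = "/inet/tcp/0/" Proxy "/" ProxyPort
        # An empty line ends the request and separates header from body
        ORS = RS = "\r\n\r\n"
        print Method " " URL " HTTP/1.0" |& HttpService
        # The first record of the reply is the header
        HttpService                      |& getline Header
        print Header > "/dev/stderr"
        # All remaining records make up the body
        while ((HttpService |& getline) > 0)
          printf "%s", $0
        close(HttpService)
      }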
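 
    To see what GETURL sends over the connection, the request can be
 printed to standard output instead.  The following sketch repeats only
 the two statements of the program that build the request; the function
 name 'show_request' and its argument order are assumptions of this
 example.
 
```shell
# show_request: hypothetical helper that prints the request GETURL
# would send, without opening a network connection.
#   $1 = URL
#   $2 = method (optional; defaults to GET, as in GETURL)
show_request() {
  awk -v URL="$1" -v Method="${2:-GET}" '
  BEGIN {
    # Same output record separator as in GETURL: the request line
    # is followed by an empty line, which ends the request.
    ORS = "\r\n\r\n"
    print Method " " URL " HTTP/1.0"
  }'
}
```
 
    Piping the output through 'od -c' makes the two trailing carriage
 return and line feed pairs visible.  The real program is switched to
 another method with '-v', in the same way as the proxy is chosen.
 Assuming it is stored in a file named 'geturl.awk' (a hypothetical
 name):
 
      gawk -v Method=HEAD -f geturl.awk http://www.example.com/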
 
    This program can be changed as needed, but be careful with the last
 lines.  Make sure that the transmission of binary data is not corrupted
 by additional line breaks.  Even as it is now, the byte sequence
 '"\r\n\r\n"' disappears if it occurs in binary data: it matches 'RS',
 so it is consumed as a record separator, and 'printf "%s", $0' does not
 write it back.  Don't get caught in a trap when trying a quick fix on
 this one.
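 
    One direction that may be worth exploring for the separator problem
 is gawk's 'RT' variable, which holds the input text that matched 'RS'
 for the current record.  The following sketch is an offline stand-in
 for the final loop: it reads standard input rather than a socket, and
 the function name 'copy_records' is made up.  It only shows that
 records and their separators can be written back unchanged; it has not
 been checked against real binary transfers.
 
```shell
# copy_records: hypothetical offline stand-in for the final loop of
# GETURL.  Requires gawk, because RT is a gawk extension.
copy_records() {
  gawk '
  BEGIN { RS = "\r\n\r\n" }
  # RT holds the text that matched RS for this record.  It is empty
  # for a last record that has no trailing separator, so nothing
  # is added at the end of the data.
  { printf "%s%s", $0, RT }'
}
```
 
    In GETURL itself, the corresponding change would presumably be to
 print 'RT' after '$0' in the final loop.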