 3.4 URLCHK: Look for Changed Web Pages
 ======================================
 
 Most people who make heavy use of Internet resources have a large
 bookmark file with pointers to interesting web sites.  It is impossible
 to regularly check by hand if any of these sites have changed.  A
 program is needed to automatically look at the headers of web pages and
 tell which ones have changed.  URLCHK does the comparison after using
 GETURL with the 'HEAD' method to retrieve the header.
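
    The header of a web page consists of a few lines of text that
 describe the page without containing the page itself.  With made-up
 values, the header that GETURL retrieves with the 'HEAD' method might
 look roughly like this (real servers send more or fewer lines); the
 line URLCHK cares about is 'Content-Length':

      HTTP/1.1 200 OK
      Date: Mon, 02 Mar 1998 09:45:00 GMT
      Server: Apache
      Last-Modified: Sun, 01 Mar 1998 18:30:00 GMT
      Content-Length: 32768
      Content-Type: text/html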
 
    Like GETURL, this program first checks that it is called with exactly
 one command-line parameter.  URLCHK also takes the same command-line
 variables 'Proxy' and 'ProxyPort' as GETURL, because these variables are
 handed over to GETURL for each URL that gets checked.  The one and only
 parameter is the name of a file that contains one line for each URL. The
 first column holds the URL, and the second and third columns hold the
 length of the URL's body from the last two times it was checked (a small
 sample file follows the plan below).  Now, we follow this plan:
 
   1. Read the URLs from the file and remember their most recent lengths
 
   2. Delete the contents of the file
 
   3. For each URL, check its new length and write it into the file
 
   4. If the most recent and the new length differ, tell the user
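
    A URL file of this kind, with made-up addresses and lengths, might
 look like this; the second line's differing columns show a page whose
 length changed between the two most recent runs:

      http://www.example.com/index.html 32768 32768
      http://www.example.com/news.html 8192 8355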
 
    It may seem a bit peculiar to read the URLs from a file together with
 their two most recent lengths, but this approach has several advantages.
 You can call the program again and again with the same file.  After
 running the program, you can regenerate the changed URLs by extracting
 those lines that differ in their second and third columns:
 
      BEGIN {
        if (ARGC != 2) {
          print "URLCHK - check if URLs have changed"
          print "IN:\n    the file with URLs as a command-line parameter"
          print "    file contains URL, old length, new length"
          print "PARAMS:\n    -v Proxy=MyProxy -v ProxyPort=8080"
          print "OUT:\n    same as file with URLs"
          print "JK 02.03.1998"
          exit
        }
        URLfile = ARGV[1]; ARGV[1] = ""
        if (Proxy     != "") Proxy     = " -v Proxy="     Proxy
        if (ProxyPort != "") ProxyPort = " -v ProxyPort=" ProxyPort
        while ((getline < URLfile) > 0)
           Length[$1] = $3 + 0
        close(URLfile)      # now, URLfile is read in and can be updated
        GetHeader = "gawk " Proxy ProxyPort " -v Method=\"HEAD\" -f geturl.awk "
        for (i in Length) {
          GetThisHeader = GetHeader i " 2>&1"
          while ((GetThisHeader | getline) > 0)
            if (toupper($0) ~ /CONTENT-LENGTH/) NewLength = $2 + 0
          close(GetThisHeader)
          print i, Length[i], NewLength > URLfile
          if (Length[i] != NewLength)  # report only changed URLs
            print i, Length[i], NewLength
        }
        close(URLfile)
      }
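
    The extraction mentioned above is a one-liner.  Assuming the file of
 URLs is called 'urls', the changed URLs are exactly those lines whose
 second and third columns differ:

      gawk '$2 != $3' urls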
 
    Another thing that may look strange is the way GETURL is called.
 Before calling GETURL, we have to check if the proxy variables need to
 be passed on.  If so, we prepare strings that will become part of the
 command line later.  In the variable 'GetHeader', we store these
 strings together with the longest part of the command line.  Later, in
 the loop over the URLs, the URL and a redirection operator are appended
 to 'GetHeader' to form the command that reads the URL's header over the
 Internet.  GETURL always writes the headers to '/dev/stderr'.  That is
 why the command ends with the redirection operator '2>&1': it sends the
 standard error stream into the pipe, so the header lines can be read
 with 'getline'.
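
    To make this concrete: with the made-up values 'MyProxy' for the
 proxy, 8080 for the port, and 'http://www.example.com/' as the URL,
 the command stored in 'GetThisHeader' reads roughly as follows (shown
 here wrapped over two lines):

      gawk -v Proxy=MyProxy -v ProxyPort=8080 -v Method="HEAD" \
           -f geturl.awk http://www.example.com/ 2>&1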
 
    This program is not perfect because it assumes that a change in a
 page's contents also changes the length reported in its header, which
 is not necessarily true.  A more advanced approach is to look at some
 other header line that holds time information, such as
 'Last-Modified'.  But, as always when things get a bit more
 complicated, this is left as an exercise to the reader; a small
 starting point follows.
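
    As a rough sketch of that idea (not a finished solution), the inner
 loop of the program above could also remember the time stamp from the
 'Last-Modified' header, assuming the server sends one; the variable
 'NewStamp' is introduced here only for illustration:

      while ((GetThisHeader | getline) > 0) {
        if (toupper($0) ~ /CONTENT-LENGTH/) NewLength = $2 + 0
        if (toupper($0) ~ /LAST-MODIFIED/)
          NewStamp = substr($0, index($0, ":") + 2)  # text after the colon
      }

    Comparing 'NewStamp' with a stored time stamp, just as the lengths
 are compared above, then reveals the changed pages.  Keeping the time
 stamps in the URL file takes a little more care, though, because they
 contain spaces.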