gawkinet: WEBGRAB
3.5 WEBGRAB: Extract Links from a Page
======================================
Sometimes it is necessary to extract links from web pages. Browsers do
it, web robots do it, and sometimes even humans do it. Since we have a
tool like GETURL at hand, we can solve this problem with some help from
the Bourne shell:
     BEGIN { RS = "http://[#%&\\+\\-\\./0-9\\:;\\?A-Z_a-z\\~]*" }

     RT != "" {
        command = ("gawk -v Proxy=MyProxy -f geturl.awk " RT \
                   " > doc" NR ".html")
        print command
     }
Notice that the regular expression for URLs is rather crude. A
precise regular expression is much more complex. But this one works
rather well. One problem is that it is unable to find internal links of
an HTML document. Another problem is that 'ftp', 'telnet', 'news',
'mailto', and other kinds of links are missing in the regular
expression. However, it is straightforward to add them, if doing so is
necessary for other tasks.
This program reads an HTML file and prints all the HTTP links that it
finds. It relies on 'gawk''s ability to use regular expressions as
record separators. With 'RS' set to a regular expression that matches
links, the second action is executed each time a non-empty link is
found. We can find the matching link itself in 'RT'.
The action could use the 'system()' function to let another GETURL
retrieve the page, but here we use a different approach. This simple
program prints shell commands that can be piped into 'sh' for execution.
This way it is possible to first extract the links, wrap shell commands
around them, and pipe all the shell commands into a file. After editing
the file, execution of the file retrieves exactly those files that we
really need.  If we do not want to edit the file, we can retrieve all
the pages like this:
     gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk | sh
After this, you will find the contents of all referenced documents in
files named 'doc*.html' even if they do not contain HTML code. The most
annoying thing is that we always have to pass the proxy to GETURL. If
you do not like to see the headers of the web pages appear on the
screen, you can redirect them to '/dev/null'. Watching the headers
appear can be quite interesting, because they reveal details such as
which web server software each site runs.  Now, it is clear how the
clever marketing people use web robots to determine the market shares of
Microsoft and Netscape in the web server market.
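Because the page body goes to standard output (captured into the
'doc*.html' files by the generated commands) while the headers appear
on standard error, a single redirection on the final 'sh' silences all
of them.  A local stand-in, with 'echo' in place of a real retrieval,
shows the effect:

```shell
# Body on stdout, header on stderr; 2> /dev/null discards the header.
sh -c 'echo "page body"; echo "HTTP/1.0 200 OK" 1>&2' 2> /dev/null
# prints only: page body
```

In the full pipeline this amounts to appending '2> /dev/null' to the
trailing 'sh'.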
Port 80 of any web server is like a small hole in a repellent
firewall. After attaching a browser to port 80, we usually catch a
glimpse of the bright side of the server (its home page). With a tool
like GETURL at hand, we are able to discover some of the more concealed
or even "indecent" services (i.e., lacking conformity to standards of
quality). It can be exciting to see the fancy CGI scripts that lie
there, revealing the inner workings of the server, ready to be called:
   * With a command such as:

          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/

     some servers give you a directory listing of the CGI files.
     Knowing the names, you can try to call some of them and watch for
     useful results.  Sometimes there are executables in such
     directories (such as Perl interpreters) that you may call remotely.
     If there are subdirectories with configuration data of the web
     server, this can also be quite interesting to read.

   * The well-known Apache web server usually has its CGI files in the
     directory '/cgi-bin'.  There you can often find the scripts
     'test-cgi' and 'printenv'.  Both tell you some things about the
     current connection and the installation of the web server.  Just
     call:

          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/test-cgi
          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/printenv

   * Sometimes it is even possible to retrieve system files like the web
     server's log file--possibly containing customer data--or even the
     file '/etc/passwd'.  (We don't recommend this!)
*Caution:* Although this may sound funny or simply irrelevant, we are
talking about severe security holes. Try to explore your own system
this way and make sure that none of the above reveals too much
information about your system.