gawkinet: WEBGRAB
3.5 WEBGRAB: Extract Links from a Page
======================================
Sometimes it is necessary to extract links from web pages. Browsers do
it, web robots do it, and sometimes even humans do it. Since we have a
tool like GETURL at hand, we can solve this problem with some help from
the Bourne shell:
     BEGIN { RS = "http://[#%&\\+\\-\\./0-9\\:;\\?A-Z_a-z\\~]*" }

     RT != "" {
        command = ("gawk -v Proxy=MyProxy -f geturl.awk " RT \
                   " > doc" NR ".html")
        print command
     }
Notice that the regular expression for URLs is rather crude. A
precise regular expression is much more complex. But this one works
rather well. One problem is that it is unable to find internal links of
an HTML document. Another problem is that 'ftp', 'telnet', 'news',
'mailto', and other kinds of links are missing in the regular
expression. However, it is straightforward to add them, if doing so is
necessary for other tasks.
This program reads an HTML file and prints all the HTTP links that it
finds. It relies on 'gawk''s ability to use regular expressions as
record separators. With 'RS' set to a regular expression that matches
links, the second action is executed each time a non-empty link is
found. We can find the matching link itself in 'RT'.
The action could use the 'system()' function to let another GETURL
retrieve the page, but here we use a different approach. This simple
program prints shell commands that can be piped into 'sh' for execution.
This way it is possible to first extract the links, wrap shell commands
around them, and pipe all the shell commands into a file. After editing
the file, execution of the file retrieves exactly those files that we
really need.  If we do not want to edit the file, we can retrieve all
the pages like this:
     gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk | sh
After this, you will find the contents of all referenced documents in
files named 'doc*.html' even if they do not contain HTML code. The most
annoying thing is that we always have to pass the proxy to GETURL. If
you do not like to see the headers of the web pages appear on the
screen, you can redirect them to '/dev/null'. Watching the headers
appear can be quite interesting, because they reveal details such as
which web server software each site runs.  Now, it is clear how the
clever marketing people use web robots to determine the market shares of
Microsoft and Netscape in the web server market.
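Because the page body goes to standard output (captured into the
'doc*.html' files by the generated commands) while the headers appear
on standard error, a single redirection on the final 'sh' silences all
of them.  A local stand-in, with 'echo' in place of a real retrieval,
shows the effect:

```shell
# Body on stdout, header on stderr; 2> /dev/null discards the header.
sh -c 'echo "page body"; echo "HTTP/1.0 200 OK" 1>&2' 2> /dev/null
# prints only: page body
```

In the full pipeline this amounts to appending '2> /dev/null' to the
trailing 'sh'.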
Port 80 of any web server is like a small hole in a repellent
firewall. After attaching a browser to port 80, we usually catch a
glimpse of the bright side of the server (its home page). With a tool
like GETURL at hand, we are able to discover some of the more concealed
or even "indecent" services (i.e., lacking conformity to standards of
quality). It can be exciting to see the fancy CGI scripts that lie
there, revealing the inner workings of the server, ready to be called:
   * With a command such as:

          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/

     some servers give you a directory listing of the CGI files.
     Knowing the names, you can try to call some of them and watch for
     useful results.  Sometimes there are executables in such
     directories (such as Perl interpreters) that you may call remotely.
     If there are subdirectories with configuration data of the web
     server, this can also be quite interesting to read.

   * The well-known Apache web server usually has its CGI files in the
     directory '/cgi-bin'.  There you can often find the scripts
     'test-cgi' and 'printenv'.  Both tell you some things about the
     current connection and the installation of the web server.  Just
     call:

          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/test-cgi
          gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/printenv

   * Sometimes it is even possible to retrieve system files like the web
     server's log file--possibly containing customer data--or even the
     file '/etc/passwd'.  (We don't recommend this!)
*Caution:* Although this may sound funny or simply irrelevant, we are
talking about severe security holes. Try to explore your own system
this way and make sure that none of the above reveals too much
information about your system.