ht://Check 1.1 README
Website: http://htcheck.sourceforge.net/
Copyright (c) 1999-2002 Comune di Prato - Prato - Italy
Some Portions Copyright (c) 1995-2002 The ht://Dig Group <www.htdig.org>
Author: Gabriele Bartolini - Prato - Italy <angusgb@users.sourceforge.net>
$Id: README,v 1.12 2002/02/10 17:43:16 angusgb Exp $

ht://Check is distributed under the GNU General Public License (GPL).
See the COPYING file for license information.

===========================================================================

"Dear ht://Check user,

   ht://Check is an open-source project and it comes for free. The only 'fee' I
kindly ask you to 'pay' is to spend two minutes of your precious time filling in the
form on the 'Uses' page, providing the name of your organisation, its URL and the
country you reside in. It is a way of letting me know how many people in the world
find this utility useful! So please point your browser
to http://htcheck.sourceforge.net/?a=uses ! Thank you."

                                                      -Gabriele Bartolini

===========================================================================


ht://Check is more than a link checker. It is a console application for
GNU/Linux systems, written in C++ and derived from the best free search engine
available on the Internet: ht://Dig.

It can retrieve information through HTTP/1.1 and store it in a MySQL database,
and it is particularly suitable for small Internet domains or intranets.

Its purpose is to help a Webmaster manage one or more related sites: after a "crawl",
ht://Check creates a powerful data source made up of information about the retrieved
documents. The kinds of information available to the ht://Check user include:

- single-document attributes, such as content-type, size, last modification time, etc.;
- information regarding the retrieval process of a resource, for instance whether
   the resource was successfully retrieved or not, showing the various results (the
   so-called HTTP status codes, as ht://Check uses this protocol for crawling the Web);
- information regarding the structure of a document, basically its HTML link tags and
   the relationships they establish, seen as a whole: ht://Check crawls a Web domain
   or set (in the algebraic sense), within which links create inter-document
   relationships. This allows the user to obtain further information from the
   domain, namely:
   - link results: whether a link is working, broken or redirected; at present it
      also detects whether a link is an anchor that does not work, a JavaScript
      reference or an e-mail address;
   - the relationships between documents, in terms of incoming and outgoing links;
      in the future, particular attention will be given in development to Web
      structure mining.


A brief report is printed by the htcheck program itself; at present, however, most
of the information is provided by the PHP interface that comes with the package,
which queries the database built by the htcheck program during a previous crawl.
It goes without saying that you need a Web server to use it, as well as PHP with
the MySQL connectivity module.

Incidentally, since a crawl leaves ht://Check's results in a database on a MySQL
server, any user could in theory build their own information-retrieval interface
on top of it; you only need to know its structure: its tables and fields, and the
relationships among them. Other options are standalone scripts written in common
scripting languages with MySQL connectivity modules (e.g. Perl and Python), faster
programs written in C or C++ using the MySQL API or wrapper libraries (such as
MySQL++ or dbconnect), or other Web-driven solutions such as JSP or ColdFusion.
An interface to ht://Check for the Roxen Web server has been written by Michael
Stenitzer (stenitzer@eva.ac.at).
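As a minimal sketch of such a custom interface, the script below lists broken
links together with their source and destination URLs. The real database lives on
a MySQL server; sqlite3 is used here purely as a stand-in so the example is
self-contained, and only a subset of the 'Link' and 'Url' columns (documented
later in this README) is reproduced, with invented sample rows.

```python
# Hedged sketch: sqlite3 stands in for the real MySQL server, and the
# schema below is a simplified subset of ht://Check's 'Url' and 'Link'
# tables.  Sample rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE Url (IDUrl INTEGER PRIMARY KEY, Url TEXT)")
cur.execute("""CREATE TABLE Link (
    IDUrlSrc INTEGER, IDUrlDest INTEGER,
    TagPosition INTEGER, AttrPosition INTEGER,
    LinkResult TEXT,
    PRIMARY KEY (IDUrlSrc, IDUrlDest, TagPosition, AttrPosition))""")

cur.execute("INSERT INTO Url VALUES (1, 'http://www.foo.com/')")
cur.execute("INSERT INTO Url VALUES (2, 'http://www.foo.com/missing.html')")
cur.execute("INSERT INTO Link VALUES (1, 2, 2, 1, 'Broken')")

# List every broken link with its source and destination URLs.
cur.execute("""
    SELECT src.Url, dst.Url
    FROM Link
    JOIN Url AS src ON src.IDUrl = Link.IDUrlSrc
    JOIN Url AS dst ON dst.IDUrl = Link.IDUrlDest
    WHERE Link.LinkResult = 'Broken'""")
report = cur.fetchall()
for source, destination in report:
    print(f"broken: {source} -> {destination}")
```

Against a real installation the same SELECT statement works unchanged; only the
connection setup differs.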


ht://Check - FEATURES
=====================

ht://Check is made up of two logical parts: a "spider" which starts checking URLs
from a specific one or from a list of them; and an "analyser" which takes the
results of the first part and shows summaries (this part can be done via console
or by using the PHP interface through a web server).

The "Spider" or "Crawler"
-------------------------

- HTTP/1.1 compliant with persistent connections and cookies support
- HTTP Basic authentication supported
- HTTP Proxy support
- Crawls customisable through many configuration attributes, which let the user
limit the dig by URL pattern matching and by distance ("hops") from the first URL
- MySQL databases directly created by the spider
- MySQL connections through user or general option files as defined by the
database system (/etc/my.cnf or ~/.my.cnf)

No support for JavaScript, nor for other protocols such as HTTPS, FTP, NNTP, or local files.
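For reference, MySQL option files use the standard INI-like format read by MySQL
client programs; a minimal ~/.my.cnf might look like the following (the values
here are illustrative placeholders, not settings mandated by ht://Check):

```
[client]
host     = localhost
user     = htcheck
password = secret
```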

The "Analyser"
--------------

A preface first: since all of the data gathered during a crawl are stored in a
MySQL database, it is quite easy to obtain the information you want by querying
that database. The spider, in any case, is part of the 'htcheck' application,
which prints a small text report at the end of the crawl. Later on you can always
retrieve information from the database by building your own interface (in PHP or
Perl, for instance) or by simply using the default one written in PHP.

'htcheck' (the console application) gives you a summary of broken links, broken
anchors, servers seen and content-types encountered.

The PHP interface lets you perform:
- Queries regarding URLs, with many discriminating criteria to choose from, such
as pattern matching, status code, content-type and size.
- Queries regarding links, with pattern matching on both source and destination
URLs (including regular expressions), their results (broken, ok, anchor not found,
redirected) and their type (normal, direct, redirected).
- Info regarding specific URLs (outgoing and incoming links, last modification
datetime, etc.).
- Info regarding specific links (broken or ok) and the HTML instruction that
issued them.
- Statistics on documents retrieved
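The same kinds of queries the PHP interface performs can also be issued directly
in SQL. A sketch follows, using sqlite3 purely as a self-contained stand-in for
the real MySQL database and a simplified subset of the 'Url' table documented
below; all rows are invented sample data.

```python
# Hedged sketch: find URLs matching a pattern that came back with HTTP
# status 404, mirroring the PHP interface's URL queries.  sqlite3 is a
# stand-in for MySQL; the table is a subset of ht://Check's 'Url' table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE Url (
    IDUrl INTEGER PRIMARY KEY, Url TEXT,
    ContentType TEXT, StatusCode INTEGER, Size INTEGER)""")
cur.executemany("INSERT INTO Url VALUES (?, ?, ?, ?, ?)", [
    (1, 'http://www.foo.com/',          'text/html', 200, 1024),
    (2, 'http://www.foo.com/gone.html', 'text/html', 404, 0),
    (3, 'http://www.foo.com/logo.png',  'image/png', 200, 2048)])

# Pattern matching plus a status-code criterion.
cur.execute("""SELECT Url FROM Url
               WHERE Url LIKE '%.html' AND StatusCode = 404""")
missing = [row[0] for row in cur.fetchall()]
print(missing)
```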



ht://Check - DATABASE TABLES EXPLANATION
========================================

1) Link
-------

This table contains info on all the links that ht://Check found during the
crawl. Each link is identified by 4 fields, which make up the primary key:
the source URL, the destination URL, the tag position in the document and
the attribute position in the tag definition.

For example, let's suppose the first URL visited is http://www.foo.com/ and
that it looks as shown below (so simple, I hope you never write a page this way):
<HTML>
<A href="http://htcheck.sourceforge.net/">
</HTML>

So, http://www.foo.com/ is identified as URL number 1 (IDUrl=1) whereas
http://htcheck.sourceforge.net/ has IDUrl=2.

The A tag has position number 2 in the document, and the attribute which
creates a link is the 'href' with position 1 in the tag.

So our link record's primary key is:
IDUrlSrc=1, IDUrlDest=2, TagPosition=2, AttrPosition=1.

Sometimes, when referencing a URL, we use so-called anchors, specified after
a '#' in the URL field. If an anchor has been set, the Anchor field of the
table contains its value.

The most interesting fields of the table are LinkType and LinkResult, which are
both enumeration fields. The LinkResult field is set only at the end of the
crawl, after all the URLs have been retrieved.

The LinkType field can take these values:
- 'Normal' (a normal link, like the 'href' ones: you have to click on it
  before accessing it);
- 'Direct' (a direct link is one usually downloaded automatically by agents,
  such as images referenced by the <IMG src> HTML tag);
- 'Redirection': this is a special case, an unusual link, because it is an
  instance of an HTTP redirection performed by the server (3xx status
  codes). Records of this kind therefore do not have a TagPosition or an
  AttrPosition properly set (obviously, no HTML statement issues such a link).
 
The LinkResult field can take these values:
- 'NotChecked': the default value, set when each Link record is created;
  this field can only be set properly at the end of the crawl loop;
- 'NotRetrieved': the destination URL of the link has not been retrieved;
- 'OK': the link works perfectly, and the anchor, if present, works fine
  (only if the document has actually been retrieved, and not merely checked
  for existence);
- 'Broken': the link is broken; the destination URL has not been found;
- 'Redirected': the destination URL has been redirected by the HTTP server;
- 'AnchorNotFound': the destination URL has been found and parsed, but
  the link's anchor does not exist in it;
- 'NotAuthorized': you must have access rights for this URL, that is to say
  a valid user name and password for authentication (see the 'authorization'
  attribute);
- 'EMail': the link is an e-mail address reference;
- 'Javascript': the link is a JavaScript reference.
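A quick way to summarise a crawl is to group the Link table by LinkResult. A
hedged sketch follows (sqlite3 stands in for MySQL, the sample rows are invented,
and in the real table LinkResult is a MySQL enum rather than plain text):

```python
# Count links per result state using the four-field primary key
# described above (source URL, destination URL, tag position, attribute
# position).  Simplified subset of ht://Check's 'Link' table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE Link (
    IDUrlSrc INTEGER, IDUrlDest INTEGER,
    TagPosition INTEGER, AttrPosition INTEGER,
    LinkResult TEXT DEFAULT 'NotChecked',
    PRIMARY KEY (IDUrlSrc, IDUrlDest, TagPosition, AttrPosition))""")
cur.executemany("INSERT INTO Link VALUES (?, ?, ?, ?, ?)", [
    (1, 2, 2, 1, 'OK'),
    (1, 3, 5, 1, 'Broken'),
    (2, 4, 3, 1, 'Redirected'),
    (2, 5, 7, 1, 'Broken')])

# How many links ended up in each result state?
cur.execute("""SELECT LinkResult, COUNT(*) FROM Link
               GROUP BY LinkResult ORDER BY LinkResult""")
summary = dict(cur.fetchall())
print(summary)   # {'Broken': 2, 'OK': 1, 'Redirected': 1}
```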



The tables as dumped by the 'mysqldump' program
===============================================

Here follows the structure of the tables of a typical ht://Check database,
as created by the 'mysqldump' program. Please refer to the MySQL
documentation for further information. And if you have any useful advice
or suggestions to give me regarding the database (or, of course, anything
else), please drop me an e-mail! :-)

#
# Table structure for table 'Cookies'
#
CREATE TABLE Cookies (
  IDCookie mediumint(8) unsigned DEFAULT '0' NOT NULL,
  Name varchar(255) DEFAULT '' NOT NULL,
  Value text NOT NULL,
  Path varchar(255) DEFAULT '' NOT NULL,
  Domain varchar(255) DEFAULT '' NOT NULL,
  Expires datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
  Secure tinyint(4) DEFAULT '0' NOT NULL,
  PRIMARY KEY (IDCookie)
);

#
# Table structure for table 'HtmlAttribute'
#
CREATE TABLE HtmlAttribute (
  IDUrl mediumint(8) unsigned DEFAULT '0' NOT NULL,
  TagPosition smallint(5) unsigned DEFAULT '0' NOT NULL,
  AttrPosition tinyint(3) unsigned DEFAULT '0' NOT NULL,
  Attribute varchar(32) DEFAULT '' NOT NULL,
  Content varchar(255),
  PRIMARY KEY (IDUrl,TagPosition,AttrPosition),
  KEY Idx_Attribute (Attribute(8))
);

#
# Table structure for table 'HtmlStatement'
#
CREATE TABLE HtmlStatement (
  IDUrl mediumint(8) unsigned DEFAULT '0' NOT NULL,
  TagPosition smallint(5) unsigned DEFAULT '0' NOT NULL,
  Tag varchar(32) DEFAULT '' NOT NULL,
  Statement varchar(255),
  PRIMARY KEY (IDUrl,TagPosition),
  KEY Idx_Tag (Tag(4))
);

#
# Table structure for table 'Link'
#
CREATE TABLE Link (
  IDUrlSrc mediumint(8) unsigned DEFAULT '0' NOT NULL,
  IDUrlDest mediumint(8) unsigned DEFAULT '0' NOT NULL,
  TagPosition smallint(5) unsigned DEFAULT '0' NOT NULL,
  AttrPosition tinyint(3) unsigned DEFAULT '0' NOT NULL,
  Anchor varchar(255) binary DEFAULT '' NOT NULL,
  LinkType enum('Normal','Direct','Redirection') DEFAULT 'Normal' NOT NULL,
  LinkResult enum('NotChecked','NotRetrieved','OK','Broken','AnchorNotFound','Redirected','NotAuthorized','EMail','Javascript') DEFAULT 'NotChecked' NOT NULL,
  PRIMARY KEY (IDUrlSrc,IDUrlDest,TagPosition,AttrPosition),
  KEY Idx_IDUrlDest (IDUrlDest),
  KEY Idx_Anchor (Anchor(8)),
  KEY Idx_LinkType (LinkType),
  KEY Idx_LinkResult (LinkResult)
);

#
# Table structure for table 'Schedule'
#
CREATE TABLE Schedule (
  IDUrl mediumint(8) unsigned DEFAULT '0' NOT NULL,
  IDServer smallint(5) unsigned DEFAULT '0' NOT NULL,
  Url varchar(255) binary DEFAULT '' NOT NULL,
  Status enum('ToBeRetrieved','Retrieved','CheckIfExists','Checked','BadQueryString','BadExtension','MaxHopCount','FileProtocol','EMail','Javascript','NotValidService','Malformed') DEFAULT 'ToBeRetrieved' NOT NULL,
  CreationTime datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
  IDReferer mediumint(8) unsigned DEFAULT '0' NOT NULL,
  HopCount tinyint(3) unsigned DEFAULT '0' NOT NULL,
  PRIMARY KEY (IDUrl),
  KEY Idx_IDServer (IDServer),
  KEY Idx_Url (Url(64)),
  KEY Idx_Status (Status)
);

#
# Table structure for table 'Server'
#
CREATE TABLE Server (
  IDServer smallint(5) unsigned DEFAULT '0' NOT NULL,
  Server varchar(255) DEFAULT '' NOT NULL,
  IPAddress varchar(15),
  Port smallint(5) unsigned DEFAULT '0' NOT NULL,
  HttpServer varchar(255) DEFAULT '' NOT NULL,
  HttpVersion varchar(255) DEFAULT '' NOT NULL,
  PersistentConnection tinyint(1) unsigned DEFAULT '0' NOT NULL,
  Requests smallint(5) unsigned DEFAULT '0' NOT NULL,
  PRIMARY KEY (IDServer),
  KEY Idx_Server (Server(24)),
  KEY Idx_Requests (Requests)
);

#
# Table structure for table 'Url'
#
CREATE TABLE Url (
  IDUrl mediumint(8) unsigned DEFAULT '0' NOT NULL,
  IDServer smallint(5) unsigned DEFAULT '0' NOT NULL,
  Url varchar(255) binary DEFAULT '' NOT NULL,
  ContentType varchar(32) DEFAULT '' NOT NULL,
  ConnStatus enum('OK','NoHeader','NoHost','NoPort','NoConnection','ConnectionDown','ServiceNotValid','OtherError') DEFAULT 'OK' NOT NULL,
  TransferEncoding varchar(32) DEFAULT '' NOT NULL,
  LastModified datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
  LastAccess datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
  Size int(11) DEFAULT '0' NOT NULL,
  StatusCode smallint(6) DEFAULT '0' NOT NULL,
  ReasonPhrase varchar(32) DEFAULT '' NOT NULL,
  Location varchar(255) binary DEFAULT '' NOT NULL,
  Title varchar(255) DEFAULT '' NOT NULL,
  SizeAdd int(11) DEFAULT '0' NOT NULL,
  PRIMARY KEY (IDUrl),
  KEY Idx_IDServer (IDServer),
  KEY Idx_Url (Url(64)),
  KEY Idx_ContentType (ContentType(16)),
  KEY Idx_StatusCode (StatusCode)
);

#
# Table structure for table 'htCheck'
#
CREATE TABLE htCheck (
  StartTime datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
  EndTime datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
  ScheduledUrls mediumint(8) unsigned DEFAULT '0' NOT NULL,
  TotUrls mediumint(8) unsigned DEFAULT '0' NOT NULL,
  RetrievedUrls mediumint(8) unsigned DEFAULT '0' NOT NULL,
  TCPConnections mediumint(8) unsigned DEFAULT '0' NOT NULL,
  ServerChanges mediumint(8) unsigned DEFAULT '0' NOT NULL,
  HTTPRequests mediumint(8) unsigned DEFAULT '0' NOT NULL,
  HTTPSeconds mediumint(8) unsigned DEFAULT '0' NOT NULL,
  HTTPBytes mediumint(8) unsigned DEFAULT '0' NOT NULL,
  User varchar(255) DEFAULT '' NOT NULL,
  PRIMARY KEY (StartTime,EndTime)
);
