【71-75】
One of the difficulties in building an SQL-like query language for the Web is the absence of a database (71) for this huge, heterogeneous repository of information. However, if we are interested in HTML documents only, we can construct a virtual schema from the implicit structure of these files. Thus, at the highest level of (72), every such document is identified by its Uniform Resource Locator (URL), has a (73) and a text. Also, Web servers provide some additional information such as the type, length, and last modification date of a document. So, for data mining purposes, we can consider the set of all HTML documents as a relation:
Document(url, title, text, type, length, modif)
where all the (74) are character strings. In this framework, an individual document is identified with a (75) in this relation. Of course, if some optional information is missing from the HTML document, the associated fields will be left blank, but this is not uncommon in any database.
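As an illustration only, the Document relation described in the passage could be modelled roughly as follows. This is a minimal sketch in Python, not part of the original passage: the sample URLs and values are made up, and the `select` helper is a hypothetical stand-in for a query-language filter. Every field is a character string, and optional information that is missing is left blank.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Document:
    """One HTML document viewed as a row of Document(url, title, text, type, length, modif)."""
    url: str          # identifies the document
    title: str = ""   # blank when the HTML file has no title
    text: str = ""
    type: str = ""    # reported by the Web server
    length: str = ""  # kept as a character string, as in the passage
    modif: str = ""   # last modification date, also a character string


# Hypothetical sample data; these URLs and values are invented for the example.
documents = [
    Document(
        url="http://example.org/a.html",
        title="Page A",
        text="data mining on the Web",
        type="text/html",
        length="1024",
        modif="1997-03-01",
    ),
    # Optional information is missing here, so those fields stay blank.
    Document(url="http://example.org/b.html", text="no title here", type="text/html"),
]


def select(relation, predicate):
    """Return the rows of the relation that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


# Roughly: SELECT * FROM Document WHERE title = ''
untitled = select(documents, lambda d: d.title == "")
```

Keeping `length` and `modif` as strings follows the passage's statement that everything in the relation is a character string; a real system would probably parse them into numbers and dates before comparing them.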