Indexing Web Pages
Write a program to create an index of a small collection of World Wide Web pages. Each "page" is a text file in a special format called HTML (HyperText Markup Language). The HTML format includes regular text and special HTML commands, which are always enclosed in angle braces. For example, the string <A HREF="layout.htm"> is an HTML command meaning that the following text should be highlighted; when the user clicks the highlighted text, a web browser fetches and displays the file layout.htm.
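The distinction between commands and regular text can be sketched in Python as follows. This is only an illustration under stated assumptions, not a reference solution: the names parse_page, COMMAND, HREF, and WORD are hypothetical, a word is assumed to be a maximal run of letters, and matching is assumed to be case-insensitive (the sample output reports "html" for pages whose text contains only "HTML").

```python
import re

# Assumption: a command is everything between '<' and the next '>'.
COMMAND = re.compile(r"<([^>]*)>")
# Assumption: a link target is the double-quoted value after HREF=.
HREF = re.compile(r'HREF\s*=\s*"([^"]*)"', re.IGNORECASE)
# Assumption: a word is a maximal run of letters.
WORD = re.compile(r"[A-Za-z]+")


def parse_page(source):
    """Return (links, words) for the raw text of one page.

    links: file names referenced by HREF commands, in order of appearance.
    words: lowercased words found outside the angle-brace commands.
    """
    links = []
    for command in COMMAND.findall(source):
        links.extend(HREF.findall(command))
    # Replace each command with a space so adjacent words stay separate.
    text = COMMAND.sub(" ", source)
    words = {w.lower() for w in WORD.findall(text)}
    return links, words
```

Under these assumptions, text inside a command, such as the file name in an HREF, never contributes to a page's word set.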
Your program's job is to read an HTML file called index.htm, all the files referenced within index.htm by the HREF command, all the files referenced by those files, and so on until there are no new files to read. Your program should also read the file webpage.in, which contains a list of words, and, for each word, list all the files referenced from index.htm that contain it (see the Sample Output).
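Reading files "until there are no new files to read" is a graph traversal that must remember which pages it has already seen, because pages may link to themselves or to each other. A minimal sketch, assuming a hypothetical callable links_of(name) that returns the file names referenced by a page (the name reachable_pages is also illustrative):

```python
from collections import deque


def reachable_pages(start, links_of):
    """Return the pages reachable from start, in first-visit order.

    links_of is an assumed helper: given a page name, it returns the
    list of file names that page references through HREF commands.
    """
    seen = {start}           # pages already queued, so none is read twice
    order = []
    queue = deque([start])
    while queue:
        name = queue.popleft()
        order.append(name)
        for target in links_of(name):
            if target not in seen:   # skips self and mutual references
                seen.add(target)
                queue.append(target)
    return order
```

For example, with links = {"index.htm": ["layout.htm", "index.htm"], "layout.htm": ["index.htm"]}, the call reachable_pages("index.htm", lambda n: links.get(n, [])) visits each of the two pages exactly once.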
The initial HTML file, where indexing starts, will be named index.htm. The other files, including webpage.in, follow it, with a single blank line separating each listing. The words in webpage.in will be placed one word per line, with no additional spaces.
List each word in the standard input file, followed by a list of the file names it is found in, in the following format:
"word" can be found in the following pages:
     filename1
     filename2

"word" can be found in the following pages:
     filename3

"word" can not be found in any page.
Here word is the word from the input file, and filename1, filename2, and so on are the names of the files containing the word. Each file name should be indented five spaces, and a single blank line should separate each listing.
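The output format can be sketched as a small Python function. The names format_report and pages_with_word are hypothetical, and placing each file name on its own line is an assumption; the statement only requires the five-space indent and the blank line between listings.

```python
def format_report(words, pages_with_word):
    """Build the report text for the given words, in input order.

    pages_with_word is an assumed mapping from each word to the list
    of file names containing it; missing words count as not found.
    """
    listings = []
    for word in words:
        pages = pages_with_word.get(word, [])
        if pages:
            lines = [f'"{word}" can be found in the following pages:']
            # Each file name is indented five spaces.
            lines.extend(" " * 5 + name for name in pages)
        else:
            lines = [f'"{word}" can not be found in any page.']
        listings.append("\n".join(lines))
    # A single blank line separates consecutive listings.
    return "\n\n".join(listings)
```

Calling print(format_report(["file", "recursion"], {"file": ["index.htm", "layout.htm"]})) produces one listing with two indented file names, a blank line, and then the "can not be found" line for "recursion".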
<HTML>
<HEAD>
<TITLE>Indexing Web Pages</TITLE>
</HEAD>
<BODY>
<P>Write a program to create an index of a small collection of World Wide Web pages. Each "page" is a text file in a special format called HTML (HyperText Markup Language). The HTML format includes regular text and special HTML commands, which are always enclosed in angle braces. For example, the string <A HREF="layout.htm"> is an HTML command meaning that the following text should be highlighted; a user click on the highlighted text would cause a web browser to fetch and display the file layout.htm.</P>
<H1>Following Links</H1>
<P>Don't forget that links can be <A HREF="index.htm"> self-referential</A>!</P>
</BODY>
</HTML>

<A bunch of gibberish and a word>
Note that there is no rule that the file needs to be legal HTML (if you know the rules), or that words really be wordseiwlaoieu;a.
<A HREF="index.htm">Watch out for mutual references!
</HTML>

file
index
html
HTML
recursion
word
is
"file" can be found in the following pages:
     index.htm
     layout.htm

"index" can be found in the following pages:
     index.htm

"html" can be found in the following pages:
     index.htm
     layout.htm

"HTML" can be found in the following pages:
     index.htm
     layout.htm

"recursion" can not be found in any page.

"word" can not be found in any page.

"is" can be found in the following pages:
     index.htm
     layout.htm