ANET.at HomepageSearchEngine History
====================================

1. Chronology of version releases
---------------------------------

Date        Version
__________  ____________
2006-12-12  3.62 beta6+
2006-10-31  3.61
2005-04-14  3.6
2004-09-28  3.53+
2004-01-20  3.53
2004-01-11  3.52
2003-12-15  3.51
2003-10-24  3.5
2003-06-10  3.42
2003-03-27  3.41
2003-01-30  3.4+
2002-09-24  3.4
2002-06-03  3.37
2002-01-30  3.36
2001-12-06  3.35
2001-11-04  3.34
2001-10-23  3.33
2001-07-04  3.32
2001-05-27  3.31
2001-03-29  3.3
2001-02-01  3.21
2000-12-14  3.2
2000-10-27  3.1
2000-09-03  3.03
2000-08-08  3.02
2000-07-17  3.01
2000-06-18  3.0
2000-03-05  2.07
2000-01-31  2.06
2000-01-18  2.05
2000-01-10  2.04a
1999-12-27  2.04
1999-10-25  2.03
1999-10-04  2.02
1999-09-27  2.01
1999-09-14  2.0
1999-09-06  1.01
1999-08-02  1.0

2. History of version changes ("change log")
--------------------------------------------

____________________________________________________________________________
v3.62 beta6+ (December 12, 2006) (What's new)

+ support for searching PDF files (indexed search method only).
+ new shell command "pdfconvert" which converts supported PDF files into
  plain text format and determines unsupported PDF files.
+ improvements to the "spider" shell command:
  + new option '-pdf2txt', which adds *.pdf PDF URLs to the URL-list if
    corresponding *.pdf.txt URLs to text files (as created by the
    "pdfconvert" command) exist.
  + properly handle the "noindex" directive of the robots meta-tag. The
    manual now contains a table of all possible index/follow actions and
    their corresponding robots meta-tag directives.
  + the '-querystrings' option accepts arguments to specify only those
    variable names, optionally with (wildcarded) values, that should be
    kept (and re-ordered) within query strings.
  + if a URL to be verified contains a query string, it will first be
    requested by HEAD, instead of making a full GET request.
    A GET request will follow only if the URL has been successfully
    verified to be 'text/html' content without a restricting robots
    meta-tag.
  + if a HEAD request reveals that a URL is redirected (status code 302),
    and the returned Content-Type is not text/html, a GET request follows
    to determine the correct Content-Type.
  + Last-Modified header information will now always be printed once the
    URL has passed verification.
+ if the URL-list file contains *.pdf URLs, the "geturls" shell command
  automatically gets the corresponding *.pdf.txt URLs, in order to support
  URL-lists created by the "spider" command with the '-pdf2txt' option set.
+ the "makelist" shell command automatically adds *.pdf.txt files to the
  list of Non-HTML files, as long as the corresponding *.pdf files exist or
  if they have been caught by the "geturls" command.
+ the "index" shell command automatically adds *.pdf.txt files to the list
  of Non-HTML files, as long as the corresponding *.pdf files exist or if
  they have been caught by the "geturls" command.
+ both shipped cronjob sample scripts for Unix and Windows have been
  updated to include PDF support when building the search index.
+ the "Search text of Non-HTML files" checkbox area in the advanced
  built-in search box of Pro editions now also displays an icon to indicate
  support of PDF files.
+ a built-in icon to be displayed on the results page for PDF documents is
  available. The default icon image for a PDF document can be changed via
  the new "imgsrc_pdfdoc" directive of the hse.ini file.
+ the results details for a PDF document include the PDF version,
  displayed when mousing over the PDF icon and after the file size.
+ the built-in icon for RTF documents has been changed to clearly stand
  for RTF.
+ the help window available from the built-in search box now looks nicer
  and is made of valid XHTML. Its elements can be designed via the
  HomepageSearchEngine.css style sheet.
+ the shipped HomepageSearchEngine.css Style Sheet has been updated and now
  includes a definition for the help window's background-color.
+ when a result head should print the title, but there is no title
  available in the result file, the file's name (and not the file's entire
  URL) will be printed.
+ ensure that result URLs are only passed through the "highlightmatches"
  filter when their Content-Type is either text/html or text/plain. Open
  all other URLs by redirecting them directly to the document.
+ when a result URL redirects directly to the document, don't do the
  redirection by sending a "Location:" header. Instead, create a new HTML
  document that does the redirection. This prevents Internet Explorer from
  showing a blank page in a new window instead of opening an RTF or PDF
  file with its associated application
+ add "#page=n" to the direct URLs of PDF files, to open supported PDF
  files at the specified page number n (currently always 1).
+ new hse.ini directive "notarget_list" to specify a list of URLs where the
  "target" directive should be ignored
+ glitch corrected that went into HSE 3.61: query links to global search
  engines (set via hse.ini's "query-links" directive) contained the entire
  query string instead of only the actual terms to be queried
+ the distributed packages for Unix platforms are now BZip2 compressed TAR
  archives (.tbz2 files), rather than GNU-Zip compressed TAR archives
  (.tgz files), to reduce file sizes. Under Windows, they can be unpacked
  with WinZip as of version 11.0

____________________________________________________________________________
v3.61 (October 31, 2006)

+ access to certain configuration sets can be disallowed by using a
  dynamic HTML template in those conf sets, having a "HSE-Disallow-Conf"
  meta-tag set to "true". Thus, adding this meta-tag to a dynamic template
  will disallow HSE accessing the conf set calling that template.
  Both the shipped PHP and SSI enabled sample template files have been
  updated to include this meta-tag, set to "false" by default (allowing
  access).
+ when a dynamic HTML template is used, the number of HSE's configuration
  set (as defined via the "conf" query variable) is available as the
  "HTTP_HSE_CONF" HTTP Header Environment Variable. Both the shipped PHP
  and SSI enabled sample template files have been updated to show how to
  obtain this value. This may be useful to conditionally disallow access
  to certain conf sets (as explained above), especially within a PHP based
  user access system.
+ user properties (like the user's IP address and the value of a specified
  cookie) can be provided to the environment of the dynamic HTML template.
  This is useful in a cookie based user access system written in PHP.
+ the "geturls" command now works properly with URLs containing
  Authorization data (such as 'http://username:password@www.somesite.tld/'
  or - without password - 'http://username@www.somesite.tld/')
+ when the "changeurls" command restores URLs in index files and the URL
  found in the URL-list file contains Authorization data, the Authorization
  data will not be copied into the index files
+ when HomepageSearchEngine prints a date whose format is not configurable
  (as output from the Shell Executable), the format is now always the
  ISO 8601 compliant international standard date notation (YYYY-MM-DD)
+ support for less used platforms has been discontinued. The platforms
  Win32, Linux, FreeBSD, Solaris and MacOSX are still supported and will be
  so in future releases. This allows us to focus on much more important
  things rather than wasting time with support of platforms that probably
  no one needs. If you really need a version for a previously supported
  platform, contact us.

____________________________________________________________________________
v3.6 (April 14, 2005)

+ the "hse.ini" configuration file has been reorganized
+ "protection_time" hse.ini directive added to protect the CGI application
  against overloading.
+ the directives "formtable_width", "formtable_border-color",
  "formtable_background-color" and "formtable_background-image" have been
  removed. Instead, the search box's general appearance is now defined by
  the HomepageSearchEngine.css Style Sheet's ".HSE-searchbox" properties.
+ the "formtable_input-size" directive has been removed. Instead, the
  style of the search box's input text field is now defined by the Style
  Sheet's ".HSE-inputtext" properties.
+ where to place the search box is now controlled by the "searchbox_place"
  directive, allowing it at the top and/or at the bottom, or none.
+ the "formtable_alignment" directive has been replaced by
  "searchbox_align" which defaults to the value "auto"
+ the "results_details" directive's keyword regarding the icon has been
  simplified: "icon:custom16x16" (to show it in the size of 16 x 16
  pixels) has been removed because its style is defined by the Style Sheet
  (".HSE-icon" properties). The default keyword is now just "icon", which
  will show a custom icon if present, or otherwise print a default icon.
+ "imgsrc_webpage", "imgsrc_rtfdoc" and "imgsrc_textfile" directives have
  been added to specify source URLs of default icon images
+ the directive "results_descriptions" has been renamed to "description"
+ the directives "results_previous_img" and "results_next_img" have been
  removed. Instead, the source URLs of the "previous" and "next" images
  within the navigation panel below the results are now specified by the
  directives "imgsrc_previous" and "imgsrc_next". The style of these
  images is defined by the Style Sheet's ".HSE-nav-image" properties.
+ the style of the currently displayed results range (within the
  navigation panel below the results) is now defined by the Style Sheet's
  ".HSE-current-range" properties.
+ number of possible categories increased to 99
+ Queries file has been updated: 'A9.com' added, less important search
  engines removed, some URLs corrected
+ the OpenSSL packages for the platforms that support accessing https URLs
  (Windows 32bit, GNU/Linux, FreeBSD, Sun Solaris and WindRiver BSD/OS)
  have been updated to the latest OpenSSL version (0.9.7e)

____________________________________________________________________________
v3.53+ (September 28, 2004)

+ when a results URL is shown using the "highlightmatches" feature, the
  UTF-8 flag is only turned on if the restrictive search options are set
  to 'matchcase=off' or 'noparts=on' and hse.ini's "utf8" directive is set
  to "on" and active.
+ If a custom form contains a "uft8" parameter (to override the usual
  rules of applying the UTF-8 flag), it can be set to "off" (to never
  apply the UTF-8 flag), to "on" (to always apply it) or to "auto" (to
  only apply it on 'matchcase=off' or 'noparts=on' searches)
+ URLs with an unsupported Character Encoding that are shown using the
  "highlightmatches" feature or the "URL Header inspector" fall back to
  the default Character Encoding
+ fixed: on some rare webservers such as Rapidsite/Apa an Error 500 may
  have occurred
+ fixed: no results were printed when the "results_global" directive did
  not contain the "summary" keyword and more results have been found than
  specified with the "max_found_files" directive
+ the '-cat' option is now mandatory for the "spider" and "geturls"
  commands
+ the "spider" command now gives an error when the directory to be
  prepared for the "geturls" command is not below the "basepath" directory.
+ HomepageSearchEngine used as spider or HTTP client also recognizes
  "Last-Modified" headers in uncommon formats such as
  "Wed Sep 15 08:38:21 2004 GMT" or "Wednesday, 15-Sep-04 08:38:21 GMT".
  The "Last-Modified" value will be printed along with the corresponding
  time in ISO standardized format, such as "2004-09-15, 08:38:21"
+ the "geturls" command now skips getting a URL when the associated file
  name to be saved as exceeds the length of 250 characters
+ the style of the icon printed before each result can be customized more
  effectively (using the Style Sheet's additional ".HSE-icon-div"
  definition)

____________________________________________________________________________
v3.53 (January 20, 2004)

+ fixed: determination of the Executable's directory works properly on
  all Windows platforms (there may have been problems with version 3.52 on
  some Windows systems)
+ fixed: forcing the UTF-8 flag to be on by delivering a "uft8" parameter
  now also works on search methods other than the indexed search.
+ fixed: the 'WindRiver BSD/OS' package now works on all BSD/OS 4.x
  versions
+ some Style Sheet improvements for nice dynamic CSS2 features on form
  elements

____________________________________________________________________________
v3.52 (January 11, 2004)

+ further spider improvements:
  + the directory containing the previously grabbed files will now be
    cleaned up, so the cronjob script does not need to use system commands
    for that anymore. This behaviour can be unset by applying the new
    '-nocleanup' option.
  + already existing files with the same Last-Modified date as the remote
    URLs will now be provided to be used as cache.
  + the '-noquerystrings' option has been replaced by the '-querystrings'
    option. The former '-noquerystrings' behaviour is now set by default
    and can be unset by applying '-querystrings'.
+ the "geturls" command now uses the cache, unless the new '-nocache'
  option is set.
+ the default Character Encoding is not limited to byte based Encodings
  (previously called "character sets"), now supporting Unicode Encodings
  (such as "utf-8") as well. Consequently, the name of the "charset"
  directive has been changed to "encoding".
+ the default Character Encoding is now checked to be W3C compliant. If
  the check fails, a message will be printed, containing a list of all
  supported W3C approved Encoding names
+ the search page's Encoding is now sent directly as HTTP Header instead
  of as a meta-tag within the HTML content
+ the "URL Header inspector" now also prints the URL's Character Encoding
+ when a results URL is shown using the "highlightmatches" feature, the
  original Character Encoding is preserved
+ enabling case-insensitive searches on Non-ASCII characters is now done
  via the new "utf8" directive. It replaces the previously used "locale"
  directive. If enabled, a so-called "UTF-8 flagged search" takes place.
+ when a UTF-8 flagged search took place, that can be identified by
  mousing over the "Required time:" value on top of the results page
  (unless displaying of time has been disabled).
+ for optimised speed, searches with default restrictive options
  ('matchcase=on' and 'noparts=off') automatically turn off the UTF-8
  flag, since it is not needed in those cases
+ within the found files (with the highlighted matches), the search will
  always take place UTF-8 flagged, since the overhead is not significant
+ the UTF-8 flag can always be forced on or off, overriding the usual
  rules, by manually delivering a "uft8" parameter with the value "on" or
  "off". This may be useful for testing the performance difference.
+ UTF-8 flagged case-insensitive searches now perform better
+ restoring the search terms into a custom input form via the
  "fill_form()" JavaScript function (included in 'hse_customform.js') can
  now be forced to always work properly, also with Character Encodings
  that previously may have produced garbage on some characters. To do so,
  deliver a parameter called "encterms". That will add the terms in the
  "encodeURIComponent" format, as "enc" delivery parameter, to the query
  string.
+ support for https URLs on the WindRiver BSD/OS platform

____________________________________________________________________________
v3.51 (December 15, 2003)

+ additional license models introduced: "Wildcard Site license" and "Host
  license" (a global, machine-based license key) - see "license.txt" for
  details
+ all commands accept different values given to the '-debug' option, to
  specify different verbose levels
+ spider improvements:
  + supports URLs that require Authorization
  + added '-prerobotsfile[=FILE]' option to add custom robot rules to
    those defined in the site's /robots.txt file. To be verbose (only)
    about robot rules related behaviour, you can set the
    '-debug=robotrules' option.
  + accepts incorrect syntax within the robot rules, when the command
    doesn't end with ":" (such as "User-agent " instead of "User-agent: ")
  + accepts (incorrect) "\" characters in found links to be "/", acting as
    directory separator
  + cleans up found links containing "/./" or "/../"
+ the "geturls" command now stores each grabbed file with the
  Last-Modified timestamp determined from the remote URL
+ lines in the "hse.ini" file can be continued on the following one by
  ending them with ' \'
+ fixed: "Autocomplete indexing" could not work properly when used without
  a '-cat' option
+ fixed: in some circumstances, a custom value of the "results_details"
  directive was not determined properly
+ rewritten, enhanced cronjob scripts ("hse_cronjob.sh" and
  "hse_cronjob.bat"), containing a well documented example of how to
  spider some sites
+ Queries file ("hse_queries.ini") is now pre-configured to enable a query
  to Google in several languages
+ updated documentation, both in the Manual and within "hse.ini"
+ WindRiver (formerly BSDi) BSD/OS platform is supported

____________________________________________________________________________
v3.5 (October 24, 2003)

+ new libraries bring these advantages:
  + support for https URLs (SSL enabled pages) on FreeBSD, GNU/Linux, Sun
    Solaris and
    Windows 32bit platforms
  + converting character sets does not require iconv (GNU libiconv / GNU
    libc) anymore
  + shared object files now reside in the executable's "lib" sub
    directory, to make things easier to survey
  + shared object files have the "so" extension on all platforms
    (including Windows and Mac), to make things more uniform
  + no "locale-enabled HSE Executable" is required anymore to solve the
    "always-case-sensitive" bug affecting Non-ASCII characters. Now, the
    search string and the searched text are treated directly as encoded in
    the character set specified by the "charset" directive. To avoid speed
    decreases on case-insensitive searches, this feature only applies to
    indexed searches and must be enabled using the "locale" directive
    (which now behaves differently than previously).
+ "totalmatches" keyword added to the "results_global" directive
+ a new, more flexible directive "query-links" replaces the "engine-links"
  part of the "results_global" directive
+ global search engines to be queried can be fully customized via a
  central Queries file ("hse_queries.ini")
+ "rank", "head:title (T)" or "head:description"; "description", "nobr",
  "print:'...'" and "link:'...'" keywords added to the "results_details"
  directive
+ the style of the ranking number before each result can be customized via
  the Style Sheet (using the ".HSE-rank" definition)
+ to get the most relevant results in the first search, the pre-built
  input form now defaults to "matchcase=on"
+ design of the pre-built input form slightly simplified
+ fixed: under IIS 6 (on Windows 2003), determination of the Executable's
  directory may not have worked properly
+ fixed: when the highlightmatches feature is turned on to highlight the
  matches in the result files, newlines within and
  ...tags will be preserved properly (that may have broken some JavaScript
  functionality)
+ fixed: wildcard symbols at word boundaries have not been treated
  properly
+ fixed: the result description for a RichTextFormat document may have
  included some garbage characters
+ the spider now only accepts a /robots.txt file when its MIME type is set
  properly (to 'text/plain')
+ it is now possible to spider an unlimited number of URLs (by running the
  "spider" command with the '-max=-1' option)
+ restoring Japanese terms into a custom input form via the "fill_form()"
  JavaScript function (included in 'hse_customform.js') now also works
  properly using Apple's Safari browser
+ language files slightly revised
+ language support for Croatian (language code "hr")
+ updated Czech language files (language code "cs") to work properly with
  the "iso-8859-2" character set
+ discontinued support for the platform BSDi BSD/OS

____________________________________________________________________________
v3.42 (June 10, 2003)

+ File-list files are now sorted by date of last modification, to ensure
  that the most recently modified files are searched first. The sorting
  method can be changed by the new '-sort=date|name|none' option of the
  "makelist" command.
+ Incremental indexing (using the "index" command's new
  '-part=PART[/TOTALPARTS]' option) allows indexing a large number of
  files via the web based console, which would probably fail when tried in
  one step.
+ "Autocomplete indexing" feature allows easy index creation using the web
  based Admin Area, with just one click. All required actions will be
  performed automatically, such as switching to incremental indexing if
  necessary.
+ the style of the icon printed before each result can be customized via
  the Style Sheet (using the ".HSE-icon" definition)
+ When the search source for an indexed search is listed (by searching for
  "list:files"), the day of last modification of each file is printed (in
  addition to its URL and file size).
  This makes it possible to easily verify whether the file-list has been
  sorted by date of last modification.
+ When the search source for an on-the-fly search is listed (by searching
  for "list:files"), the files appear in the order the real search takes
  place. That list can be displayed in alphabetical order instead by
  selecting "the path name" from the "Show ... hits ... sorted by" drop
  down menu of the pre-built input form.
+ all commands can be used with the '-debug' option, to enable verbose
  mode
+ HTML files will not be searched and indexed if they contain a "robots"
  meta tag with the content "noindex" or "none" (unless it contains
  "search"). These skipped files can be viewed when the "index" command is
  used with the '-debug' option.
+ improved Admin Area (with a Session ID based, server-sided User
  authentication system)

____________________________________________________________________________
v3.41 (March 27, 2003)

+ spider improvements:
  + supports the /robots.txt Robots Exclusion Protocol (with
    "HomepageSearchEngine" as the robot name)
  + supports "robots" meta tags (content="noindex, nofollow, none")
  + '-noquerystrings' option added to cut query strings from links
  + directory URLs will only be recognized once, regardless of whether
    they have a trailing slash or not
+ added '-prefix' option for the "changeurls" command to directly set the
  local start URL to be prefixed
+ to be more flexible, the strings in the "ban_list", "search_always" and
  "categories_sourceNR" directives are now *not* wildcarded at the
  beginning and at the end. To do so, the "*" wildcard symbol must be
  added at the desired positions.
+ ASP (Active Server Page) pages are also recognized with the .aspx
  extension (as created by ASP.NET)
+ when a required shared object is missing, the error message should point
  to the object in question instead of just printing an "Unknown error"
+ the design of results output has been changed slightly to be similar to
  that currently known from Google and AltaVista
+ the Free version (formerly Light edition) is now more configurable (see
  http://free.HomepageSearchEngine.com)

____________________________________________________________________________
v3.4+ (January 30, 2003)

+ libraries replaced by newer versions in order to support new UTF-8
  related features
+ a custom input form that preserves the previous form settings now also
  restores Japanese terms properly using IE and Opera
+ automatic determination of the CGI executable's URL now also works
  properly on special server environments (e.g. with sbox)
+ language code for Japanese changed from mistakenly used "jp" to the
  proper ISO 639-1 code "ja"
+ problem solved that may have occurred when trying to access the admin
  area via the https protocol
+ added IANA character set information to each "WhatsThis.txt" file
  residing in the language directories
+ if platform detection by the shell script ("platform.cgi") fails, the
  Perl script "platform.pl" should do its job
+ discontinued support for the platform GNU/Linux-mips

____________________________________________________________________________
v3.4 (September 24, 2002)

+ new shell command "spider" which can spider an entire site and makes the
  URL-list file to grab URLs automatically
+ flat (file-list based) search method introduced which combines a
  semi-on-the-fly and a semi-indexed search method
+ indexing process split into two steps in order to consume fewer
  resources in each process: making of the file-list (by the new
  "makelist" command) and creating of the index files (by the "index"
  command)
+ Unicode support for all ASCII and Latin-1 characters, in both
  hexadecimal and decimal notations
+ alternative "locale-enabled" HSE executable added that supports the new
  "locale" directive to solve the "always-case-sensitive" bug affecting
  Non-US-ASCII characters on English based systems
+ added "title (T)" key word for the "results_details" directive to
  disable or customize printing of the title of a web page before each
  listing of a found file on the results page
+ added "maxsize:SIZE" key word for the "results_href" directive to
  automatically disable the highlightmatches/gotofirstmatch feature on
  target files with a size higher than specified, in order to reduce
  memory consumption
+ Extensible Markup Language (XML) files (.xml) are recognized as web
  pages
+ Wireless Markup Language (WML) files (.wml) are recognized as web pages
+ enhanced support for named entities: now 101 forms are supported,
  including the ones for:
  + all the 96 Latin-1 characters
  + the HTML ASCII characters (", &, < and >) and the Euro sign (€)
+ '&' in URLs to be printed will be converted into the '&amp;' entity to
  prevent strings from being interpreted as entities
+ fixed bug regarding '<' and '>' in titles
+ fixed bug regarding highlighting of more than one search term in the
  "Google-like" style of the results' descriptions
+ improvements when the highlightmatches feature is turned on to highlight
  the matches in the result files:
  + a newline that separates two words of a matching phrase will be
    ignored
  + documents with the text/xml MIME type will be printed directly by the
    browser (and not parsed through HSE)
  + all '&nbsp;' named entities replaced by their Unicode (numeric
    character reference) equivalent in order to be XML compliant
  + all '@' characters replaced by their Unicode (numeric character
    reference) equivalent to make eMail addresses harder to find by spam
    robots
+ Helper Application Perl CGI script "passurl.cgi" included to redirect
  the result URLs to a custom application
+ "hse_customform.js" JavaScript library now works properly with Mozilla 1
  and Opera 6
+ in the FreeBSD package, a wrapper script has been
  added to achieve compatibility with all FreeBSD 4.x installations
+ when the "lang" delivery parameter is set to "de", the mouseover titles
  on the pre-built input form are printed in German
+ character set for Thai (language code "th") changed from "windows-874"
  to "tis-620"
+ language code for simplified Chinese (Chinese/China) changed from "zh"
  to "zh-cn"

____________________________________________________________________________
v3.37 (June 3, 2002)

+ added "icon:default|custom16x16|custom" key words for the
  "results_details" configuration directive to show an icon image with
  each result
+ the "sort" parameter can now also be "name" to sort the search results
  by the name of the file path
+ the value of an "append" delivery parameter will be appended to the
  result URLs in order to support dynamic shopping carts
+ HTML sections to be excluded from being searched can now also be spanned
  using the