Creating a file content crawler with ColdFusion....

This tutorial shows you how to create a local file crawler that finds every file of a specified type (e.g. PDF files) within a directory and all of its subdirectories.

Let me begin by explaining a little bit about what a crawler is, since some of you might be asking... a what? :)

A crawler is a script that walks through a set of locations and returns the items that match what you tell it to find. The best explanation is the actual code itself, so let's get started:

This example is a local file crawler. Say you have a directory structure that looks like this:

D:\websites\information.pdf
D:\websites\account_info.pdf
D:\websites\mysite.com\info.pdf
D:\websites\hello kitty\free_stuff.pdf

Notice that the PDF files sit in several different folders, all of them under the D:\websites folder, so that folder becomes the ROOT FOLDER.
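Before the full script, it may help to see what a single directory listing gives you to work with. This is only a minimal sketch: it assumes the D:\websites\ folder above exists, and the query name rootListing is made up for illustration. The name and type columns it prints are the same ones the crawler tests further down.

```cfml
<!--- Sketch only: "rootListing" is a hypothetical query name --->
<cfdirectory action="list"
             directory="D:\websites\"
             name="rootListing">

<!--- Each row has a type ("File" or "Dir") and a name --->
<cfoutput query="rootListing">
    #type#: #name#<br>
</cfoutput>
```

Rows of type "Dir" are the folders the crawler queues up for the next pass, and rows of type "File" are the ones it checks for the extension.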

<!--- define an empty variable that will become a list of directories
        to search later in the application --->

<cfset current_directory_to_crawl = "">

<!--- now by default define the root folder to search, in this example D:\websites\ --->
<cfset next_directory_to_crawl = "D:\websites\">

<!--- Now define a variable that tells the application later on whether it should continue.
        By default, set the value to 1 --->

<cfset crawl_again = 1>

<!--- now define a variable that will count the number of files found and set it to 'zero' by default --->
<cfset file_counter = 0>

<!--- do >>ONLY<< one extension per run --->
<cfset extension_to_crawl = "pdf">

<!--- define a variable to hold the file names of the files found  --->
<cfset file_container = "">

<!--- create a container to hold all files processed (useful if you want to move them elsewhere) --->
<cfset file_completed = "">

<!--- OK, begin the processing here. The loop runs while
        crawl_again is set to 1 (it stops when set to 0) --->

<cfloop condition="crawl_again neq 0">

    <!--- first switch the directory values --->
    <cfset current_directory_to_crawl = next_directory_to_crawl>

    <!--- now clear the next --->
    <cfset next_directory_to_crawl = "">

    <!--- Clear the file container --->
    <cfset file_container = "">

    <!--- Now loop through the list of directories to crawl and look for the extensions --->
    <cfloop list="#current_directory_to_crawl#" index="dir" delimiters="|">

        <!--- now list the directory contents --->
        <cfdirectory action="LIST"
                         directory="#dir#"
                         name="CurrentPull">

            <!--- first get all the files --->
            <cfloop query="CurrentPull">

                <!--- process everything returned by the CFDIRECTORY except the "." and ".." entries (the first two records, when present). Those can be skipped for this example. Note the AND: with OR the test would be true for every name --->
                <cfif name neq "." AND name neq "..">

                <!--- display the current file/directory to the screen --->
                <cfoutput>#name#<BR></cfoutput>

                <!--- let's see if the current item is a file or directory --->
                <cfif type eq "dir">

                        <!--- Found a directory, set this folder as crawlable so on the next loop we can search it for PDF files --->
                        <cfset next_directory_to_crawl = ListAppend(next_directory_to_crawl, dir & name & "\", "|")>

                <cfelseif type eq "file">

                <!--- this is a file, see if the extension of the file is the one defined above --->
                    <cfif ListLast(name, ".") eq extension_to_crawl>
                        <!--- here it checks to make sure that this file and its path are UNIQUE --->
                        <cfif NOT ListFind(file_completed, dir & name, "|")>

                            <!--- mark this file as completed --->
                            <cfset file_completed = ListAppend(file_completed, dir & name, "|")>

                            <!--- add the file to the container --->
                            <cfset file_container = ListAppend(file_container, dir & name, "|")>

                            <!--- add one to the file counter --->
                            <cfset file_counter = file_counter + 1>

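                        </cfif> <!--- end: unique check --->
                    </cfif> <!--- end: extension check --->

                </cfif> <!--- end: file or directory check --->

                </cfif> <!--- end: "." and ".." check --->
            </cfloop> <!--- end: loop over CurrentPull --->

    </cfloop> <!--- end: loop over the directory list --->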
 

    <!--- now output this pass's values to the screen so we can see them:
            first the folders queued for the next pass, then the matching files found in this pass --->
    <cfoutput>
        <hr>
        <ol>
            <cfloop list="#next_directory_to_crawl#" index="folder" delimiters="|">
                <li>#folder#</li>
            </cfloop>
        </ol>
        <hr>
        <ol>
            <cfloop list="#file_container#" index="files" delimiters="|">
                <li>#files#</li>
            </cfloop>
        </ol>
        <hr>
        Files Found: #file_counter#<hr>
    </cfoutput>

    <cfif next_directory_to_crawl eq "">
        <!--- There are no more folders to crawl, stop the main loop --->
        <cfset crawl_again = 0>
    </cfif>
</cfloop>
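The bookkeeping in the script leans on a handful of list functions with a custom "|" delimiter. ColdFusion lists use a comma by default, and a folder name can legally contain a comma, while a pipe is not allowed in Windows file names, so the pipe is the safer separator. Here is a small sketch of those calls in isolation. The variable name queue is made up for illustration, and it uses ListFindNoCase rather than the ListFind in the script above, only to show the case-insensitive variant.

```cfml
<!--- Sketch only: "queue" is a hypothetical variable name --->
<cfset queue = "">
<cfset queue = ListAppend(queue, "D:\websites\mysite.com\", "|")>
<cfset queue = ListAppend(queue, "D:\websites\hello kitty\", "|")>

<cfoutput>
    <!--- how many folders are waiting to be crawled --->
    Folders queued: #ListLen(queue, "|")#<br>

    <!--- ListFindNoCase returns the position of the match, or 0 if there is none --->
    Position in queue: #ListFindNoCase(queue, "d:\websites\MYSITE.COM\", "|")#<br>

    <!--- the same trick the crawler uses to read a file extension --->
    Extension: #ListLast("free_stuff.pdf", ".")#
</cfoutput>
```

Since Windows paths are not case sensitive, swapping ListFind for ListFindNoCase in the uniqueness check might be worth considering if the same folder can be reached with different casing.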

That's pretty much it! You now have a local crawler that finds files by extension, and you can extend it to do much more.
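If your server runs a newer ColdFusion release (MX 7 and later, as far as I recall), cfdirectory can also walk the subfolders for you through its recurse attribute. This is a different technique from the manual loop above, not a drop-in copy of it, and the query name allPdfs is made up for illustration. Treat it as a sketch and check the cfdirectory reference for your version before relying on it.

```cfml
<!--- Sketch only: "allPdfs" is a hypothetical query name, and the recurse
      attribute is assumed to be supported by your ColdFusion version --->
<cfdirectory action="list"
             directory="D:\websites\"
             name="allPdfs"
             filter="*.pdf"
             recurse="yes">

<cfoutput>
    <ol>
        <cfloop query="allPdfs">
            <!--- a folder could also be named something.pdf, so check the type --->
            <cfif type eq "file">
                <li>#directory#\#name#</li>
            </cfif>
        </cfloop>
    </ol>
    Rows returned: #allPdfs.recordCount#
</cfoutput>
```

The manual loop is still worth understanding, because it lets you decide folder by folder what to crawl next, which the one-tag version does not.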

Questions? Comments? Email Me....

All ColdFusion Tutorials By Author: Pablo Varando