Creating a File Content Crawler with ColdFusion

This tutorial shows you how to create a local file crawler that finds every file of a specified type (e.g. PDF files) within a directory and its child directories.

I want to begin by explaining a little bit about what a crawler is, since some of you might be thinking... a what? :)

A crawler is a script that walks through a set of locations and returns the items that match what you told it to find. I think the best explanation is the actual code itself, so let's get started:

Our example is a local file crawler. Say you have a directory structure that looks like this:

D:\websites\information.pdf
D:\websites\account_info.pdf
D:\websites\mysite.com\info.pdf
D:\websites\hello kitty\free_stuff.pdf

Notice that the PDF files are spread across several different folders under the D:\websites folder, so that folder will become the ROOT FOLDER.

<!--- define an empty variable that will become a list of directories
        to search later in the application --->

<cfset current_directory_to_crawl = "">

<!--- now by default define the root folder to search, in this example D:\websites\ --->
<cfset next_directory_to_crawl = "D:\websites\">

<!--- Now define a variable that tells the application later on whether it should keep crawling.
        By default set the value to 1 --->

<cfset crawl_again = 1>

<!--- now define a variable that will count the number of files found and set it to 'zero' by default --->
<cfset file_counter = 0>

<!--- do >>ONLY<< one extension per run --->
<cfset extension_to_crawl = "pdf">

<!--- define a variable to hold the file names of the files found  --->
<cfset file_container = "">

<!--- create a container to hold all files processed (in case you want to move them elsewhere) --->
<cfset file_completed = "">

<!--- ok, the processing begins here and keeps going for as long as
        crawl_again is set to 1 (it stops when set to 0) --->

<cfloop condition="crawl_again neq 0">

    <!--- first switch the directory values --->
    <cfset current_directory_to_crawl = next_directory_to_crawl>

    <!--- now clear the next --->
    <cfset next_directory_to_crawl = "">

    <!--- Clear the file container --->
    <cfset file_container = "">

    <!--- Now loop through the list of directories to crawl and look for the extension --->
    <cfloop list="#current_directory_to_crawl#" index="dir" delimiters="|">

        <!--- now list the directory contents --->
        <cfdirectory action="LIST"
                     directory="#dir#"
                     name="CurrentPull">

        <!--- go through everything CFDIRECTORY returned --->
        <cfloop query="CurrentPull">

            <!--- process everything returned by CFDIRECTORY except the "." and ".." records.
                    The two tests must be joined with AND: with OR the condition is always true,
                    and the crawler would follow ".." up the directory tree --->
            <cfif name neq "." AND name neq "..">

                <!--- display the current file/directory on the screen --->
                <cfoutput>#name#<BR></cfoutput>

                <!--- let's see if the current item is a file or a directory --->
                <cfif type eq "dir">

                    <!--- Found a directory, set this folder as crawlable so on the next loop we can search it for PDF files --->
                    <cfset next_directory_to_crawl = ListAppend(next_directory_to_crawl, dir & name & "\", "|")>

                <cfelseif type eq "file">

                    <!--- this is a file, see if its extension is the one defined above --->
                    <cfif ListLast(name, ".") eq extension_to_crawl>

                        <!--- here it checks to make sure that this file and its path are UNIQUE --->
                        <cfif NOT ListFind(file_completed, dir & name, "|")>

                            <!--- mark this file as completed --->
                            <cfset file_completed = ListAppend(file_completed, dir & name, "|")>

                            <!--- add the file to the container --->
                            <cfset file_container = ListAppend(file_container, dir & name, "|")>

                            <!--- add one to the file counter --->
                            <cfset file_counter = file_counter + 1>

                        </cfif>

                    </cfif>

                </cfif>

            </cfif>

        </cfloop>

    </cfloop>
 

    <!--- now output the values for this pass: the folders queued for the next pass,
            the files found in this pass, and the running total of files found --->
    <cfoutput>
        <hr>
        <ol>
            <cfloop list="#next_directory_to_crawl#" index="folder" delimiters="|">
                <li>#folder#</li>
            </cfloop>
        </ol>
        <hr>
        <ol>
            <cfloop list="#file_container#" index="files" delimiters="|">
                <li>#files#</li>
            </cfloop>
        </ol>
        <hr>
        Files Found: #file_counter#
        <hr>
    </cfoutput>

    <cfif next_directory_to_crawl eq "">
        <!--- There are no more folders to crawl, stop the main loop --->
        <cfset crawl_again = 0>
    </cfif>
</cfloop>

That's pretty much it! You now have a local crawler that finds every file with a given extension under a root folder, one directory level per pass.
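
For comparison, here is a minimal sketch of a different technique, not the loop above: letting CFDIRECTORY do the walking itself. It assumes a ColdFusion version whose CFDIRECTORY tag supports the recurse attribute and returns a directory column, which is newer than the CF5 and CFMX releases this tutorial was tested on. The query name pdf_files is just a name picked for this illustration.

```cfml
<!--- ASSUMPTION: this ColdFusion version supports recurse="yes" on CFDIRECTORY
        and returns a "directory" column in the result query --->
<cfdirectory action="LIST"
             directory="D:\websites\"
             recurse="yes"
             filter="*.pdf"
             name="pdf_files">

<cfset file_counter = 0>

<cfoutput query="pdf_files">
    <!--- only count real files, in case a folder name happens to match the filter --->
    <cfif type eq "file">
        #directory#\#name#<br>
        <cfset file_counter = file_counter + 1>
    </cfif>
</cfoutput>

<cfoutput><hr>Files Found: #file_counter#<hr></cfoutput>
```

The hand-written loop is still worth knowing: it runs on older servers and gives you a place to act on each folder or file as it is found.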

Questions? Comments? Email Me....

About This Tutorial
Author: Pablo Varando
Skill Level: Intermediate
Platforms Tested: CF5,CFMX
Submission Date: July 19, 2003
Last Update Date: June 05, 2009
Discuss This Tutorial
  • The test that skips "." and ".." should use AND, not OR. With OR (incorrect) the condition is always true, which causes the script to traverse up the directory tree. With AND (correct) it traverses down the directory tree from your defined starting directory.
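
To see the point of this comment in isolation, here is a small sketch that evaluates both versions of the condition against three sample names. The sample list and the variables with_or and with_and are made up for this illustration.

```cfml
<!--- hypothetical sample names: the two special entries plus one ordinary file --->
<cfloop list=".|..|info.pdf" index="name" delimiters="|">

    <!--- a name can never equal both "." and "..", so at least one neq is always true --->
    <cfset with_or = (name neq "." OR name neq "..")>

    <!--- true only when the name is neither "." nor ".." --->
    <cfset with_and = (name neq "." AND name neq "..")>

    <cfoutput>#name#: OR = #with_or#, AND = #with_and#<br></cfoutput>

</cfloop>
```

The OR version comes out true for all three names, so nothing is skipped. The AND version is false for "." and ".." and true only for info.pdf.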
