Creating a file content crawler with ColdFusion

This tutorial will show you how to create a local file crawler that finds every file of a specified type (e.g. PDF files) within a directory and its child directories.

I want to begin by explaining a little bit about what a crawler is, since some of you might be thinking... a what? :)

A crawler is a script that walks through the locations you point it at and returns the items that match what you told it to find. I think the best explanation is the actual code itself, so let's get started:

The first example is a local file crawler. Say you have a directory structure that looks like this:

D:\websites\hello kitty\free_stuff.pdf

Now, notice that the PDF files can sit in any number of different folders under the D:\websites folder, so that folder becomes the ROOT FOLDER.

<!--- define an empty variable that will become a list of directories
        to search later in the application --->

<cfset current_directory_to_crawl = "">

<!--- now by default define the root folder to search, in this example D:\websites\ --->
<cfset next_directory_to_crawl = "D:\websites\">

<!--- Now define a variable that will tell the application later on if it should continue
        At default set the value to 'one' --->

<cfset crawl_again = 1>

<!--- now define a variable that will count the number of files found and set it to 'zero' by default --->
<cfset file_counter = 0>

<!--- do >>ONLY<< one extension per run --->
<cfset extension_to_crawl = "pdf">

<!--- define a variable to hold the file names of the files found  --->
<cfset file_container = "">

<!--- create a container to hold all files processed (If you are wanting to move them elsewhere) --->
<cfset file_completed = "">

<!--- OK, begin the processing here; the loop runs because
        crawl_again is set to 1 (it stops when set to 0) --->

<cfloop condition="crawl_again neq 0">

    <!--- first switch the directory values --->
    <cfset current_directory_to_crawl = next_directory_to_crawl>

    <!--- now clear the next --->
    <cfset next_directory_to_crawl = "">

    <!--- Clear the file container --->
    <cfset file_container = "">

    <!--- Now loop through the list of directories to crawl and look for the extensions --->
    <cfloop list="#current_directory_to_crawl#" index="dir" delimiters="|">

        <!--- now list the directory contents into the CurrentPull query --->
        <cfdirectory action="LIST" directory="#dir#" name="CurrentPull">

            <!--- first get all the files --->
            <cfloop query="CurrentPull">

                <!--- process everything returned by CFDIRECTORY except the first two records, which are "." and "..". Both tests must pass (AND), otherwise ".." is followed and the crawl climbs up the tree --->
                <cfif name neq "." AND name neq "..">

                <!--- display the current file/directory to the screen --->
                <cfoutput>#dir##name#<br></cfoutput>

                <!--- lets see if the current item is a file or directory --->
                <cfif type eq "dir">

                        <!--- Found a directory, set this folder as crawlable so on the next loop we can search it for PDF files --->
                        <cfset next_directory_to_crawl = ListAppend(next_directory_to_crawl, dir & name & "\", "|")>

                <cfelseif type eq "file">

                <!--- this is a file, see if the extension of the file is the one defined above --->
                    <cfif ListLast(name, ".") eq extension_to_crawl>
                        <!--- check that this file and its path are UNIQUE --->
                        <cfif NOT ListFind(file_completed, dir & name, "|")>

                            <!--- mark this file as completed --->
                            <cfset file_completed = ListAppend(file_completed, dir & name, "|")>

                            <!--- add the file to the container --->
                            <cfset file_container = ListAppend(file_container, dir & name, "|")>

                            <!--- add one to the file counter --->
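                            <cfset file_counter = file_counter + 1>

                        </cfif>
                    </cfif>

                </cfif>

                </cfif>

            </cfloop>

    </cfloop>

    <!--- now output the final values to the screen so we can see them --->
    <cfoutput>
    <!--- folders queued for the next pass --->
    <cfloop list="#next_directory_to_crawl#" index="folder" delimiters="|">
        #folder#<br>
    </cfloop>
    <!--- files found in this pass --->
    <cfloop list="#file_container#" index="files" delimiters="|">
        #files#<br>
    </cfloop>
    Files Found: #file_counter#<hr>
    </cfoutput>

    <cfif next_directory_to_crawl eq "">
        <!--- There are no more folders to crawl, stop the main loop --->
        <cfset crawl_again = 0>
    </cfif>

</cfloop>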

That's pretty much it! You now have a local crawler that finds files of a given type, and you can build on it to move, index, or otherwise process whatever it finds.
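If you are running ColdFusion MX 7 or later, CFDIRECTORY can descend into subfolders on its own through its recurse attribute, which may be a shorter route to the same list. The sketch below assumes that version (the attribute is not available on CF5 or CFMX 6, the platforms this tutorial was tested on); root_folder and found_files are names picked just for this illustration.

```cfml
<!--- Assumption: ColdFusion MX 7 or later (recurse attribute available) --->
<cfset root_folder = "D:\websites\">
<cfset extension_to_crawl = "pdf">
<cfset file_counter = 0>

<!--- let CFDIRECTORY walk every subfolder and keep only names matching *.pdf --->
<cfdirectory action="list"
             directory="#root_folder#"
             recurse="true"
             filter="*.#extension_to_crawl#"
             name="found_files">

<cfoutput>
<cfloop query="found_files">
    <!--- a folder could also be named something.pdf, so keep files only --->
    <cfif found_files.type eq "file">
        #found_files.directory#\#found_files.name#<br>
        <cfset file_counter = file_counter + 1>
    </cfif>
</cfloop>
Files Found: #file_counter#<hr>
</cfoutput>
```

The hand-written loop is still the way to go when you need to act on each folder as you reach it, or when you are on an older server.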

Questions? Comments? Email Me....

About This Tutorial
Author: Pablo Varando
Skill Level: Intermediate 
Platforms Tested: CF5,CFMX
Submission Date: July 19, 2003
Last Update Date: June 05, 2009
Discuss This Tutorial
  • The test that skips "." and ".." must join its two comparisons with AND, not OR. With OR the condition is always true, so ".." is treated as a folder and the script traverses up the directory tree. With AND it correctly traverses down the directory tree from your defined starting directory.
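To see why the operator matters, here is a small sketch that evaluates both forms of the test for three sample names. The sample list and the variable names with_or and with_and are made up for this illustration. Any name differs from at least one of "." and "..", so the OR form can never be false.

```cfml
<!--- sample entries: the two special records plus one ordinary folder name --->
<cfloop list=".,..,docs" index="name">
    <!--- OR: true for every name, so nothing is ever skipped --->
    <cfset with_or = (name neq "." OR name neq "..")>
    <!--- AND: false for "." and "..", true for everything else --->
    <cfset with_and = (name neq "." AND name neq "..")>
    <cfoutput>#name#: OR=#with_or#, AND=#with_and#<br></cfoutput>
</cfloop>
```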

