Creating a file content crawler with ColdFusion
This tutorial shows you how to create a local file crawler that finds every
file of a specified type (e.g. PDF files) within a directory and its
subdirectories.
I want to begin by explaining a little bit about what a crawler is, since some
of you might be asking... a what? :)
A crawler is a script that walks through a set of locations and returns the
items that match what you told it to find. The best explanation is the code
itself, so let's get started.
This example is a local file crawler. Say you have a directory structure that
looks like this:
D:\websites\information.pdf
D:\websites\account_info.pdf
D:\websites\mysite.com\info.pdf
D:\websites\hello kitty\free_stuff.pdf
Now, notice that the PDF files are spread across different folders, all of them
under D:\websites, so that folder will become the ROOT FOLDER.
<!--- define an empty variable that will become a list of directories
      to search later in the application --->
<cfset current_directory_to_crawl = "">
<!--- now define the root folder to search, in this example D:\websites\ --->
<cfset next_directory_to_crawl = "D:\websites\">
<!--- now define a variable that tells the application later on whether it
      should continue; by default set the value to 1 --->
<cfset crawl_again = 1>
<!--- now define a variable that counts the number of files found and set it
      to 0 by default --->
<cfset file_counter = 0>
<!--- do >>ONLY<< one extension per run --->
<cfset extension_to_crawl = "pdf">
<!--- define a variable to hold the file names of the files found --->
<cfset file_container = "">
<!--- create a container to hold all files processed (in case you want to
      move them elsewhere) --->
<cfset file_completed = "">
<!--- ok, begin the processing here; the loop runs while crawl_again is 1
      and stops when it is set to 0 --->
<cfloop condition="crawl_again neq 0">
    <!--- first switch the directory values --->
    <cfset current_directory_to_crawl = next_directory_to_crawl>
    <!--- now clear the next --->
    <cfset next_directory_to_crawl = "">
    <!--- clear the file container --->
    <cfset file_container = "">
    <!--- now loop through the list of directories to crawl and look for the
          extension --->
    <cfloop list="#current_directory_to_crawl#" index="dir" delimiters="|">
        <!--- now list the directory contents --->
        <cfdirectory action="LIST" directory="#dir#" name="CurrentPull">
        <!--- first get all the files --->
        <cfloop query="CurrentPull">
            <!--- process everything returned by CFDIRECTORY except the "." and
                  ".." records, which can be skipped for this example; a name
                  must differ from both, so the two tests are joined with AND --->
            <cfif name neq "." AND name neq "..">
                <!--- display the current file/directory to the screen --->
                <cfoutput>#name#<BR></cfoutput>
                <!--- let's see if the current item is a file or directory --->
                <cfif type eq "dir">
                    <!--- found a directory, set this folder as crawlable so on
                          the next loop we can search it for PDF files --->
                    <cfset next_directory_to_crawl = ListAppend(next_directory_to_crawl, dir & name & "\", "|")>
                <cfelseif type eq "file">
                    <!--- this is a file, see if its extension is the one defined
                          above; the ListLen test skips names that contain no dot --->
                    <cfif ListLen(name, ".") gt 1 AND ListLast(name, ".") eq extension_to_crawl>
                        <!--- here it checks to make sure that this file and its
                              path are UNIQUE --->
                        <cfif NOT ListFind(file_completed, dir & name, "|")>
                            <!--- mark this file as completed --->
                            <cfset file_completed = ListAppend(file_completed, dir & name, "|")>
                            <!--- add the file to the container --->
                            <cfset file_container = ListAppend(file_container, dir & name, "|")>
                            <!--- add one to the file counter --->
                            <cfset file_counter = file_counter + 1>
                        </cfif>
                    </cfif>
                </cfif>
            </cfif>
        </cfloop>
    </cfloop>
    <!--- now output the final values to the screen so we can see them --->
    <cfoutput>
    <hr><ol>
    <cfloop list="#next_directory_to_crawl#" index="folder" delimiters="|">
        <li>#folder#</li>
    </cfloop>
    </ol>
    <hr><ol>
    <cfloop list="#file_container#" index="files" delimiters="|">
        <li>#files#</li>
    </cfloop>
    </ol>
    <HR>Files Found: #file_counter#<hr>
    </cfoutput>
    <cfif next_directory_to_crawl eq "">
        <!--- There are no more folders to crawl, stop the main loop --->
        <cfset crawl_again = 0>
    </cfif>
</cfloop>
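That's pretty much it! You now have a local crawler that finds every file with
a given extension under a root folder. From here you can extend it, for example
by moving the files collected in file_completed elsewhere.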
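If your server is newer than the platforms this was tested on, a shorter variant may be possible. The sketch below is not the approach used in the listing; it assumes ColdFusion MX 7 or later, where cfdirectory accepts a recurse attribute and returns a directory column for each entry. The names root_directory, files_found and AllEntries are made up for this illustration, and the path is the sample root folder from this tutorial.

```cfm
<!--- Assumption: ColdFusion MX 7 or later (recurse attribute and directory column) --->
<cfset root_directory = "D:\websites\">
<cfset extension_to_crawl = "pdf">
<!--- pipe-delimited list of matching files, same convention as the tutorial --->
<cfset files_found = "">

<!--- list the root folder and everything below it in a single call --->
<cfdirectory action="list" directory="#root_directory#" recurse="yes" name="AllEntries">

<cfloop query="AllEntries">
    <!--- keep only files whose name has a dot and ends in the wanted extension --->
    <cfif AllEntries.type eq "file"
          AND ListLen(AllEntries.name, ".") gt 1
          AND ListLast(AllEntries.name, ".") eq extension_to_crawl>
        <!--- the directory column holds the folder the entry was found in --->
        <cfset files_found = ListAppend(files_found, AllEntries.directory & "\" & AllEntries.name, "|")>
    </cfif>
</cfloop>

<cfoutput>
<ol>
<cfloop list="#files_found#" index="found_file" delimiters="|">
    <li>#found_file#</li>
</cfloop>
</ol>
<hr>Files Found: #ListLen(files_found, "|")#<hr>
</cfoutput>
```

The trade-off is that everything under the root is listed in one query before any filtering happens, whereas the loop in the tutorial works level by level and also runs on CF5.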
Questions? Comments?
Email Me....
Date added: Sat. July 19, 2003
Posted by: Pablo Varando | Views: 23057 | Tested Platforms: CF5,CFMX | Difficulty: Intermediate
logic correction
<cfif name neq "." OR name neq ".."> (incorrect) should be AND not OR: <cfif name neq "." AND name neq ".."> (correct)
The first statement causes the script to traverse up the directory tree. The second correctly traverses down the directory tree from your defined starting directory.
Posted by: jason tate
Posted on: 03/16/2006 07:04 PM
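The correction in the comment above can be checked without touching the disk. This is a minimal sketch, assuming a hypothetical pipe-delimited list named sample_entries in which "." and ".." stand in for the two special directory records; the variable names passes_or and passes_and are invented for the illustration.

```cfm
<!--- Hypothetical sample data: two special records and one real file name --->
<cfset sample_entries = ".|..|info.pdf">
<cfoutput>
<cfloop list="#sample_entries#" index="entry" delimiters="|">
    <!--- OR: at least one comparison is always true, since no name equals both "." and ".." --->
    <cfset passes_or = (entry neq "." OR entry neq "..")>
    <!--- AND: true only when the name is neither "." nor ".." --->
    <cfset passes_and = (entry neq "." AND entry neq "..")>
    #entry# : OR = #YesNoFormat(passes_or)#, AND = #YesNoFormat(passes_and)#<br>
</cfloop>
</cfoutput>
```

The OR column comes out as Yes for all three entries, so the test filters nothing, while the AND column is Yes only for info.pdf. With OR, a ".." record would be appended to the crawl list as dir & "..\", which is how the script ends up walking upward.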