WebPagetest Forums
Database Query Search Time - Printable Version

+- WebPagetest Forums (https://www.webpagetest.org/forums)
+-- Forum: Web Performance (/forumdisplay.php?fid=3)
+--- Forum: Optimization Discussions (/forumdisplay.php?fid=5)
+--- Thread: Database Query Search Time (/showthread.php?tid=295)

Pages: 1 2 3 4 5 6

RE: Database Query Search Time - ringman57 - 08-24-2010 11:52 PM


First, great post! I have read through it 3 times and there is a ton of great info here.

I am upgrading a client's Verity collection to SOLR, and I am having many woes, most notably the dreaded "exceeded limit of maxWarmingSearchers" error. I am the guy that posted this issue on coldfusionjedi (link to post).

Deleting and re-creating the collection did help the issue initially, but when indexing large numbers of files, I think it's unavoidable, and you are forced to dramatically slow down your indexing to allow the searchers to catch up.

I entered this as a bug and it has been confirmed, with a workaround suggested. One option is to do a bulk insert, the way you are doing it, but that will not work for indexing physical files where there is no way to know the body ahead of time. The other is the way I am currently doing it: scheduling small batches (4 files, since my maxWarmingSearchers is set to 4) every 90 seconds or so. As you can guess, indexing 60k+ files is going to take a very long time. (300+ hours?)
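A quick sanity check of that estimate (a rough sketch in Python, using the figures above):

```python
# Back-of-the-envelope estimate: 60,000+ files indexed in scheduled
# batches of 4 files every ~90 seconds, as described above.
total_files = 60000
files_per_batch = 4
seconds_per_batch = 90

batches = total_files / files_per_batch          # 15,000 batches
total_hours = batches * seconds_per_batch / 3600

print(total_hours)  # 375.0
```

So the "300+ hours" guess is, if anything, optimistic at that pace.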

Increasing maxWarmingSearchers and the number of files per batch only moves the point at which the error occurs and uses more resources.

I am looking for a fresh perspective on the issue, and perhaps hoping that the knowledge you have gleaned from your research may permit you to offer a suggestion.

Anthony DeSalvatore

RE: Database Query Search Time - green-watch.org - 08-25-2010 01:05 AM

Greetings Anthony,

When I started having my maxWarmingSearchers problem, I actually came across your post on coldfusionjedi. I purged my collection and then reindexed it all over again like you did. It worked great at first because data takes less time to index with smaller collections.

Indexing takes longer as a SOLR collection grows larger. With my first indexing attempt, my script started executing while another one was already in progress. That really bogged down the server and things got messed up rather quickly. At that point, I created a variable in my GLOBAL table that checks whether the indexing script is already in progress before executing the indexing code again. Having said that, even when the variable indicated the code was not currently executing, it took some time before the number of files got recognized by the ColdFusion collection. I think there is some sort of "commit" bug if files get indexed too quickly.

You might want to revisit the bulk insert concept. If you have actual files, you can use CFFILE to read each file's content into a variable, then use that variable to populate a row of a query object. Once you have the query built, you can pass it to CFINDEX to do the bulk insert.
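The batching idea can be sketched outside CFML as well (Python here purely for illustration; the file names and field names are hypothetical). The point is that one multi-document SOLR `<add>` message replaces many individual inserts and commits:

```python
# Sketch of the "bulk insert" idea: read a batch of files and build a single
# multi-document SOLR XML update, so one HTTP POST indexes many files at once.
from xml.sax.saxutils import escape

def build_bulk_update(docs):
    """docs: list of (key, body) tuples -> SOLR <add> XML with one <doc> each."""
    parts = ["<add>"]
    for key, body in docs:
        parts.append(
            "<doc>"
            f"<field name=\"key\">{escape(key)}</field>"
            f"<field name=\"body\">{escape(body)}</field>"
            "</doc>"
        )
    parts.append("</add>")
    return "".join(parts)

# One request per batch instead of one request (and one commit) per file:
xml = build_bulk_update([("file1.txt", "first body"), ("file2.txt", "second & third")])
print(xml)
```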

I wish I could see what goes on under the hood of CFINDEX. I am sure it would be possible to create a custom function like I did for the CFSEARCH function.

By the way, if you want to post some example indexing code, I can take a glance at it on here and see if I have any suggestions.

Travis Walters

RE: Database Query Search Time - ringman57 - 08-25-2010 02:54 AM

Stay tuned! I just found out that you can create a query of file names/paths, point the body attribute at that column, and CFINDEX will index the file content.


Tried it and it works... I will do some more testing later today.

RE: Database Query Search Time - green-watch.org - 08-25-2010 03:25 AM

Hey There,

Bulk inserts definitely make CFINDEX work a bit better. I will tell you right now, though, it is not a solution to the maxWarmingSearchers problem. You will still have to slow down the script as the collection grows larger. However, bulk inserts seem to allow more documents to be added in a given time frame than adding them individually. I just wish I had the resources on my server to add a million at once instead of 10,000. Now it has me wondering how many documents I can really add at once. I plan on experimenting a bit before reindexing my entire product collection. Talk to you soon.

Update: I submitted another suggestion to the Adobe team. Bug / Enhancement Report 83931:

I think it would be nice if we could transfer information between a database and SOLR more easily through CFINDEX. While we can do bulk inserts via CFQUERY, there are only so many rows we can transfer at once because of limited system resources. One parameter of CFINDEX could be "datasource" to establish a connection to the database. Another parameter could be FIELDLIST, a two-dimensional array where one dimension holds the database column name and the second holds the SOLR column name. From there, SOLR could insert the entire database via whatever means necessary through CFINDEX.

Update #2: I found some information regarding distributed SOLR searches:


More than likely, an application would set up the shards parameter values as part of the SolrRequestHandler's defaults configuration in the solrconfig.xml, such that the names of all the sharded servers do not need to be passed in with the query every time.

Update #3 - Submitted Enhancement 83933 to Adobe:

The SOLR documentation describes a way to give documents a boost in terms of their score value regardless of what the user is searching for.

<doc boost="2.5">
  <field name="employeeId">05991</field>
  <field name="office" boost="2.0">Bridgewater</field>
</doc>

Can we have a way to give documents a boost with CFINDEX?

Travis Walters

RE: Database Query Search Time - green-watch.org - 08-25-2010 04:59 AM

Hey Everyone,

I have been looking into a way to index documents faster than CFINDEX when dealing with very large datasources. Here are some documents I have found:


I am thinking about creating my own wrapper.

Update: I just found some information on merging two collections.


I submitted this to Adobe as bug / enhancement #83935.

Travis Walters

RE: Database Query Search Time - green-watch.org - 08-25-2010 10:59 PM

Hey Anthony,

When you say you have actual files, are they formatted similar to this:

<add>
  <doc>
    <field name="key">KeyValue1</field>
    <field name="custom1">Custom1Value1</field>
    <field name="custom2">Custom2Value1</field>
    <field name="custom3">Custom3Value1</field>
    <field name="custom4">Custom4Value1</field>
    <field name="summary">SummaryValue1</field>
  </doc>
  <doc>
    <field name="key">KeyValue2</field>
    <field name="custom1">Custom1Value2</field>
    <field name="custom2">Custom2Value2</field>
    <field name="custom3">Custom3Value2</field>
    <field name="custom4">Custom4Value2</field>
    <field name="summary">SummaryValue2</field>
  </doc>
</add>

From what I have read, this is the format CFINDEX has to parse a file into so that SOLR will add or edit documents in the collection. There is also a CSV format SOLR will accept. However, with the CSV format, you cannot give fields boost values.

There might be a way to get SOLR to import data directly from a database:

Travis Walters

RE: Database Query Search Time - green-watch.org - 08-26-2010 08:29 PM

Hey Guys,

Revisiting a few issues I have talked about.

#1 - Price Ranges - instead of storing price ranges, you can store the actual prices. However, I found it important that all prices have the same number of characters, even if the data field is stored as type double. To do this, I get the string length of the price and then add leading zeros accordingly. At search time, you can do something like this to get prices between $98 and $108: custom2:[000000098.00 TO 000000108.00]
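The padding trick can be sketched outside CFML (a minimal Python illustration; the 12-character width matches the example above):

```python
# Zero-pad every price to a fixed width so lexicographic range queries like
# custom2:[000000098.00 TO 000000108.00] order the same way the numbers do.
def pad_price(price):
    """Format a price as 9 integer digits plus 2 decimals, zero-padded to 12 chars."""
    return f"{price:012.2f}"

print(pad_price(98))     # 000000098.00
print(pad_price(108.5))  # 000000108.50

# String order now matches numeric order, which plain "98.00" vs "108.50" breaks:
print(pad_price(98) < pad_price(108.5))  # True
print("98.00" < "108.50")                # False
```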

#2 - Boolean String Field - Remember how I said I am storing a boolean system in my custom4 field? Take the number "010110050100" for example. By the way, this field is being stored as an integer data type now.

0 - Leading 0 needed for wildcard system to work
1 - Product Validated (0 or 1)
0 - Has External Photo (0 or 1)
1 - Has Internal Photo (0 or 1)
1 - Business Validated (0 or 1)
0 - Error Viewing Photo (0 or 1)
0 - Is Featured Product (0 or 1)
5 - Business Type ID (1 or 2 or 3 or 4 or 5)
0 - HTML Markup Error (0 or 1)
1 - Greenness Updated (0 or 1)
0 - Short Product Description (0 or 1)
0 - Ending 0 needed for wildcard system to work

At search time, if I want to find products that are not validated, I can do something like this: custom4:00?????????0

At search time, if I want to find products that are not validated and have a short product description, I can do this: custom4:00????????10
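As a sketch of how the packed field behaves (Python used purely for illustration; the position meanings follow the list above, and fnmatch stands in for SOLR's wildcard matching):

```python
# Each character position in custom4 is one boolean (or small enum) flag,
# and wildcard queries test individual positions.
import fnmatch

def build_flags(validated, external_photo, internal_photo, business_validated,
                photo_error, featured, business_type, html_error,
                greenness_updated, short_description):
    """Pack the flags into a 12-character string with leading/trailing zeros."""
    bits = [0, validated, external_photo, internal_photo, business_validated,
            photo_error, featured, business_type, html_error,
            greenness_updated, short_description, 0]
    return "".join(str(b) for b in bits)

flags = build_flags(1, 0, 1, 1, 0, 0, 5, 0, 1, 0)
print(flags)  # 010110050100

# Equivalent of custom4:00?????????0 ("not validated"); fnmatch's ? matches
# one character, like SOLR's single-character wildcard:
print(fnmatch.fnmatch(flags, "00?????????0"))  # False, this product is validated
```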

Both of these changes allow for the collection to be smaller. On another note, here is version 1.4 of my SOLR_SEARCH:

<cfsetting requestTimeOut = "60">

<cfset SOLR_SERVER_HOST = 'localhost'>
<cfset SOLR_SERVER_PORT = '8983'>
<cfset SOLR_SERVER_VERSION = '2.2'>

<cffunction name="SOLR_SEARCH" returnType="struct">

  <cfargument name="collection" type="string" required="true" default="">
  <cfargument name="criteria" type="string" required="false" default="*:*">
  <cfargument name="startrow" type="string" required="false" default="0">
  <cfargument name="maxrows" type="string" required="false" default="10">
  <cfargument name="sort" type="string" required="false" default="">
  <cfargument name="columns" type="string" required="false" default="">
  <cfargument name="facet" type="string" required="false" default="">
  <cfargument name="timeout" type="string" required="false" default="10">
  <cfargument name="timeAllowed" type="string" required="false" default="10">
  <cfset returnStruct = StructNew()>
  <cfif NOT IsNumeric(timeout) OR timeout LT 1>
    <cfset timeout = 10>
  </cfif>

  <cfif NOT IsNumeric(timeAllowed) OR timeAllowed LT 1>
    <cfset timeAllowed = 10>
  </cfif>

  <cfif NOT IsNumeric(startrow) OR startrow LT 0>
    <cfset startrow = 0>
  </cfif>

  <cfif NOT IsNumeric(maxrows) OR maxrows LT 1>
    <cfset maxrows = 1>
  </cfif>

  <cfif criteria EQ "">
    <cfset criteria = "*:*">
  </cfif>
  <cfset myURL = 'http://#SOLR_SERVER_HOST#:#SOLR_SERVER_PORT#/solr/#collection#/select/'>
  <cfset myURL = '#myURL#?q=#criteria#&version=#SOLR_SERVER_VERSION#&start=#startrow#&rows=#maxrows#&indent=off'>

  <cfif IsDefined("sort") AND sort NEQ "">
    <cfset myURL = '#myURL#&sort=#sort#'>
  </cfif>

  <cfif IsDefined("columns") AND columns NEQ "">
    <cfset myURL = '#myURL#&fl=#columns#'>
  </cfif>

  <cfif IsDefined("facet") AND facet NEQ "">
    <cfset myURL = '#myURL#&facet=true&#facet#'>
  </cfif>

  <cfif IsDefined("timeAllowed") AND timeAllowed NEQ "">
    <cfset myURL = '#myURL#&timeAllowed=#timeAllowed#'>
  </cfif>
  <cfhttp method="Get" url="#myURL#" resolveurl="Yes" timeout="#timeout#" throwonerror="yes" />

  <cfset xmlfile=XMLParse(cfhttp.FileContent)>

  <cfset numResultsFound = #xmlfile.XmlRoot.XmlChildren[2].XmlAttributes.numFound#>
  <cfset numResultsReturned = #ArrayLen(xmlfile.XmlRoot.XmlChildren[2].XmlChildren)#>

  <cfset partialResults = 'false'>
  <cfset queryFieldList = ''>
  <cfset queryStatus = ''>
  <cfset queryTime = ''>

  <cfloop index="i" from="1" to="#ArrayLen(xmlfile.XmlRoot.XmlChildren[1].XmlChildren)#">
    <cfif xmlfile.XmlRoot.XmlChildren[1].XmlChildren[i].XmlAttributes.name EQ "params">
      <cfloop index="j" from="1" to="#ArrayLen(xmlfile.XmlRoot.XmlChildren[1].XmlChildren[i].XmlChildren)#">
        <cfif xmlfile.XmlRoot.XmlChildren[1].XmlChildren[i].XmlChildren[j].XmlAttributes.name EQ "fl">
          <cfset queryFieldList = xmlfile.XmlRoot.XmlChildren[1].XmlChildren[i].XmlChildren[j].XmlText>
        </cfif>
      </cfloop>
    <cfelseif xmlfile.XmlRoot.XmlChildren[1].XmlChildren[i].XmlAttributes.name EQ "status">
      <cfset queryStatus = xmlfile.XmlRoot.XmlChildren[1].XmlChildren[i].XmlText>
    <cfelseif xmlfile.XmlRoot.XmlChildren[1].XmlChildren[i].XmlAttributes.name EQ "QTime">
      <cfset queryTime = xmlfile.XmlRoot.XmlChildren[1].XmlChildren[i].XmlText>
    <cfelseif xmlfile.XmlRoot.XmlChildren[1].XmlChildren[i].XmlAttributes.name EQ "partialResults">
      <cfset partialResults = xmlfile.XmlRoot.XmlChildren[1].XmlChildren[i].XmlText>
    </cfif>
  </cfloop>

  <cfif queryFieldList NEQ "">
    <cfset xmlqry = QueryNew(queryFieldList)>
    <cfif numResultsReturned NEQ 0>

      <cfset xmlqryRows = QueryAddRow(xmlqry,numResultsReturned)>

      <cfloop index="i" from="1" to="#numResultsReturned#">
        <cfif ArrayLen(xmlfile.XmlRoot.XmlChildren[2].XmlChildren[i].XmlChildren) NEQ 0>
          <cfloop index="j" from="1" to="#ArrayLen(xmlfile.XmlRoot.XmlChildren[2].XmlChildren[i].XmlChildren)#">
            <cfloop list="#queryFieldList#" index="k">
              <!--- XmlAttributes must stay a struct for StructFindValue to work --->
              <cfset myElement = xmlfile.XmlRoot.XmlChildren[2].XmlChildren[i].XmlChildren[j].XmlAttributes>
              <cfif ArrayLen(StructFindValue(myElement,k)) NEQ 0>
                <cfif k EQ "spell">
                  <cfset myElement = xmlfile.XmlRoot.XmlChildren[2].XmlChildren[i].XmlChildren[j].XmlChildren[1].XmlText>
                  <cfset xmlqryCell = QuerySetCell(xmlqry,k,myElement,i)>
                <cfelseif k NEQ "category">
                  <cfset myElement = xmlfile.XmlRoot.XmlChildren[2].XmlChildren[i].XmlChildren[j].XmlText>
                  <cfset xmlqryCell = QuerySetCell(xmlqry,k,myElement,i)>
                <cfelse>
                  <cfset catList = xmlfile.XmlRoot.XmlChildren[2].XmlChildren[i].XmlChildren[j].XmlChildren[1].XmlText>
                  <cfset xmlqryCell = QuerySetCell(xmlqry,k,catList,i)>
                </cfif>
              </cfif>
            </cfloop>
          </cfloop>
        </cfif>
      </cfloop>
    </cfif>

    <cfset xmlqry2 = QueryNew("RESULTS_FOUND,QUERY_STATUS,QUERY_TIME,RESULTS_RETURNED,PARTIAL_RESULTS")>
    <cfset xmlqryRows = QueryAddRow(xmlqry2,1)>
    <cfset xmlqryCell = QuerySetCell(xmlqry2,"RESULTS_FOUND","#numResultsFound#",1)>
    <cfset xmlqryCell = QuerySetCell(xmlqry2,"QUERY_STATUS","#queryStatus#",1)>
    <cfset xmlqryCell = QuerySetCell(xmlqry2,"QUERY_TIME","#queryTime#",1)>
    <cfset xmlqryCell = QuerySetCell(xmlqry2,"RESULTS_RETURNED","#numResultsReturned#",1)>
    <cfset xmlqryCell = QuerySetCell(xmlqry2,"PARTIAL_RESULTS","#partialResults#",1)>

    <cfset xmlqry3 = QueryNew("QUERY_TIMEOUT,NULL_FIELDLIST")>
    <cfset xmlqryRows = QueryAddRow(xmlqry3,1)>
    <cfset xmlqryCell = QuerySetCell(xmlqry3,"QUERY_TIMEOUT","0",1)>
    <cfset xmlqryCell = QuerySetCell(xmlqry3,"NULL_FIELDLIST","0",1)>
    <cfset returnStruct[0] = xmlqry3>
    <cfset returnStruct[1] = xmlqry2>
    <cfset returnStruct[2] = xmlqry>
  <cfelse>
    <cfset xmlqry3 = QueryNew("QUERY_TIMEOUT,NULL_FIELDLIST")>
    <cfset xmlqryRows = QueryAddRow(xmlqry3,1)>
    <cfset xmlqryCell = QuerySetCell(xmlqry3,"QUERY_TIMEOUT","0",1)>
    <cfset xmlqryCell = QuerySetCell(xmlqry3,"NULL_FIELDLIST","1",1)>
    <cfset returnStruct[0] = xmlqry3>
  </cfif>

  <cfreturn returnStruct>

</cffunction>

<cfparam name="URL.SEARCH_FOR" default="" type="string">

<cfset args = StructNew()>
<cfset args.collection = "products_id2">
<cfset args.startrow = 0>
<cfset args.maxrows = 10>
<cfset args.columns = 'key,custom1,custom2,custom3,custom4,summary,title,spell'>
<cfset args.sort = 'score+desc'>
<cfset args.criteria = '#URL.SEARCH_FOR#'>
<cfset args.timeout = 5>
<cfset args.timeAllowed = 3000>

<cfinvoke method="SOLR_SEARCH" argumentCollection="#args#" returnVariable="res">

<cfdump var="#res#">

When you submit a document to SOLR, it stores a short version of the description in the summary field and the entire version in the spell field.

Now that I am returning the entire product description, price, product ID, and product name, I no longer have to query my database after retrieving the key value. This makes the search even faster. Time to reindex yet again (:

Travis Walters

RE: Database Query Search Time - green-watch.org - 08-29-2010 12:30 PM

Hey Guys,

It appears that CFINDEX may only delete one document at a time unless you pass a query of keys to it like this:

<CFSEARCH COLLECTION="training" NAME="DeleteCertainKeys" CRITERIA="">

Why should you have to do a CFSEARCH just to delete documents when you can do something like this instead:

<cfset myURL = 'http://#SOLR_SERVER_HOST#:#SOLR_SERVER_PORT#/solr/#collectionName#/update?stream.body='>
<cfset myURL = '#myURL##URLEncodedFormat("<delete><query>custom2:54</query></delete>")#'>
<cfhttp method="Get" url="#myURL#" resolveurl="Yes" throwonerror="yes" />
<cfset myURL = 'http://#SOLR_SERVER_HOST#:#SOLR_SERVER_PORT#/solr/#collectionName#/update?stream.body=%3Ccommit/%3E'>
<cfhttp method="Get" url="#myURL#" resolveurl="Yes" throwonerror="yes" />

I think when the CFINDEX action parameter is set to delete, you should be able to pass in a query instead of just a key.
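For comparison, the same two requests can be sketched outside CFML (Python for illustration; host, port, and collection name are placeholders mirroring the CFML variables):

```python
# Delete-by-query against SOLR's update handler, followed by a commit so the
# deletion becomes visible to searchers. Mirrors the stream.body trick above.
from urllib.parse import quote

host, port, collection = "localhost", "8983", "products_id2"
base = f"http://{host}:{port}/solr/{collection}/update?stream.body="

# First request: delete every document matching the query.
delete_url = base + quote("<delete><query>custom2:54</query></delete>")
# Second request: commit the change.
commit_url = base + quote("<commit/>")

print(delete_url)
print(commit_url)
```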

Here is version 1.0 of my SOLR_INDEX function:

<cffunction name="SOLR_INDEX" returnType="void">

  <cfargument name="collection" type="string" required="true" default="">
  <cfargument name="action" type="string" required="true" default="">
  <cfargument name="criteria" type="string" required="false" default="">

  <cfif criteria EQ "">
    <cfset criteria = "*:*">
  </cfif>

  <cfif action EQ "delete">
    <cfset myURL = 'http://#SOLR_SERVER_HOST#:#SOLR_SERVER_PORT#/solr/#collection#/update?stream.body='>
    <cfset myURL = '#myURL##URLEncodedFormat("<delete><query>#criteria#</query></delete>")#'>
    <cfhttp method="Get" url="#myURL#" resolveurl="Yes" throwonerror="yes" />
    <cfset myURL = 'http://#SOLR_SERVER_HOST#:#SOLR_SERVER_PORT#/solr/#collection#/update?stream.body=%3Ccommit/%3E'>
    <cfhttp method="Get" url="#myURL#" resolveurl="Yes" throwonerror="yes" />
  </cfif>

</cffunction>


<cfset args = StructNew()>
<cfset args.collection = "#prodCollectionName2#">
<cfset args.action = "delete">
<cfset args.criteria = 'category:78'>

<cfinvoke method="SOLR_INDEX" argumentCollection="#args#">

Travis Walters

RE: Database Query Search Time - green-watch.org - 09-20-2010 04:39 AM

Hey Guys,

I have taken a step back from the SOLR implementation to work on the scripts leading up to the indexing of the products. These scripts include importing zip files from Google, unzipping those files and creating product CSV files, importing the CSV files into a database table, processing that information into the main product table, and then calculating the greenness of the products. My ultimate goal is to get all of these scripts working in such a way that products are updated at least once a week.

If anyone has to import anything containing millions and millions of records, you may be better off doing a bulk insert from a CSV file like this:

<cfquery datasource="#dsnName#">
BULK INSERT <!--- target table name lost from the original post --->
FROM 'F:\products\feeds\unzipped\testCSV.csv'
</cfquery>

On another note, my greenness script seems to be getting slower as more and more products are updated. I think this is due to this query:

<cfquery datasource="#dsnName#" maxrows="1">
UPDATE PRODUCT_GREENNESS
SET PRODUCT_GREENNESS.PRODUCT_HTML_PENALTY_ENABLED = <cfqueryparam cfsqltype="CF_SQL_NUMERIC" value="#invalidCharsFound#">,
<!--- second column name lost from the original post --->
<cfqueryparam cfsqltype="CF_SQL_NUMERIC" value="#Len(PRODUCT_INFO.PRODUCT_DESCRIPTION)#">
</cfquery>

PRODUCT_GREENNESS is a view on my PRODUCT table that displays rows where GREENNESS_UPDATED equals 0. Changes made through the PRODUCT_GREENNESS view are reflected in the PRODUCT table. Since the record I want to edit gets farther from the top as the script executes, each update takes longer than the last.

Is there any way in SQL to say update where this = this, starting the search at some row number? By the way, I am using MS SQL Server 2005 as my database.

Travis Walters

RE: Database Query Search Time - green-watch.org - 09-28-2010 09:44 PM

Hey Guys,

This post will explain how to set up an extremely simple DataImportHandler for SOLR using a MS SQL Server 2005 database.

First, you will need to add some code to the solrconfig.xml file as follows:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">F:\collections\products_id8\conf\data-config.xml</str>
  </lst>
</requestHandler>

The data-config.xml will be exactly as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource" name="GreenEarth3" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
    url="jdbc:sqlserver://localhost;databaseName=myDSNName;responseBuffering=adaptive;" user="myUsername" password="myPassword" />
  <document name="products">
    <entity name="myProduct" dataSource="GreenEarth3" pk="PRODUCT_ID" query="SELECT PRODUCT_ID FROM TESTPRODUCT">
      <field column="PRODUCT_ID" name="PRODUCT_ID" />
    </entity>
  </document>
</dataConfig>

In the schema.xml file, I have the <fields> section as follows:

<fields>
  <field name="PRODUCT_ID" type="integer" indexed="true" stored="true" required="true" />
  <field name="contents" type="text" indexed="true" stored="false" required="false" multiValued="true" omitNorms="true" />
</fields>


You will also need to get a copy of the JAR file found here:

The sqljdbc.jar file must be placed in "C:\ColdFusion9\solr\lib" or wherever you installed SOLR on your server.

After you set all of this up, I suggest going here on your server:

From there, you can do full imports, delta imports, and much more.

Once I explore this a bit more, I will talk about delta imports and some other advanced stuff. It will be interesting to see how much faster all of the information can be indexed using this method.

Travis Walters