  1. #1
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Datafeed Import Optimization
    We haven't really had an advanced discussion on optimizing datafeed imports. I'm getting to the point where I need to start working on this. My larger datafeeds with over 100k products are killing my system resources and taking much too long to process. My system downloads the datafeed and updates each individual product one at a time.

    What I'm considering doing to speed things up is to reduce the number of queries I execute. Right now I run one query to find the existing product in the database, then another to either update or insert it. I think if I batch up maybe 10 to 25 SKUs and build a single query to find the existing products, this may speed things up. So rather than doing this:

    Code:
    select prod_id from products where sku='ABCD123'
    I'll do this:

    Code:
    select prod_id, sku from products where sku='ABCD123' or sku='ABCD124' or sku='ABCD134' ...
    This will cut the number of queries since I'm doing one select per batch instead of one per product. My tables are all properly indexed, so that optimization has already been covered. I've got other queries that could be consolidated as well, so a combination of things will probably improve performance. What other suggestions do you have that'll speed up this process?
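    Something along these lines is what I have in mind - a rough sketch, not my production code (the feed-reading and update/insert routines are placeholders):

    Code:
       use DBI;
       my $dbh = DBI->connect("dbi:mysql:dbname=shop", "user", "pass", { RaiseError => 1 });

       my @batch;   # accumulate feed records, then look up their SKUs in one query
       while (my $rec = read_feed_record()) {   # placeholder for the feed parser
          push @batch, $rec;
          if (@batch == 25) { process_batch(\@batch); @batch = (); }
       }
       process_batch(\@batch) if @batch;

       sub process_batch {
          my ($batch) = @_;
          # One select per batch instead of one per product:
          my $where = join(' or ', ('sku=?') x @$batch);
          my %existing = map { $_->[1] => $_->[0] } @{ $dbh->selectall_arrayref(
             "select prod_id, sku from products where $where",
             undef, map { $_->{sku} } @$batch
          ) };
          for my $rec (@$batch) {
             if (exists $existing{ $rec->{sku} }) { update_product($dbh, $rec) }   # placeholder
             else                                 { insert_product($dbh, $rec) }   # placeholder
          }
       }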

    Thanks,
    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  2. #2
    Not Verif-Lidated infoTim's Avatar
    Join Date
    January 18th, 2005
    Location
    Sunny Florida
    Posts
    1,021
    I periodically go through and check every query in my code using EXPLAIN on a live database to see what the bottlenecks might be and to try out alternate queries. (This is in PostgreSQL; I imagine MySQL has something similar.)

    It tells you what the query optimizer is doing and what the cost of the query is.
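
    For example, from a Perl script (a sketch - the connection details are placeholders):

    Code:
       use DBI;
       my $dbh = DBI->connect("dbi:Pg:dbname=mydb", "user", "pass", { RaiseError => 1 });

       # EXPLAIN returns the planner's steps and cost estimates as rows of text.
       my $plan = $dbh->selectcol_arrayref(
          "EXPLAIN SELECT prod_id FROM products WHERE sku = 'ABCD123'"
       );
       print "$_\n" for @$plan;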
    Tim
    consultant by day, affiliate by night

  3. #3
    Member SeanW's Avatar
    Join Date
    July 28th, 2005
    Posts
    69
    Quote Originally Posted by infoTim
    I periodically go through and check every query in my code using EXPLAIN on a live database to see what the bottlenecks might be and to try out alternate queries. (This is in PostgreSQL; I imagine MySQL has something similar.)
    With MySQL I use mysqlard to graph key metrics, and it also makes suggestions on server tuning. I also enable slow query logging to see which queries are taking a long time to execute or are requiring full joins, and then I jump in with EXPLAIN on those queries.
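
    For anyone who wants to turn that on, the slow query log is enabled in my.cnf with something like this (option names vary by MySQL version, so check the manual for yours):

    Code:
       [mysqld]
       log-slow-queries = /var/log/mysql/slow-queries.log
       long_query_time = 2
       log-queries-not-using-indexes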

    Sean

  4. #4
    Not Verif-Lidated infoTim's Avatar
    Join Date
    January 18th, 2005
    Location
    Sunny Florida
    Posts
    1,021
    Not to change the subject, but how's MySQL these days? I switched over to PostgreSQL about five years ago because at the time it could be tuned to perform much faster and had a lot of "big database" features, and it's been doing well for me. I've replaced some big, expensive Oracle and MS SQL Server installations with Postgres for some of my clients, and they've been happy with the results.

    These days so much stuff is written for the LAMP environment and not LAPP, and I kind of wonder if I'm missing out. On the other hand, if it ain't broke...
    Tim
    consultant by day, affiliate by night

  5. #5
    Moderator MichaelColey's Avatar
    Join Date
    January 18th, 2005
    Location
    Mansfield, TX
    Posts
    16,232
    Here's one thing I do to keep the datafeed update routines (and any other major processes) from hogging system resources. After every 500 products, check the system load. If the system load is higher than some threshold, pause several seconds and check again. This will effectively pause the datafeed updates until the system load is at a reasonable level.

    Here's the relevant section of code, in Perl on a Linux system:

    Code:
    while ($line = <FN>) {
       $recs++;
       # Every 500 records, check the load average and back off if the box is busy.
       if ($recs % 500 == 0) {
          # Grab the integer part of the 1-minute load average from uptime's output.
          my $t = `uptime`; $t =~ m/: ([0-9]*)/; my $load = $1;
          while ($load > 3) {
             my $pause = $load * 3;   # pause longer when the load is higher
             print "Pausing $pause seconds after $recs records because load is $load.\n" if $debug == 1;
             sleep $pause;
             $t = `uptime`; $t =~ m/: ([0-9]*)/; $load = $1;
          }
       }
       ### Process Record
    }
    Michael Coley
    Amazing-Bargains.com
     Affiliate Tips | Merchant Best Practices | Affiliate Friendly? | Couponing | CPA Networks? | ABW Tips | Activating Affiliates
    "Education is the most powerful weapon which you can use to change the world." Nelson Mandela

  6. #6
    Newbie dtrenck's Avatar
    Join Date
    January 18th, 2005
    Posts
    33
    Scott,

    I'd write your query a little differently, using IN - it's more efficient:

    Code:
    select prod_id, sku from products where sku in ('ABCD123','ABCD124','ABCD125')

    - Dave

  7. #7
    Not Verif-Lidated infoTim's Avatar
    Join Date
    January 18th, 2005
    Location
    Sunny Florida
    Posts
    1,021
    That's a great tip, Michael!

    One thing I do: on one of my sites there's a half-dozen cron jobs that run at various times to import feeds, update reverse dictionaries, grab reports, and so on. If one happens to run long, they can overlap and really bog things down, so I have each one check, set, and remove a global semaphore file that blocks any of the others from running.
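
    One way to do the semaphore is with flock instead of creating and removing the file by hand - a minimal sketch (the lock file path is arbitrary):

    Code:
       use Fcntl qw(:flock);

       # Try to take an exclusive, non-blocking lock; if another job holds it, bail out.
       open(my $lock, '>', '/tmp/feed-jobs.lock') or die "can't open lock file: $!";
       unless (flock($lock, LOCK_EX | LOCK_NB)) {
          exit 0;   # another cron job is still running, so skip this run
       }

       # ... import feeds, update reverse dictionaries, grab reports ...

       # The lock is released automatically when the script exits.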
    Tim
    consultant by day, affiliate by night

  8. #8
    Moderator MichaelColey's Avatar
    Join Date
    January 18th, 2005
    Location
    Mansfield, TX
    Posts
    16,232
    I do something similar to that, too, Tim. On non-critical cron jobs, I skip the run if the previous one is still running or if the load average is extremely high.

    I also have features built in that cause my sites to degrade gracefully as the load average goes up. First, the search function is disabled. Then, the rotating banners are replaced with static banners. Then, the banners are removed entirely. Then, all graphics are suppressed. This lets my web servers handle peaks of 2-3 times as much traffic as they could without the mods. It comes in really handy when you get a TV or radio mention, or a mention on a really popular blog or forum.
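
    A sketch of how the gating might look (these threshold numbers are made up for illustration, not my actual cutoffs):

    Code:
       # Grab the 1-minute load average and gate features on it.
       my ($load) = `uptime` =~ /load average.*?([0-9.]+)/;

       my %enabled = (
          search          => $load < 4,    # first thing to go
          rotating_banner => $load < 6,    # fall back to static banners
          banners         => $load < 8,    # then drop banners entirely
          graphics        => $load < 10,   # finally suppress all graphics
       );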
    Michael Coley
    Amazing-Bargains.com
     Affiliate Tips | Merchant Best Practices | Affiliate Friendly? | Couponing | CPA Networks? | ABW Tips | Activating Affiliates
    "Education is the most powerful weapon which you can use to change the world." Nelson Mandela

  9. #9
    Member
    Join Date
    October 23rd, 2005
    Location
    NYC
    Posts
    65
    Quote Originally Posted by MichaelColey
    I also have features built in that cause my sites to degrade gracefully as the load average goes up.
    very nice ideas ;-)

  10. #10
    Newbie
    Join Date
    February 6th, 2005
    Location
    North Carolina
    Posts
    16
    CSV to SQL file
    This is a tool I made over the weekend: CSV to SQL.

    You just upload your csv file, enter any name for the table, and then download a zip file with your new sql file ready to run on your sql db. Any comments or suggestions are welcome.
    Tommy

    http://www.data-feed.net/csv2sql/

  11. #11
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Quote Originally Posted by scriptfeed
    This is a tool I made over the weekend: CSV to SQL.

    You just upload your csv file, enter any name for the table, and then download a zip file with your new sql file ready to run on your sql db. Any comments or suggestions are welcome.
    Tommy

    http://www.data-feed.net/csv2sql/
    This is a little off topic as we're discussing automating datafeed management and using fewer system resources.

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  12. #12
    Full Member heisje's Avatar
    Join Date
    January 18th, 2005
    Posts
    314


    off topic, but relevant:

    if you run hundreds of large feed-based sites, and attempt to update / manage them on-server, do you totally bog down your server resources or not?

    just wondering . . .

    heisje



  13. #13
    Member
    Join Date
    October 23rd, 2005
    Location
    NYC
    Posts
    65
    hey, I don't run large database sites, but I really don't see the problem...

    update then swap in new for old...for example, say I have feed_A.db ...can't you just do this:

    run update to feed_B.db
    complete update
    move feed_B.db to feed_A.db?

    then you really only have one second when feed_B.db is overwriting feed_A.db...hmmm...but I am sure it is more complicated than this
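
    The MySQL version of that swap would be an atomic table rename - a sketch (table names follow my example above):

    Code:
       use DBI;
       my $dbh = DBI->connect("dbi:mysql:dbname=shop", "user", "pass", { RaiseError => 1 });

       # Build the new copy off to the side while feed_a keeps serving pages.
       $dbh->do("CREATE TABLE feed_b LIKE feed_a");
       # ... import the fresh feed into feed_b ...

       # One atomic statement swaps them, so readers never see a half-loaded table.
       $dbh->do("RENAME TABLE feed_a TO feed_old, feed_b TO feed_a");
       $dbh->do("DROP TABLE feed_old");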

  14. #14
    Moderator MichaelColey's Avatar
    Join Date
    January 18th, 2005
    Location
    Mansfield, TX
    Posts
    16,232
    Quote Originally Posted by heisje
    if you run hundreds of large feed-based sites, and attempt to update / manage them on-server, do you totally bog down your server resources or not?
    If you don't do it right, you can bog down your server(s). That's what this thread is about.
    Michael Coley
    Amazing-Bargains.com
     Affiliate Tips | Merchant Best Practices | Affiliate Friendly? | Couponing | CPA Networks? | ABW Tips | Activating Affiliates
    "Education is the most powerful weapon which you can use to change the world." Nelson Mandela

  15. #15
    Full Member
    Join Date
    January 18th, 2005
    Posts
    396
    I still build my static pages on my local computer using datafeeds I download. I ZIP up these pages and ftp them to my server, then a late-night cron job on the server unzips them, something of the form 'nice -n19 unzip ...'. Nice -n19 runs my unzip job at the lowest priority on the Linux server, so if any other work needs to run, my job yields and resumes when the other work finishes. As a practical matter this only doubles my unzipping time, and it keeps the hosting company from venting on me.
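
    The crontab entry looks something like this (the paths are placeholders):

    Code:
       # 3:15 AM daily, at the lowest CPU priority so any other work runs first
       15 3 * * * nice -n19 unzip -o /home/me/pages.zip -d /var/www/html >/dev/null 2>&1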

    Charles

  16. #16
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    I just had an idea. Considering the ideas suggested here, one could build a datafeed import script that runs regularly throughout the day. Maybe it can execute every 30 minutes only if it's not currently executing. Using Michael's solution to maintain a viable load, this can run constantly and not impede the buyer's experience. What would be ideal is to select the datafeed that has gone the longest without an update. Once that's complete, the next least recently updated feed can run in the following round, and so on. Nighttime updates have treated me well so far, but eventually, with enough feeds, the night isn't long enough.
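
    Roughly like this (just a sketch - the feeds table and its columns are made up):

    Code:
       use DBI;
       my $dbh = DBI->connect("dbi:mysql:dbname=shop", "user", "pass", { RaiseError => 1 });

       # Pick the feed that has gone the longest without an update.
       my ($feed_id) = $dbh->selectrow_array(
          "select feed_id from feeds order by last_updated asc limit 1"
       );

       # ... download and import that feed, throttling on load as in Michael's snippet ...

       $dbh->do("update feeds set last_updated = now() where feed_id = ?", undef, $feed_id);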

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  17. #17
    Not Verif-Lidated infoTim's Avatar
    Join Date
    January 18th, 2005
    Location
    Sunny Florida
    Posts
    1,021
    I do that for some feeds...run periodic updates throughout the day based on "fetch lists" in a database table that can be configured from an admin interface. To pull in a new set of products, I just add a new entry to the fetch list.
    Tim
    consultant by day, affiliate by night

  18. #18
    Lite On The Do, Heavy On The Nuts Donuts's Avatar
    Join Date
    January 18th, 2005
    Location
    Winter Park, FL
    Posts
    6,930
    I love this script for importing CSV to SQL:
    http://www.scriptsearch.com/details/6497.html

    It doesn't bog down for me, but my weekly upload is only ~7,000 rows / 5 MB.

    If bogging down is an issue, maybe a "pause" routine could be added to its architecture.

    As a relative newbie to php / sql / datafeeds, finding this upload script was a fantastic thing for me - free, works well, nice interface, no problems at all. If bogging / optimization is an issue, it seems like a (relatively) easy matter to add a few slow-down-the-process steps to it (with the author's permission).

  19. #19
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Donuts,

    I wanted to check out this script and see what it's all about, but the link you gave doesn't seem to work. It brings up the navigation with a blank body.

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  20. #20
    Lite On The Do, Heavy On The Nuts Donuts's Avatar
    Join Date
    January 18th, 2005
    Location
    Winter Park, FL
    Posts
    6,930
    The link I posted is now also not working for me. Sorry. I did check it before posting earlier today.

    Search google for "Matthew Lindley CSV import script" and you'll find other sources.

    Here's one:
    http://www.hotscripts.com/Detailed/18098.html

  21. #21
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Just read the blurb and it sounds pretty neat. I've actually created something very similar and it saves a ton of time. It was always a pain assigning datafeed columns to my own database schema. I had to download the feed into Excel and manually view it. Glad I don't have to do that anymore.

    It doesn't sound like this will improve automated import performance though. That's a whole different issue.

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  22. #22
    Moderator MichaelColey's Avatar
    Join Date
    January 18th, 2005
    Location
    Mansfield, TX
    Posts
    16,232
    Quote Originally Posted by Snib
    I just had an idea. Considering the ideas suggested here, one could build a datafeed import script that runs regularly throughout the day. Maybe it can execute every 30 minutes only if it's not currently executing. Using Michael's solution to maintain a viable load, this can run constantly and not impede the buyer's experience.
    That's basically what I do. For most of the networks, I check for new feeds every 30 minutes, process whatever new feeds I find, and throttle the updates so the loads stay reasonable. Basically, I end up with a database that is continually updated, and my pages access that database.
    Michael Coley
    Amazing-Bargains.com
     Affiliate Tips | Merchant Best Practices | Affiliate Friendly? | Couponing | CPA Networks? | ABW Tips | Activating Affiliates
    "Education is the most powerful weapon which you can use to change the world." Nelson Mandela

  23. #23
    Lite On The Do, Heavy On The Nuts Donuts's Avatar
    Join Date
    January 18th, 2005
    Location
    Winter Park, FL
    Posts
    6,930
    Quote Originally Posted by Snib
    I had to download the feed into Excel and manually view it.
    Whenever I used Excel, my CSV file format would get hosed. I tried all different ways of saving the file afterward and all of them gave me problems. Now I just avoid editing CSV datafeed files with Excel at all.

    I'm sure you guys can fix this issue in a snap - a training issue for me - but at this point I was so frustrated with Excel and its lack of help and specifics in this area that I won't change my habits. Phooey on Excel (when it comes to feeds).

  24. #24
    Affiliate Manager nish's Avatar
    Join Date
    July 6th, 2005
    Posts
    762
    Apart from the 'least updated feed' factor, if you know how frequently each merchant updates its feed, you can also give weights/priorities to each of the feeds.
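
    Folding those weights into the 'least updated' pick could look like this (a sketch - the weight column is hypothetical):

    Code:
       use DBI;
       my $dbh = DBI->connect("dbi:mysql:dbname=shop", "user", "pass", { RaiseError => 1 });

       # Scale each feed's staleness by its weight so high-priority merchants float up.
       my ($feed_id) = $dbh->selectrow_array(q{
          select feed_id from feeds
          order by (unix_timestamp() - unix_timestamp(last_updated)) * weight desc
          limit 1
       });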

    Honestly I'm pretty impressed by the sophistication that you folks exhibit in automating all the datafeed processing tasks. Automation is the key.

    cheers!
    -nishith

  25. #25
    ABW Ambassador FFoc's Avatar
    Join Date
    January 18th, 2005
    Posts
    1,015
    Quote Originally Posted by Donuts
    If bogging down is an issue, maybe a "pause" routine could be added to its architecture.
    A quick note - if you're using MySQL, you can use the LOAD DATA INFILE command to load data directly from a CSV file (or tab-delimited, pipe-delimited, etc.) with the LOW_PRIORITY modifier, which delays execution of the LOAD DATA statement until no other clients are reading from the table.
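
    For example (a sketch - the file path, delimiters, and column list are placeholders):

    Code:
       use DBI;
       my $dbh = DBI->connect("dbi:mysql:dbname=shop", "user", "pass", { RaiseError => 1 });

       # LOW_PRIORITY delays the load until no other clients are reading the table.
       $dbh->do(q{
          LOAD DATA LOW_PRIORITY INFILE '/tmp/feed.csv'
          INTO TABLE products
          FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
          LINES TERMINATED BY '\n'
          (sku, name, price, url)
       });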
