  1. #1
    Member
    Join Date
    November 3rd, 2009
    Posts
    50
    Question Best way to import datafeed for performance
    I'm building a platform that imports datafeeds into my database, which will then be used by a variety of websites.
    The database will contain products from different networks, merchants, etc.

    Every time I import a feed I loop through all the products in it; for every product I check whether it's already in the database or needs to be updated, then I add/update it.

    This will put a lot of stress on the database.

    I've seen some examples which simply import everything into a temp table and, once finished, rename that table as the main one. This is surely good when you use a single datafeed, but when you have many datafeeds (hence you are updating just a fraction of the records) you would need to copy all the current main-table records to a temp table, import/update the feed, and then rename this one to be used as the main one.

    What could be a smart solution to keep good DB performance during feed updates?

  2. #2
    Moderator MichaelColey's Avatar
    Join Date
    January 18th, 2005
    Location
    Mansfield, TX
    Posts
    16,232
    What you're doing is the right way to do it. It's not THAT much stress.

    A couple of additional things I would recommend:

    1) Put a "circuit breaker" in it. After every 500 or 1000 records, either have it pause for a few seconds or have it check the system load and pause until it drops to a reasonable level.

    2) When you process the records for a merchant, make sure you handle deletes. What you described doesn't (although you might have just simplified the description). If there are records in your database that aren't in the new datafeed, those need to be "deleted". I suggest that you first identify the scope of the deletes, and if it's excessive (say, over 5% or 10% of the merchant's products), email a warning and abort. Also, it's often best to leave the deleted records in place and just flag them as deleted. Sometimes they come back, and perhaps you want to leave a placeholder on your site and just use AdSense or something on those pre-existing pages rather than deleting/breaking them. See the sketch below.
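
    A minimal MySQL sketch of that soft-delete pass, assuming a products table with merchant_id, sku, and status columns and a staging table feed_temp holding the new feed (all names here are illustrative, not the actual schema):
    Code:
    -- Check the scope of the deletes first; abort if the count looks excessive.
    SELECT COUNT(*) FROM products
    WHERE merchant_id = 123
      AND sku NOT IN (SELECT sku FROM feed_temp);

    -- Flag rather than delete, so products that "come back" can be revived.
    UPDATE products SET status = 0
    WHERE merchant_id = 123
      AND sku NOT IN (SELECT sku FROM feed_temp);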
    Michael Coley
    Amazing-Bargains.com
     Affiliate Tips | Merchant Best Practices | Affiliate Friendly? | Couponing | CPA Networks? | ABW Tips | Activating Affiliates
    "Education is the most powerful weapon which you can use to change the world." Nelson Mandela

  3. #3
    Member vicjg's Avatar
    Join Date
    January 12th, 2010
    Posts
    80
    Quote Originally Posted by MoreBeer View Post
    ...

    What could be a smart solution to keep good DB performance during feed updates?
    I would dump the new feed into a separate table all at once. I assume that both your database and the feeds use some sort of primary key (usually a combo of merchant ID + product ID). Then just run two queries:

    1. Inner join the two tables and update the products based on whatever criteria you're using to determine whether a record needs to be updated.

    2. Outer join the two tables. Whatever isn't in the main table can be inserted with one INSERT statement.

    There's no need to loop through the records at all.

  4. #4
    Member
    Join Date
    November 3rd, 2009
    Posts
    50
    Thanks for the suggestions.

    @MichaelColey: I was thinking of inserting breaks every 1000 records.
    As for deletes, I just created a status field in the database that I would use as 1 = active, 0 = not active, 2 = whatever else might be necessary, but I hadn't thought about how to handle deletes in detail.
    Good recommendation on checking the scope of the deletes.

    @vicjg: my DB primary key is the product ID (an auto-increment numeric field generated on my end, not related to the datafeeds).
    I can identify a specific product by feedID + sku (to check whether a product needs to be updated I check the lastupdated date for every datafeed line, searching by feedID + sku, where feedID is not the product ID but the ID of the feed, e.g. laptops by merchant XXX), but the feeds are plain text files with no ID before being processed and added to the database.

    Your solution seems very interesting, but I need some guidance to fully understand what you mean:

    Are you talking about a couple of nested queries, something like this?
    Assume 'tba' is my live products table, 'tbb' is a temp table where the feed has been imported, and tba.feedid is the feed ID in my database:
    Code:
    UPDATE tba
    INNER JOIN tbb ON tba.sku = tbb.sku
    SET tba.name = tbb.name,
    tba.manufacturer = tbb.manufacturer,
    tba.price = tbb.price
    ............
    WHERE tba.feedid = X
    AND tba.lastupdate < tbb.lastupdate
    and for inserting something like
    Code:
    INSERT INTO tba
    SELECT * FROM tbb
    WHERE tbb.sku NOT IN (SELECT sku FROM tba WHERE feedid = X)
    I hope what I wrote is not that confusing...
    I don't even know if something like this can be done.

    Would such queries be extremely resource intensive? Some feeds have tens of thousands of products.

  5. #5
    Moderator MichaelColey's Avatar
    Join Date
    January 18th, 2005
    Location
    Mansfield, TX
    Posts
    16,232
    A key of merchant_id (your internal ID for the merchant) and sku (the unique product number from the merchant) is probably best. An internal sequential ID is pretty useless.
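
    A minimal sketch of that composite key in MySQL, with illustrative column names (not the actual schema):
    Code:
    CREATE TABLE products (
        merchant_id INT NOT NULL,        -- your internal merchant ID
        sku VARCHAR(64) NOT NULL,        -- the merchant's unique product number
        name VARCHAR(255),
        price DECIMAL(10,2),
        lastupdate DATETIME,
        PRIMARY KEY (merchant_id, sku)   -- composite key; no sequential ID needed
    );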
    Michael Coley
    Amazing-Bargains.com
     Affiliate Tips | Merchant Best Practices | Affiliate Friendly? | Couponing | CPA Networks? | ABW Tips | Activating Affiliates
    "Education is the most powerful weapon which you can use to change the world." Nelson Mandela

  6. #6
    Newbie
    Join Date
    March 13th, 2010
    Posts
    20
    I maintain a table containing datafeed filenames and timestamps. The first check is to import a datafeed only if the timestamp has changed.
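
    A minimal MySQL sketch of that filename/timestamp gate, with assumed table and file names:
    Code:
    CREATE TABLE feed_files (
        filename VARCHAR(255) NOT NULL PRIMARY KEY,
        last_modified DATETIME NOT NULL
    );

    -- Before importing, compare the stored timestamp with the file's current one;
    -- skip the import when they match.
    SELECT last_modified FROM feed_files WHERE filename = 'merchant_x_feed.txt';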

    Secondly, the data is chunked, of course... you can't load 500MB datafeeds into memory.

    Each record is checked against its unique product code, and the values from the existing record are saved in an array. If no matching record is found, an insert happens.

    If an existing record is found, the values from the datafeed record are saved in a separate array. If the MySQL values array and the datafeed values array are not equal, the record needs to be updated.

    It works well. It's not blindingly fast... but it does its job without crashing the DB server.

  7. #7
    ABW Ambassador isellstuff's Avatar
    Join Date
    November 9th, 2005
    Location
    Virginia
    Posts
    1,659
    The best way to get a lot of rows into a table quickly is to avoid any intermediate index rebuilds. With MySQL, I use "LOAD DATA INFILE" and reference an external text file. I've also used another method at times, which is to lock the tables first and chunk the inserts without doing the batch load from file. The intent is basically the same: don't let the DB do anything until I'm done inserting.
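
    A minimal sketch of that bulk load, assuming a tab-delimited feed file with a header row and illustrative column names:
    Code:
    -- Load the whole feed in one pass, avoiding per-row INSERT overhead.
    LOAD DATA LOCAL INFILE '/path/to/feed.txt'
    INTO TABLE products_temp
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
    (sku, name, price, buy_url);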

    BTW, my feed architecture is not dynamic. I basically pre-build everything with no DB normalization (speed, speed, speed), then send a read-only copy into production. So my suggestions might not be best for feed architectures that require very regular insertion.

    My goal has always been to scale out my front end servers so that they may handle 100 or more requests per second in order to minimize my hardware costs.
    Merchants, any data you provide to Google Shopping should also be in your affiliate network datafeed. More data means more sales!

  8. #8
    Visual Artist & ABW Ambassador lostdeviant's Avatar
    Join Date
    September 7th, 2007
    Location
    Cuautitlán, Edo. de México
    Posts
    1,725
    "I've seen some examples which simply import everything to a temp table and, once finished, this table is renamed as the main one. This is surely good when you use a single datafeed, but when you have many datafeeds (hence you are updating just a fraction of the records) you would need to copy all the current main table records to a temp table, import/update the feed and then rename this one to be used as main one."

    No, you don't need to do that.
    You don't need to load all the main records.
    You could instead (see the sketch below):
    1. Make the temp table.
    2. Delete records matching the merchant you are importing from the main table.
    3. Insert the temp table into the main table.
    4. Drop the temp table.
    5. Optimize the main table (since the deleted entries create gaps).
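
    A minimal MySQL sketch of those five steps, with assumed table names and merchant ID:
    Code:
    CREATE TABLE products_temp LIKE products;          -- 1. make the temp table
    -- ...load the merchant's feed into products_temp here...
    DELETE FROM products WHERE merchant_id = 123;      -- 2. remove this merchant's old rows
    INSERT INTO products SELECT * FROM products_temp;  -- 3. move the fresh rows in
    DROP TABLE products_temp;                          -- 4. clean up
    OPTIMIZE TABLE products;                           -- 5. reclaim the gaps left by the deletes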

  9. #9
    Member vicjg's Avatar
    Join Date
    January 12th, 2010
    Posts
    80
    Quote Originally Posted by MoreBeer View Post
    ...

    Would such queries be extremely resource intensive? Some feeds have tens of thousands of products.
    I'll try to use a specific example to clear things up.

    I have a database table called "products" with a primary key of merchant_id + product_id; it contains a list of products.

    Now I get a datafeed from Merchant X (merchant_id=123). I'll dump the feed into a temp table called #products_123.

    In T-SQL I'd run something like this for the update:

    Code:
    UPDATE products
    SET Description = #products_123.Description  -- plus whatever other fields you want to update
    FROM products
    INNER JOIN #products_123 ON products.product_id = #products_123.product_id
    WHERE products.merchant_id = 123
    AND products.Timestamp < #products_123.Timestamp  -- if timestamp is the criterion for deciding a record changed

    The insert would look like this (this is only one way of doing it; an outer join would work too):

    Code:
    INSERT INTO products (merchant_id, product_id, ...)
    SELECT 123, product_id, ... FROM #products_123
    WHERE product_id NOT IN (SELECT product_id FROM products WHERE merchant_id = 123)

    BTW I'll be sending my consulting invoice# through PM

  10. #10
    Moderator MichaelColey's Avatar
    Join Date
    January 18th, 2005
    Location
    Mansfield, TX
    Posts
    16,232
    Be careful with anything that updates a large number of records in a "live" table as a group. Depending on what you're doing and how you're doing it, it can lock the table and cause any processes trying to access it to hang until the update is complete. Nothing worse than having a page take several minutes to render because an update is going on.
    Michael Coley
    Amazing-Bargains.com
     Affiliate Tips | Merchant Best Practices | Affiliate Friendly? | Couponing | CPA Networks? | ABW Tips | Activating Affiliates
    "Education is the most powerful weapon which you can use to change the world." Nelson Mandela

  11. #11
    Member
    Join Date
    November 3rd, 2009
    Posts
    50
    Thanks, guys.

    Quote Originally Posted by MichaelColey View Post
    ...
    Nothing worse than having a page take several minutes to render because an update is going on.
    That's indeed what I want to avoid.

  12. #12
    Member vicjg's Avatar
    Join Date
    January 12th, 2010
    Posts
    80
    I typically keep my "web content" separate from the master tables. For instance, I'll have a master table of all products, and a separate table for "products" for the website.

    This way, I can get rid of unneeded columns in the web tables to speed them up, and when it comes to updates I can do all the work on the master table, then at the last minute fill a temp table and rename it for the web.
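
    A minimal MySQL sketch of that last-minute swap, with assumed table and column names; RENAME TABLE swaps both names in one atomic step, so the site never sees a half-built table:
    Code:
    CREATE TABLE products_web_new LIKE products_web;
    INSERT INTO products_web_new
    SELECT merchant_id, sku, name, price, buy_url   -- only the columns the site needs
    FROM products_master;
    RENAME TABLE products_web TO products_web_old,
                 products_web_new TO products_web;
    DROP TABLE products_web_old;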

  13. #13
    Newbie Dave123's Avatar
    Join Date
    February 5th, 2010
    Location
    Seattle, WA
    Posts
    30
    Quote Originally Posted by MoreBeer View Post
    Thanks, guys.


    That's indeed what I want to avoid.
    Beware of the one-by-one update method you speak of. Depending on the arcane details of SQL, you could be locking a 'page' of data with your update statement, despite your efforts to avoid this. Sorry, I don't have a helpful answer, just a warning.

  14. #14
    Member soda's Avatar
    Join Date
    March 26th, 2010
    Location
    Irvine, CA
    Posts
    36
    updating existing feeds
    Quote Originally Posted by MoreBeer View Post
    This will put a lot of stress on the database
    Get a better hosting/server in this case.

    Re. updating existing feeds, here's my approach:

    1) Create (or truncate) a temporary products table for products from the current merchant's datafeed:

    Code:
    TRUNCATE TABLE products_data_temp
    P.S. The temp products table structure is identical to the master products table.

    2) Parse / process the datafeed into the temporary products table.

    3) In your current 'master' products table, set all in_stock info to 'no', e.g.

    Code:
    UPDATE products_data SET in_stock = 0 WHERE 1
    4) Add ONLY new unique products from the merchant's datafeed and ignore the others, e.g.

    Code:
    INSERT IGNORE INTO `products_data`
    (`hash`, `merchant_id`, `name`, `name_slug`, `brand`, `brand_slug`, `manufacturer`, `sku`, `upc`,
     `description`, `description_short`, `category`, `sub_category`, `product_group`, `keywords`,
     `name_keywords`, `price`, `retail_price`, `sale_price`, `clearance_item`, `currency`, `in_stock`,
     `condition`, `images`, `buy_url`)
    SELECT
    `hash`, `merchant_id`, `name`, `name_slug`, `brand`, `brand_slug`, `manufacturer`, `sku`, `upc`,
    `description`, `description_short`, `category`, `sub_category`, `product_group`, `keywords`,
    CONCAT(name, ' ', keywords) AS name_keywords, `price`, `retail_price`, `sale_price`,
    `clearance_item`, `currency`, `in_stock`, `condition`, `images`, `buy_url`
    FROM `products_data_temp` WHERE 1
    5) Update all product prices and in-stock info for the current merchant:

    Code:
    UPDATE `products_data`, `products_data_temp`
    SET `products_data`.`price` = `products_data_temp`.`price`,
        `products_data`.`retail_price` = `products_data_temp`.`retail_price`,
        `products_data`.`sale_price` = `products_data_temp`.`sale_price`,
        `products_data`.`clearance_item` = `products_data_temp`.`clearance_item`,
        `products_data`.`in_stock` = `products_data_temp`.`in_stock`
    WHERE `products_data`.`hash` = `products_data_temp`.`hash`
    Step 3 keeps all products but refreshes their in_stock information. This way you don't delete anything; you only make the in_stock info current.

  15. #15
    ABW Ambassador kse's Avatar
    Join Date
    November 29th, 2005
    Posts
    2,511
    I have only started working with whole datafeeds in the last few weeks, and I'm currently creating a new temp table from several feeds and overwriting my master.

    My question, which I don't see addressed above: what if records have been removed from the feed? If you don't overwrite the master table, how are you removing the products that are no longer included in the feeds? Did I miss something above?
    MERCHANTS: Start showing your coupons directly on your site, that way your shoppers will stop leaving your site looking for them!! If not then remove your Coupon Box!!

  16. #16
    Member soda's Avatar
    Join Date
    March 26th, 2010
    Location
    Irvine, CA
    Posts
    36
    Quote Originally Posted by kse View Post
    My question, which I don't see addressed above: what if records have been removed from the feed? If you don't overwrite the master table, how are you removing the products that are no longer included in the feeds? Did I miss something above?
    It's addressed in steps 3 and 5:

    1) You mark ALL products as 'not in stock'.

    2) You create the temp table with products from the new feeds (which logically contain only products that ARE in stock).

    3) You mark only the products that are present in the temp table as 'in stock'.

    For example, the original 'master' table has 3 products: A - in stock, B - in stock, C - in stock.
    You change their 'in stock' to 'not in stock'.
    Your datafeed has 2 products: A, C.
    You update A = in stock, C = in stock, and as a result you get:
    A - in stock, B - not in stock, C - in stock
