Results 1 to 16 of 16
March 12th, 2010, 08:15 PM #1
Best way to import datafeed for performance
- Join Date
- November 3rd, 2009
I'm building a platform which basically imports datafeeds into my database which will be used for a variety of websites.
The database will contain products from different network, merchants etc
Every time I import a feed I loop through all the products in it, for every product I check if it's already in the database or needs to be updated, then I add/update it.
This will put a lot of stress on the database
I've seen some examples which simply import everything to a temp table and, once finished, this table is renamed as the main one. This is surely good when you use a single datafeed, but when you have many datafeeds (hence you are updating just a fraction of the records) you would need to copy all the current main table records to a temp table, import/update the feed and then rename this one to be used as main one.
what could be a smart solution to keep good db performance during feed updates?
March 12th, 2010, 09:10 PM #2
What you're doing is the right way to do it. It's not THAT much of a stress.
A couple additional things I would recommend:
1) Put a "circuit breaker" in it. After every 500 or 1000 records, either have it pause for a few seconds or have it check the system load and pause until it drops to a reasonable level.
2) When you process the records for a merchant, make sure you handle deletes. What you described doesn't (although you might have just simplified the description). If there are records in your database that aren't in the new datafeed, those need to be "deleted". I suggest that you first identify the scope of the deletes, and if it's excessive (like over 5% or 10% of the merchant's products), email a warning and abort. Also, it's often best to leave the deletes and just flag them as deleted. Sometimes they come back, and perhaps you want to leave a placeholder on your site and just use AdSense or something on those pre-existing pages rather than deleting/breaking them.
March 13th, 2010, 12:59 AM #3
1. Inner join both tables and update the products based on whatever criteria you're using to determine if the record needs to be updated.
2. An outer join on the two tables. Whatever isn't in the main table can just all be inserted with one INSERT statement.
There's no need to loop through the records at all.
March 13th, 2010, 05:52 AM #4
- Join Date
- November 3rd, 2009
thanks for the suggestions
@MichaelColey: I was thinking to insert breaks every 1000 records
Instead as far as it regards deletes I just created a status field in the database that I would use as 1 = active, 0 = not active, 2 = whatever else might be necessary, but I had not thought about how to handle deletes in details.
Good recommendation for the scopes of the deletes.
@vicjg: my db primary key is the product ID ( auto increment numeric field generated on my end, not related to datafeeds)
I can identify a specific product by feedID + sku ( to check if a product needs to be updated I check the lastupdated date for every datafeed line searching for feedID + sku, where feedID is not the product id but the ID of the feed: eg. laptops by merchant XXX), but the feeds are plain text files with no id before being processed and added to the database.
your solution seems very interesting but I need some guidance to fully understand what you mean:
are you talking about a couple nested queries, something like this?
assume 'tba' is my live products table, 'tbb' is a temp table where the feed has been imported, tba.feedid is the feed ID in my database:
UPDATE tba SET tba.name = tbb.name, tba.manufacturer = tbb.manufacturer, tba.price = tbb.price ............ WHERE tba.id IN (SELECT tba.id FROM tba,tbb WHERE tba.feedid = X AND tba.sku = tbb.sku AND tba.lastupdate < tbb.lastupdate)
INSERT into tba ( SELECT * FROM tbb WHERE tbb.sku NOT IN (SELECT sku FROM tba WHERE feedid = X))
I don't even know if something like this can be done
would such queries be extremely resource intensive? some feeds have tens of thousands of products
March 13th, 2010, 09:56 AM #5
A key of merchant_id (your internal ID for the merchant) and sku (the unique product number from the merchant) is probably the best. An internal sequential ID is pretty useless.
March 13th, 2010, 11:18 AM #6
- Join Date
- March 13th, 2010
I maintain a table containing datafeed filenames and timestamps. The first check is to import a datafeed only if the timestamp has changed.
Secondly, the data is chunked of course... can't be loading 500mb datafeeds into memory.
The record is checked on unique product codes and the values from existing record is saved in array. If no matching record is found, and Insert operation happens.
If existing record is found, Values from datafeed record are saved in a separate array. If the Mysql values array and Datafeed values Array are not equal, then record needs to be updated.
Works well, its not blindingly fast... but does its job without crashing the db server.
March 13th, 2010, 11:33 AM #7
The best way to get a lot of rows into a table quickly is to avoid any intermediate index rebuilds. With MySQL, I use "LOAD DATA INFILE" and reference an external text file. I've also used another method at times, which is to lock the tables before and chunked the inserts without doing the batch load from file. The intent is basically the same, don't let the DB do anything until I'm done inserting.
BTW, my feed architecture is not dynamic. I basically pre-build everything with no DB normalization (speed, speed, speed), then send a read-only copy into production. So my suggestions might not be best for feed architectures that require very regular insertion.
My goal has always been to scale out my front end servers so that they may handle 100 or more requests per second in order to minimize my hardware costs.Merchants, any data you provide to Google Shopping should also be in your affiliate network datafeed. More data means more sales!
March 13th, 2010, 12:03 PM #8
- Join Date
- September 7th, 2007
- Cuautitlán, Edo. de México
"I've seen some examples which simply import everything to a temp table and, once finished, this table is renamed as the main one. This is surely good when you use a single datafeed, but when you have many datafeeds (hence you are updating just a fraction of the records) you would need to copy all the current main table records to a temp table, import/update the feed and then rename this one to be used as main one."
no, you don't need to do that.
You don't need to load all the main records.
you could 1. make the temp table, 2. delete records matching merchant you are importing from main table. 3. insert temp table into main table. 4. drop temp table. 5. optimize main table (since the deleted entries create gaps)
March 13th, 2010, 12:37 PM #9
I have a database table called "products". It has a primary key (merchant_id + product_id), that contains a list of products.
Now I get a datafeed from Merchant X (merchant_id=123). I'll dump the feed into a temp table called #products_123.
In T-SQL I'd run something like this for the update:
UPDATE products SET Description (whatever fields you wanted to update) =
#products_123.Description FROM products INNER JOIN #products_123 ON products.product_id=products_123.product_id
where products.merchant_id=123 AND products.Timestamp<#products_123.Timestamp (if timestamp was the criteria we'd use to determine if a record changed)
The insert would look like this: (this is only one way of doing it, an outer join would work too)
INSERT INTO products (merchant_id, product_id, ...)
SELECT 123,product_id, ... FROM #products_123 WHERE
product_id (NOT IN SELECT product_id FROM products where merchant_id=123)
BTW I'll be sending my consulting invoice# through PM
March 13th, 2010, 05:11 PM #10
Be careful with anything that updates a large number of records in a "live" table as a group. Depending on what you're doing and how you're doing it, it can lock the table and cause any processes trying to access it to hang until the update is complete. Nothing worse than having a page take several minutes to render because an update is going on.
March 16th, 2010, 02:05 PM #11
- Join Date
- November 3rd, 2009
March 16th, 2010, 02:09 PM #12
I typically keep my "web content" separate from the master tables. For instance, I'll have a master table of all products, and a separate table for "products" for the website.
This way, I can get rid of unneeded columns for the web tables and speed it up and when it comes to updates, I can do all the updates on the master and at the last minute, fill a temp table and rename it for the web.
April 15th, 2010, 02:26 AM #13
May 25th, 2010, 01:41 PM #14updating existing feeds
Re. updating existing feeds, here's my approach:
1) Create (or truncate) temporary products table for products from current merchant's datafeed:
TRUNCATE TABLE products_data_temp
2) Parse / process data table into temporary products table
3) In your current 'master' products table set all in_stock info to 'no', e.g.
UPDATE products_data SET in_stock = 0 WHERE 1
INSERT IGNORE INTO `products_data` (`hash`, `merchant_id`, `name`, `name_slug`, `brand`, `brand_slug`, `manufacturer`, `sku`, `upc`, `description`, `description_short`, `category`, `sub_category`, `product_group`, `keywords`, `name_keywords`, `price`, `retail_price`, `sale_price`, `clearance_item`, `currency`, `in_stock`, `condition`, `images`, `buy_url`) SELECT `hash`, `merchant_id`, `name`, `name_slug`, `brand`, `brand_slug`, `manufacturer`, `sku`, `upc`, `description`, `description_short`, `category`, `sub_category`, `product_group`, `keywords`, concat(name,' ',keywords) as name_keywords, `price`, `retail_price`, `sale_price`, `clearance_item`, `currency`, `in_stock`, `condition`, `images`, `buy_url` FROM `products_data_temp` WHERE 1
UPDATE `products_data`, `products_data_temp` SET `products_data`.`price` = `products_data_temp`.`price`, `products_data`.`retail_price` = `products_data_temp`.`retail_price`, `products_data`.`sale_price` = `products_data_temp`.`sale_price`, `products_data`.`clearance_item` = `products_data_temp`.`clearance_item`, `products_data`.`in_stock` = `products_data_temp`.`in_stock` WHERE `products_data`.`hash` = `products_data_temp`.`hash`
May 25th, 2010, 02:12 PM #15
I have only stated working with whole datafeeds the last few weeks and I currently creating a new Temp table from sevarl feeds and over writing my master.
My question is and I do not see it being addressed above is what if records have been removed from the feed if you do not overwrite the master feed how are you removing the products that are no longer included in the feeds??? Did I miss something above???MERCHANTS: Start showing your coupons directly on your site, that way your shoppers will stop leaving your site looking for them!! If not then remove your Coupon Box!!
May 25th, 2010, 02:26 PM #16
1) You mark ALL products as 'not in stock'
2) Create temp table with products from new feeds (in which you logically have products that ARE in stock)
3) You mark only products that are present in temp table as 'in stock'
For example, original 'master' table has 3 products: A - in stock, B - in stock, C - in stock;
You change their 'in stock' to 'not in stock';
Your datafeed has 2 poducts: A, C
You update A = in stock, C = in stock and as a result you get:
A - in stock, B - not in stock, C - in stock
By bluewaves1 in forum Programming / Datafeeds / ToolsReplies: 11Last Post: March 13th, 2008, 07:02 PM
By adsw in forum Programming / Datafeeds / ToolsReplies: 8Last Post: September 26th, 2006, 02:05 PM
By adsw in forum Midnight Cafe'Replies: 4Last Post: September 25th, 2006, 06:13 PM
By Snib in forum Programming / Datafeeds / ToolsReplies: 31Last Post: May 8th, 2006, 02:41 PM