I'm working on a product section for my company's web sites. We use CJ, ShareASale, AvantLink and Linkshare. Very focused product groups, specific for each of our sites.

Here's my approach to dealing with data feeds. And I am curious to see your opinions about my setup. FYI I'm using PHP/MySQL.


1. Import feeds from diff. merchants on diff. networks to a single temporary products table using data mapping. Remove duplicate products (with same SKU).

2. Insert / update master product table.

3. I manage a category tree with multiple sets of keywords (positive, negative) according to which I categorize products in data feed ('LIKE' matching to concatenated product name + description + keywords (if they are in the feed).

4. In products table I add row with 'cleaned up' product name by removing unnecessary symbols and lowering case.

5. Finding similar products:
5.1. I iterate through all products in master table (grouped by merchant -- as I assume that after removing duplicates one merchant's feed contains only distinct products). I compare them by Levenshtein distance between cleaned up product titles. For products with 0 distance I create 'master' instance and link other products from different merchants to this instance. If distance is 1 or 2 -- I save them for manual review.

5.2. Manual review -- basically I look at products with similar names and mark them as match or not match. Now with a small # of such similar products it's ok, but it's definitely not ready for massive amounts of matches if we have hundreds of thousands / millions of products.

5.3. For matching products that have a 'master' instance, I select properties (e.g. description, images etc) from feeds that have overall better quality of these properties, e.g. merchant A has better images, merchant B has better descriptions - so I use respective properties from these merchants in creating the 'master' product instance.

It works BUT has challenges:
- Categorizing is hard (especially initially) because I have to maintain sets of keywords that would do a good job categorizing. With a larger category tree it will become very difficult.
- Categorizing is long (SQL 'LIKE' is very slow even on fully indexed values)
- Manual product match confirmation is labor-intensive. I can surely switch to fully automated and use only 0-1 distance matches and assume they're same products -- but I will still end up with bad results (e.g. 'Model 2008' and 'Model 2009' have a distance of 1 but clearly they're not same). At the other hand, if I use very strict (0) distance as prerequisite for a match -- I will be missing many matches that have typos etc.

Anyway, this is pretty much how far I am at the moment.
Let me know if you have any good advises. Thank you!