
Thread: How are you updating your datafeed sites?

 
  #26  
October 5th, 2011, 03:38 PM
Newbie
Join Date: October 5th, 2011
Location: Texas, USA
Posts: 15
Quote:
Originally Posted by kse
I wondering how most people who have datafeed sites update their sites. ...
Thanks
Kevin……….
I am likely not like "most people", but here goes, for those who like a long read. Maybe some ideas can be extracted where appropriate.

Hmm.. Tricky, Tricky question to answer in general..

The correct answer for how to load data-feeds depends on how you plan to use them, and where you see your site(s) going longer term.
I currently load 350 datafeeds (and growing) of many sizes, checking for and downloading updated feeds every few hours, 100% automated but under my watchful eye.
Here are some of my experiences from building my current system.
  • Loading a single datafeed, or a couple of them, can be handled by a manual download/extract, but:
  • If you run your site on a shared web-server, large feeds will likely run into memory limits, CPU limits, execution-time limits, and the like. I got my first dedicated server for that reason. Some feeds simply do not load efficiently within shared-server limitations.
  • Many contracts you "signed" when joining merchants contain rules for keeping products and prices updated on your site. As recently as yesterday, another merchant of mine "expired" all their affiliates in order to add rules about update frequency to the contract, which had to be accepted to re-join. That is one more thing to keep up with before my automation detects the "lost" merchant and removes all their products. Apparently too many people are passing customers to the merchant's site based on old pricing, which of course ticks off those customers when they arrive and see a higher price than they saw on the affiliate web-site.
  • Keeping product information and prices updated for large and/or many data-feeds will create huge churn and bog down your web-server, UNLESS you prepare the data beforehand. (See below.) So a simple "extract" into a new table, renamed into place, only works for small amounts of data, and is in my experience not a long-term solution if you plan to "grow".
  • As mentioned many other places on this site, many merchants provide data-feeds that can be garbage, violate the affiliate network descriptions, or seem to have faulty extract mechanisms.
    Examples:
    Some (especially book stores in my experience) pass along some strange extract that changes every day. (The criteria might be "top sellers", "low price", "junk we want to get rid of", whatever... I do not know.) I have had feeds with 2-3 million products that replace up to 1.9 million records in the feed every day, meaning that I would have to delete ~2 million and add ~2 million totally different products. That will get them booted from my system quickly.

    Others seem to have problems figuring out how to extract their data correctly. One I just found this morning (which made me come here) is a merchant with around 200,000 products that every 2-3 days sends a feed with only a subset (around 40,000 products). My systems fix everything up and remove 160,000 product records, just to detect on the next load that all the products have been re-added. This cycle repeats itself over and over.
  • Merging feed categories across merchants can be hell. I have multiple sets of rule systems to merge categories into my own "system" on the way in, that will rewrite, move, merge, based on a large set of rules, and it has to be kept updated as merchants change their categories, the format they send them in, and much else. Far from complete yet.
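
To put a number on the feed churn described in the examples above, here is a minimal sketch in Python. It is purely illustrative and is not the PHP/Shell/C/SQL code this post describes; the function names and the 0.5 threshold are assumptions, since the post gives no cut-off for when a feed gets booted.

```python
def churn_ratio(old_skus, new_skus):
    """Fraction of the combined catalogue that was added or removed between two loads."""
    old, new = set(old_skus), set(new_skus)
    union = old | new
    if not union:
        return 0.0
    # Symmetric difference = SKUs that appear in only one of the two loads.
    return len(old ^ new) / len(union)


# Hypothetical threshold; pick whatever your own system can tolerate.
MAX_CHURN = 0.5


def feed_looks_unstable(old_skus, new_skus, max_churn=MAX_CHURN):
    """True when more of the catalogue changed than the threshold allows."""
    return churn_ratio(old_skus, new_skus) > max_churn
```

With the 200,000-product merchant that sends only a 40,000-product subset, 160,000 of 200,000 SKUs differ between loads, a ratio of 0.8, so a check like this would flag the feed before any deletes are generated.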

Soo.. On the original question. "How do you update your site":

I run a lot of robotic scripts in a controlled "assembly" line that crosses multiple servers. On each "station", the robots can clone themselves up to a configured number (depending on server load) to run multiple jobs of a similar type in parallel.
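
The "clone up to a configured number" idea can be sketched with a bounded worker pool. This is a simplified single-process Python illustration, assuming threads stand in for the separate robot processes; `run_station` and its parameters are hypothetical names, not part of the system described here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_station(jobs, worker, max_clones):
    """Run worker(job) for every job, with at most max_clones running at once.

    Results come back in the same order as the jobs were given.
    """
    with ThreadPoolExecutor(max_workers=max_clones) as pool:
        return list(pool.map(worker, jobs))
```

Raising or lowering `max_clones` is the knob that trades throughput against server load.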

The main criterion was always to eliminate as much churn as possible on the databases running the web-sites. Hence I have a separate server doing massive amounts of data manipulation before sending only delta-updates to the web/databases. (This is why I hate large data-feed churn. :-) )
Since prices change much more often than other product info, I separate various change types into separate streams.

The overall loading steps (which coordinate their work between them, as they must happen in certain sequences) are largely:
  1. On my main data-server (which runs no web-site and has no customer access), a robot checks various ftp sites for updated info. It downloads only updated feeds.
  2. Loader robots detect the newly downloaded feed(s) and convert them into tables in a "private" DB format I made up, based on specs from the various networks, depending on the source. (CJ, SHAS, LS all have different formats that have to be interpreted individually, not to mention vendor private feeds.)
  3. A price delta robot detects the new loader tables and calculates price delta tables that contain ONLY specifications for the products that have changed prices (and for products that must be deleted). Since this "delta" only carries prices (retail, base, sale, ...), it is quite compact and can run fast on the web-server when it gets there.
  4. Product delta robots take on the loader tables and create delta tables for detected main product differences (name, description, ...). Typically the number of changes here is much lower than for prices, unless the feed is one of the messy ones mentioned above. Separate sub-robots calculate differences in product categories and base product attributes (brand, manufacturer, size, expirations, and many others). Sometimes there are only price updates and no other product changes at all.
  5. Sync robots detect the new delta tables (minute in size compared to the full product feed), upload and insert them into separate databases on the actual web-server, and plug them into its assembly-line as they arrive, to get them ready for the remote robots.
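
Steps 3 and 4 above can be sketched as a comparison of the previous and the new snapshot of a feed, split into a price stream and a product-info stream. This is a minimal Python illustration using in-memory dicts, not the real delta tables; the field names in `PRICE_FIELDS` and `INFO_FIELDS` are assumptions.

```python
PRICE_FIELDS = ("retail", "sale")        # hypothetical column names
INFO_FIELDS = ("name", "description")    # hypothetical column names


def compute_deltas(old, new):
    """old/new map SKU -> dict of fields. Returns (price_delta, info_delta, deleted)."""
    price_delta, info_delta = {}, {}
    for sku, row in new.items():
        prev = old.get(sku)
        if prev is None:
            # New product: it needs both its prices and its descriptive fields.
            price_delta[sku] = {f: row[f] for f in PRICE_FIELDS}
            info_delta[sku] = {f: row[f] for f in INFO_FIELDS}
            continue
        # Existing product: keep only the fields whose values actually changed.
        changed_prices = {f: row[f] for f in PRICE_FIELDS if row[f] != prev[f]}
        changed_info = {f: row[f] for f in INFO_FIELDS if row[f] != prev[f]}
        if changed_prices:
            price_delta[sku] = changed_prices
        if changed_info:
            info_delta[sku] = changed_info
    # Anything in the old snapshot that is missing from the new one must be deleted.
    deleted = sorted(set(old) - set(new))
    return price_delta, info_delta, deleted
```

A product whose sale price moved but whose name and description did not shows up only in the price delta, which is what keeps that stream compact.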

FROM HERE ON OUT the assembly line runs in parallel, but separately, on multiple servers (scalable):

When the update robots on each server detect that all the necessary pieces are available (on each data- and/or web-server separately), they update the actual databases seen by the web-site. In most cases the updates for a feed run for only a few seconds, because of the delta structure.
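
Applying a small delta to the live tables can be sketched as one short transaction. The example below uses Python's built-in `sqlite3` only as a stand-in, since the post does not name the database; the `products` table and its `sku` and `price` columns are assumptions.

```python
import sqlite3


def apply_price_delta(conn, price_updates, deleted_skus):
    """Apply a small delta to the live table inside one transaction.

    price_updates maps SKU -> new price; deleted_skus lists SKUs to remove.
    """
    with conn:  # commits on success, rolls back if any statement raises
        conn.executemany(
            "UPDATE products SET price = ? WHERE sku = ?",
            [(price, sku) for sku, price in price_updates.items()],
        )
        conn.executemany(
            "DELETE FROM products WHERE sku = ?",
            [(sku,) for sku in deleted_skus],
        )
```

Because only changed rows are touched, the web-site's database sees a handful of statements instead of a full reload.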

There are of course also other unrelated robots that maintain statistics tables (used to speed up the web-site), sync control tables between servers, clean up caches, and do other data maintenance. Plus I can command the robots from the outside, start/stop assembly line stations, and scale up/down the robot clone counts, if I need the disks to slow down or speed up their clicking.

Goals and mileage will vary, but in my case I wanted to separate data-churn from web-site churn. So far, this has worked for me.. It also eliminates a lot of data network traffic from disturbing the web servers.

It does, however, require separate server(s) that spend basically 24 hours a day manipulating data. And more development effort than most affiliates would want to or are able to put into it. Across PHP, Shell, C, and SQL there are many thousands of lines of code in this assembly line. But maybe some of the ideas can be used on a smaller scale.

My worst enemy is junk-data from merchants and close to no data-validation on the networks' parts.. There is neither direct validation of the data, nor programmed-in enforcement/validation of the data manually entered by merchants or affiliate managers. It is not always easy to avoid such data.

To all the affiliate managers and merchants participating or lurking on this site from an automation freak/geek:
  1. Randomness is the extreme enemy of automation... When a merchant or manager suddenly decides to change the name of their feed file, change the content, change the format of their category structure, or make other changes that might seem innocuous at the time, it causes my assembly line to have to learn new junk-detection rules. In some cases it even takes manual (GASP!!) cleanup to remove stuff that lands in the wrong place because things changed with no notice. Largely I believe this is the fault of some of the data networks, who could very easily teach their systems and validations to carry the necessary detail so the data at least matches up, so that when a marketing person suddenly types a long message into their feed file name it has no impact. But some of the networks seem to have little interest in this. (I even had a case a couple of weeks ago where I reported to a network that their advertised APIs were providing bad information because their servers were not sync'ed up between interactions, and the answer I got from them was "Yes, we know. It is bad data. It is a known BUG, and we have no current plan for when we will fix it.")
    YIKES!!!!!!!! That caused me to have to invent more automation to "validate" whether the network API really, really meant the data it was returning, or whether to "fix" the API data for them.
  2. When you provide data feeds that change the actual products listed (SKUs carried) in random ways (more than normal adding/removing of SKUs), you are defeating your purpose. When large sets of products appear and disappear overnight, it not only causes churn in my data, it also causes the web-crawlers (Big-G and others) to go nuts when they hit 404 (not found) for products/pages that they had found just the day before.
Sorry about the long response.. My answers to things are probably over-complicated for the need of most...

DeeCee
  #27  
December 13th, 2011, 05:40 PM
Newbie
Join Date: October 5th, 2011
Location: Texas, USA
Posts: 15
Quote:
Originally Posted by wpbounce
Oh wow. A lot of automation. I'm gonna separate feed updating process from production servers because nightly updates are killing the server and I have to reboot it every now and then.

Would you mind to share how many sites you have? Also what's your category classification techniques?
Yes, for large data-feeds, doing updates based on the full feed can literally bring a server to its knees. :-)
That's why I decided up front to separate it from sharing space with a web-server, and only send deltas on to the main data-server.

The system I currently run is set up to support, theoretically, a larger number of different sites, because the presentation part is the third piece I did not mention earlier. Each site can run on the local web server or on any external server I allow to connect. Currently I use a basic Joomla install as the basic site control, and I developed a large Joomla component that does most of the display work, plus a set of Joomla plugins and modules to control both display and the interface to data. One plugin, for example, is the Data Connector, where I can select connectivity depending on whether my local "direct DB connect" would work or whether (theoretically) a SOAP or REST controller for remote sites is installed to manage the connectivity and site permissions.
Install the Legos on a new Joomla site, set the configuration for restrictions on product types, categories, and such, and it sort of runs itself..

One site could, for example, be restricted to clothing data, another could support only pet products, another auto products, and so on. I am also doing some WordPress plugins/widgets that can connect to the same data source so I can show my "own" random ads on blog sites and such.

To keep too much data from being pushed across the net and out of the databases, I developed a data-caching system that filters the data requests at the individual web-server. Each type of data (configuration info, country codes, prices, product info, categories, and so on) has a different allowed cache time limit, which lets the web-server avoid talking to the data server for repeated data. In the case of, for example, a promotion, where the same web-page (and hence data) would load repeatedly, the data loads locally from either APC, Memcache, or a simple text file. That makes the web-site faster and also keeps a lot of load off the database.
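
The per-type cache lifetimes can be sketched as a small time-to-live cache. This is an illustrative in-process Python version, not the APC/Memcache/text-file layer described above; the class name, the lifetimes in `TTL_SECONDS`, and the `loader` callback are all assumptions.

```python
import time

# Hypothetical cache lifetimes in seconds per data type; the post gives no numbers.
TTL_SECONDS = {"config": 3600, "country_codes": 86400, "prices": 300}


class TtlCache:
    def __init__(self, ttls, clock=time.monotonic):
        self._ttls = ttls
        self._clock = clock
        self._store = {}  # (kind, key) -> (expires_at, value)

    def get(self, kind, key, loader):
        """Return a cached value, calling loader(key) only when missing or expired."""
        now = self._clock()
        hit = self._store.get((kind, key))
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader(key)
        self._store[(kind, key)] = (now + self._ttls[kind], value)
        return value
```

A promotion page that requests the same prices over and over would call the loader (the trip to the data server) once per lifetime instead of once per page view.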


Category classification (merging vendor classifications and moving products around between categories) is a real bi***.
It is a bunch of PHP and hundreds of regular-expression rule-sets that filter the incoming stuff initially.
I then have utilities that allow me to kill or move categories, or restructure category trees to look the way I want them, and that add more "rules" to prevent the same mistake from happening again. The moves take a little manual thinking when I bother to go through that exercise to clean things up. The automatic categorization runs on its own. The "repairs" when the rules make mistakes take my manual intervention (telling it where to move a tree to). (Not good, I know.. I prefer pushing out data and code, and letting the web-sites run themselves.) But that part is necessary. Moving a category tree can cause hundreds of automatic changes to cascade through the category paths.
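
A regular-expression rule-set of the kind described can be sketched as an ordered list of patterns where the first match wins. This is a toy Python illustration, not the PHP rules in the real system; the two rules, the target category names, and the "Unsorted" fallback are made up for the example.

```python
import re

# Hypothetical rules: (pattern on the merchant's category string, own category).
RULES = [
    (re.compile(r"\b(dog|cat|pet)s?\b", re.I), "Pets"),
    (re.compile(r"\b(shirt|jean|dress)(s|es)?\b", re.I), "Clothing"),
]


def classify(merchant_category, rules=RULES, default="Unsorted"):
    """Return the own-category of the first rule that matches, else the default."""
    for pattern, target in rules:
        if pattern.search(merchant_category):
            return target
    return default
```

The word boundaries matter: without them a merchant category literally named "Category" would match the "cat" rule and land in Pets, which is exactly the kind of mistake that then needs a manual repair and a new rule.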

On the back-end of that, the system is set up to automatically track those cat moves and pass them to the server(s) in control tables, so the web-server knows where and when to do 301 redirects. This is for when people are sent to the site's category pages by search engines. To keep them from hitting a 404 error when a sub-category got killed or moved, the data connector returns a special result to the web-server that tells it that the data moved, and where to redirect the category to. So, invisibly to the users, they are automatically redirected to the new location/page with a 301. Eventually, on their re-checks, Google/Bing catch up to my funky changes and the redirects are no longer necessary.
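
The move-tracking lookup can be sketched as a table of old path to new path that is followed until a live category is reached. This is a minimal Python illustration; the function name, the dict/set representation of the control tables, and the example paths are assumptions.

```python
def resolve_category(path, moves, live):
    """Return (status, path): 200 if live, 301 to the final target if moved, else 404.

    moves maps an old category path to its new path; live is the set of current paths.
    """
    seen = set()
    target = path
    while target in moves and target not in seen:
        seen.add(target)  # guard against accidental move cycles
        target = moves[target]
    if target in live:
        return (200, target) if target == path else (301, target)
    return (404, path)
```

Following the whole chain means a category that was moved twice still gets a single 301 straight to its current location.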


FYI. Another problem I ran into very early was the issue of managing the URLs correctly to maintain search engine rankings and prevent a lot of 404s in Google/Bing when I changed something. (Yes, some of my pre-planning sucked. Mainly Google played tricks on me. My first version of a URL structure made Google assign some "main keywords" I did not want..)
I fixed that by having each object (product, category, ...., ...) know its own expected "address" or URL. So if I change the way things work, everything automatically fixes itself over time. If a user or a search engine enters the system by an old URL, the object itself recognizes the mistake and orders a redirect to the right location. That fixes the user perception and fixes the search engines (over time).
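
The "each object knows its own address" idea can be sketched like this. It is a hypothetical Python illustration, not the Joomla component; the `Product` class, its fields, and the `/product/<sku>/<slug>` URL scheme are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Product:
    sku: str
    slug: str

    def canonical_url(self):
        # Hypothetical URL scheme; change it here and old links start redirecting.
        return f"/product/{self.sku}/{self.slug}"


def respond(product, requested_path):
    """Serve the page on the canonical URL, otherwise ask for a 301 to it."""
    canonical = product.canonical_url()
    if requested_path == canonical:
        return (200, canonical)
    return (301, canonical)
```

Because the comparison happens on every request, changing the URL scheme in one place is enough for old links to start redirecting.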

Sorry. As always I get too wordy.. The brakes on my typing and brain-farts ain't working.

On a side note: this is all now kinda running itself. I got mad at all the scammers, spammers, and hackers of the world, and took a long "vacation" from this to develop security software and email/blog/robot traps around the net. I am putting together a DNSBL and spam-blockers that I now use for email and web-server control to catch and block bad guys and their beneficiaries.

To any lurking hackers/spammers: I dislike your methods to an extreme degree.. For the past week I have been fighting BurstNet's NOC over a cluster of hacker bots that are still, to this minute, running SQL injection attacks across all domains from all BurstNet Scranton, PA networks. They have been trying to hack my user tables 24 hours a day since last week. They will never get through, though. But BurstNet apparently does not want to lose the hacker business, because they have done nothing to stop it, despite it being a simple pattern to catch, and instead they closed my incident in their tracking system. Just FYI on BurstNet.
The hackers are still blasting away, happy as clams.

No success with US BurstNet this week, but I did manage to shut down a phishing site in Germany (got their site and domain shut down), and decoded phishing code on domains in Russia running scams on Australians..
Only on the Internet.. Travel the world without even leaving the couch. :-)