Thread: How are you updating your datafeed sites?
#26
Quote:
Maybe some ideas can be extracted where appropriate.

Hmm.. Tricky, tricky question to answer in general.. The correct answer for how to load datafeeds depends on how you plan to use them, and where you see your site(s) going longer term. I currently load 350 datafeeds (and growing) of many sizes, checking for and downloading updated feeds every few hours, 100% automated but under my watchful eye. Here are some of my experiences from building my current system.
Soo.. On the original question, "How do you update your site": I run a lot of robotic scripts in a controlled "assembly line" that crosses multiple servers. On each "station", the robots can clone themselves up to a configured number (depending on server load) to run multiple jobs of a similar type in parallel.

The main criterion was always to eliminate, as much as possible, churn on the databases running the web-sites. Hence I have a separate server doing massive amounts of data manipulation before sending only delta-updates to the web/databases. (This is why I hate large datafeed churn. :-) ) Since prices change much more often than other product info, I separate the various change types into separate streams.

The overall loading steps (which coordinate their work between them, as they must happen in certain sequences) are largely:
FROM HERE ON OUT the assembly line runs in parallel, but separately, on multiple servers (scalable):

When the update robots on each server detect that all the necessary pieces are available (on each data- and/or web-server separately), they update the actual databases seen by the web-site. In most cases these updates for a feed run for only a few seconds, because of the delta structure.

There are of course also other unrelated robots that maintain statistics tables (used to speed up the web-site), sync control tables between servers, do cache cleanups, and handle other data maintenance. Plus I can command the robots from the outside, start/stop assembly line stations, and scale the robot clone counts up or down if I need the disks to slow down or speed up their clicking.

Goals and mileage will vary, but in my case I wanted to separate data churn from web-site churn. So far, this has worked for me.. It also keeps a lot of data network traffic from disturbing the web servers. It does, however, require separate server(s) that spend basically 24 hours a day manipulating data, and more development effort than most affiliates would want to or are able to put into it. Across PHP, Shell, C, and SQL there are many thousands of lines of code in this assembly line. But maybe some of the ideas can be used on a smaller scale.

My worst enemy is junk data from merchants and close to no data validation on the networks' part.. Neither direct validation of data, nor programmed-in enforcement/validation of the data manually entered by merchants or affiliate managers. Such data is not always easy to avoid.

To all the affiliate managers and merchants participating or lurking on this site, from an automation freak/geek: Sorry about the long response.. My answers to things are probably over-complicated for the needs of most...

DeeCee
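For anyone who wants to try the delta idea from the post above on a smaller scale, here is a minimal Python sketch. It is not the poster's code (that system is PHP, Shell, C, and SQL, and is not shown in the thread). It assumes each feed can be loaded as a dict keyed by product id, and the field name `price` and the stream names are made up for illustration. It compares a stored snapshot with a fresh feed and splits the differences into separate streams, so the fast-changing price updates can be shipped on their own.

```python
from typing import Dict, List

Product = Dict[str, str]  # one feed row; the "price" field name is an assumption


def compute_deltas(old: Dict[str, Product], new: Dict[str, Product]) -> Dict[str, List]:
    """Compare the stored snapshot with a fresh feed, both keyed by product id.

    Returns four hypothetical streams: inserts, deletes, price changes,
    and changes to any other product field.
    """
    deltas: Dict[str, List] = {"insert": [], "delete": [], "price": [], "info": []}
    for pid, row in new.items():
        prev = old.get(pid)
        if prev is None:
            deltas["insert"].append((pid, row))
            continue
        if row.get("price") != prev.get("price"):
            deltas["price"].append((pid, row.get("price")))
        # Non-price fields whose value differs go into the slower "info" stream.
        # (Fields that vanished from the new row are not detected in this sketch.)
        changed = {k: v for k, v in row.items() if k != "price" and prev.get(k) != v}
        if changed:
            deltas["info"].append((pid, changed))
    deltas["delete"] = [pid for pid in old if pid not in new]
    return deltas


# Made-up sample data: one price change, one new product, one dropped product.
old_feed = {
    "a1": {"name": "Dog bed", "price": "19.99"},
    "a2": {"name": "Leash", "price": "5.00"},
}
new_feed = {
    "a1": {"name": "Dog bed", "price": "17.99"},
    "a3": {"name": "Collar", "price": "7.50"},
}
deltas = compute_deltas(old_feed, new_feed)
```

Only the contents of `deltas` would then need to reach the database behind the web-site, which is what keeps each update down to a few rows instead of a full reload.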
#27
Quote:
That's why I decided up front not to have it share space with a web-server, and to send only deltas on to the main data-server. The system I currently run is set up to support a theoretically larger number of different sites, because the presentation part is the third piece I did not mention earlier. Each site can run on the local web server or on any external server I allow to connect.

Currently I use a basic Joomla install as the basic site control, and I developed a large Joomla component that does most of the display work, plus a set of Joomla plugins and modules to control both the display and the interface to the data. One plugin, for example, is the Data Connector, where I can select the connectivity depending on whether my local "direct DB connect" would work, or whether (theoretically) a SOAP or REST controller for remote sites is installed to manage the connectivity and site permissions.

Install the Legos on a new Joomla site, set the configuration for restrictions on product types, categories, and such, and it sort of runs itself.. One site could, for example, be restricted to clothing data, another could support only pet products, auto products, and so on. I am also doing some WordPress plugins/widgets that can connect to the same data source, so I can show my "own" random ads on blog sites and such.

To prevent too much data from being pushed across the net and pulled from the databases, I developed a data-caching system that filters the data requests at the individual web-server. Each type of data (configuration info, country codes, prices, product info, categories, and so on) has a different allowed cache time limit, which lets the web-server avoid talking to the data server for repeated data. In the case of, for example, a promotion, where the same web page (and hence data) would load repeatedly, the data loads locally from either APC, Memcache, or a simple text file. That makes the web-site faster, and also keeps a lot of load off the database.
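The per-type cache limits described above could look roughly like the following Python sketch. This is only an illustration under stated assumptions: the poster's cache sits in front of APC, Memcache, or a text file, while this version uses a plain in-process dict as a stand-in. The class name `TypedTTLCache`, the `loader` callback, and the TTL values are all hypothetical, since the thread does not give the real limits.

```python
import time
from typing import Any, Callable, Dict, Tuple

# Hypothetical cache lifetimes in seconds per data type; the real limits are not given.
TTL_SECONDS = {"config": 3600, "country_codes": 86400, "prices": 300, "product": 1800}


class TypedTTLCache:
    """Tiny in-process cache where each data type has its own time limit."""

    def __init__(self, ttls: Dict[str, float], clock: Callable[[], float] = time.monotonic):
        self._ttls = ttls
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get(self, data_type: str, key: str, loader: Callable[[str], Any]) -> Any:
        """Return the cached value, or call loader(key) when it is missing or expired."""
        now = self._clock()
        hit = self._store.get((data_type, key))
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: no trip to the data server
        value = loader(key)  # stands in for the query to the data server
        self._store[(data_type, key)] = (now + self._ttls[data_type], value)
        return value


cache = TypedTTLCache(TTL_SECONDS)
price = cache.get("prices", "sku-123", lambda key: "17.99")        # loader runs
price_again = cache.get("prices", "sku-123", lambda key: "0.00")   # served from cache
```

The second call returns the first value because the entry is still within its five-minute limit, which is the behavior that helps when a promotion makes the same page load over and over.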
Category classification (merging vendor classifications and moving products around between categories) is a real bi***. A bunch of PHP and hundreds of regular-expression rule sets filter the incoming stuff initially. I then have utilities that allow me to kill or move categories, or restructure category trees to look the way I want them, plus they add more "rules" to prevent the same mistake from happening again.

The moves take a little manual thinking when I bother to go through that exercise to clean things up. The automatic categorization runs on its own. The "repairs" when the rules make mistakes take my manual intervention (telling it where to move a tree to). (Not good, I know.. I prefer pushing out data and code, and letting the web-sites run themselves.) But that part is necessary.

Moving a category tree can cause hundreds of automatic changes to cascade through the category paths. On the back-end of that, the system is set up to automatically track those category moves and pass them to the server(s) in control tables, so the web-server knows where and when to do 301 redirects. This is for when people are sent to the site through categorization pages by search engines. To keep them from hitting a 404 error when a sub-category got killed or moved, the data connector returns a special result to the web-server that tells it that the data moved, and where to redirect the category to. So, invisibly to the users, they are automatically redirected to the new location/page with a 301. Eventually, on their re-checks, Google/Bing catch up to my funky changes and the redirects are no longer necessary.

FYI. Another problem I ran into very early was the issue of managing the URLs correctly to maintain search engine rankings and prevent a lot of 404s in Google/Bing when I changed something. (Yes, some of my pre-planning sucked. Mainly, Google played tricks on me. My first version of a URL structure made Google assign some "main keywords" I did not want..)
I fixed that by having each object (product, category, ...., ...) know its own expected "address" or URL. So if I change the way things work, everything automatically fixes itself over time. If a user or a search engine enters the system by an old URL, the object itself recognizes the mistake and orders a redirect to the right location. That fixes the user perception and fixes the search engines (over time).

Sorry. As always I get too wordy.. The brakes on my typing and brain-farts ain't working.

On a side note: this is all now kinda running itself. I got mad at all the scammers, spammers, and hackers of the world, and took a long "vacation" from this to develop security software and email/blog/robot traps around the net. I am putting together a DNSBL and spam-blockers that I now use for email and web-server control to catch and block the bad guys and their beneficiaries. To any lurking hackers/spammers: I dislike your methods to an extreme degree..

The past week I have been fighting BurstNet's NOC over a cluster of hacker bots that are still, to this minute, running SQL injection attacks across all domains from all BurstNet Scranton, PA networks. 24 hours a day they have been trying to hack my user tables since last week. They will never get through, though. But BurstNet apparently does not want to lose the hacker business, because they have done nothing to stop it, despite it being a simple pattern to catch, and instead they closed my incident in their tracking system. Just FYI on BurstNet. The hackers are still blasting away, happy as clams.

No success with US BurstNet this week, but I did manage to shut down a phishing site in Germany (got their site and domain shut down), and decoded phishing code on domains in Russia running scams on Australians.. Only on the Internet.. Travel the world without even leaving the couch. :-)
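Going back to the two redirect ideas in this post (tracked category moves, and objects that know their own URL), a rough Python sketch follows. It is an illustration only, not the poster's Joomla/PHP code. The `moves` dict stands in for the control table of category moves, and the `Product` class, its fields, and the `/product/<id>-<slug>` URL scheme are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def resolve_moved_category(path: str, moves: Dict[str, str]) -> str:
    """Follow recorded category moves until reaching a path that has not moved.

    The `seen` set guards against an accidental cycle in the move table.
    """
    seen = set()
    while path in moves and path not in seen:
        seen.add(path)
        path = moves[path]
    return path


@dataclass
class Product:
    product_id: int
    slug: str

    def canonical_url(self) -> str:
        # Hypothetical URL scheme; the object itself is the single source of truth.
        return f"/product/{self.product_id}-{self.slug}"


def redirect_target(requested_path: str, obj: Product) -> Optional[str]:
    """Return the URL to answer with a 301, or None if the request is already canonical."""
    canonical = obj.canonical_url()
    return None if requested_path == canonical else canonical


# Made-up move history: a sub-category was moved twice.
moves = {
    "/cat/pets/beds": "/cat/pets/sleeping",
    "/cat/pets/sleeping": "/cat/pet-supplies/sleeping",
}
new_category = resolve_moved_category("/cat/pets/beds", moves)

bed = Product(product_id=42, slug="dog-bed")
old_style = redirect_target("/shop/item.php?id=42", bed)  # old URL: redirect needed
current = redirect_target("/product/42-dog-bed", bed)     # canonical URL: no redirect
```

Because the target is computed from the object rather than stored per old URL, changing the URL scheme later only means changing `canonical_url`, and old links keep resolving.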







