TicketMaster data quality is bad
Hi - I'm an affiliate newbie, but an old database guy. I'm working on importing the TicketMaster data feeds, specifically venues, artists and concert tickets.
First of all, CONGRATS and great kudos to TM for having a separate list of venues and artists in the first place. That's awesome - and from what I've seen so far, they're unique in that.
Boooo, however, on their data quality. I'm seeing a LOT of duplicates in their venue and artist feeds, which means I'll have to develop algorithms to de-dupe them after importing into my database if I want to offer high quality to MY visitors (er, theirs too, in this case).
Thing is, removing basic duplicates at this level is not that hard. I'm not asking for perfection - I can appreciate that managing consumer-generated content is a real challenge. But it seems like they could do more cleanup before exporting their data to us. Simply grouping by address, city, state, zip and the "dupechecked" venue name would be better than this.
Dupecheck fields are what I call fields where you lowercase the venue name and strip out every character that isn't a letter or a digit, purely for comparison. "Will's Pub" and "Wills Pub" then both become "willspub".
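
For anyone who wants to try this, here's a rough Python sketch of the dupecheck idea plus the grouping by address, city, state, zip and dupechecked name. The field names ("name", "address", "city", "state", "zip") and the sample rows are made up for illustration and are not the actual feed layout. I'm also assuming ASCII-only matching and that trimming and lowercasing the address fields is good enough.

```python
import re
from collections import defaultdict


def dupecheck(value):
    """Lowercase the value and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def group_venues(venues):
    """Group venue rows that share address, city, state, zip and dupechecked name."""
    groups = defaultdict(list)
    for venue in venues:
        key = (
            dupecheck(venue["name"]),
            venue["address"].strip().lower(),
            venue["city"].strip().lower(),
            venue["state"].strip().lower(),
            venue["zip"].strip(),
        )
        groups[key].append(venue)
    return groups


# Hypothetical rows, not real feed data.
sample = [
    {"name": "Will's Pub", "address": "1 Example St", "city": "Orlando", "state": "FL", "zip": "00000"},
    {"name": "Wills Pub", "address": "1 example st ", "city": "orlando", "state": "fl", "zip": "00000"},
    {"name": "Other Hall", "address": "2 Example St", "city": "Orlando", "state": "FL", "zip": "00000"},
]

groups = group_venues(sample)
# Keep only the groups with more than one row: these are the likely duplicates.
duplicates = [rows for rows in groups.values() if len(rows) > 1]
```

With these sample rows, the two "Will's Pub" spellings land in one group and "Other Hall" stays on its own. The same thing could be done in SQL with a computed dupecheck column and a GROUP BY; the Python version is just easier to show here.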
Anyone else encounter this headache?