  1. #1
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Running through large datafeeds
I'm trying to parse some huge datafeeds, we're talking upwards of 1GB. It's become a bigger snag than I thought because PHP simply can't open files that large. My solution was to `split` the files into smaller chunks and then parse each of those. While it works flawlessly, it hits the server hard. Hard enough to take the entire site down for upwards of an hour, because all of our pages are dynamic and we don't do any sort of front-end caching. Adding that is a technical feat in itself, but only at the cost of time. I think the best solution is to move MySQL off the Apache server and throw some additional RAM into the new MySQL server (maybe 6GB?), then have the Apache server do the splitting while the MySQL server handles the dynamic content. What's better: Woodcrest or Dual Xeon?

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  2. #2
    ABW Ambassador PatrickAllmond's Avatar
    Join Date
    September 20th, 2005
    Location
    OKC
    Posts
    1,219
Snib, my friend,

    I am going to address the bigger issue. If this is one-time parsing, i.e. prep work (or occasional parsing), why are you doing this on the server? Why not do it on your local machine so all of the impact stays local? Just curious.
    ---
    This response was masterly crafted via the fingers of Patrick Allmond who believe you should StopDoingNothing starting today.
    ---
    Focus Consulting is where I roll | Follow @patrickallmond on Twitter
    Search Engine Marketing | Search Engine Optimization | Social Media | Online Video

  3. #3
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Quote Originally Posted by patrick24601
Snib, my friend,

    I am going to address the bigger issue. If this is one-time parsing, i.e. prep work (or occasional parsing), why are you doing this on the server? Why not do it on your local machine so all of the impact stays local? Just curious.
    This will happen multiple times daily. It's all automated. Using my local machine would create a major bottleneck in the entire process.

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  4. #4
    ABW Ambassador PatrickAllmond's Avatar
    Join Date
    September 20th, 2005
    Location
    OKC
    Posts
    1,219
Still churning here.... You could even set up something like a local server (not your primary PC) that receives the data, parses through it like you're talking about, and then sends it all back. You could automate that. But then you have the issue of network traffic. Hmmm. I'll think about this and pass it around to some guys here.

    BTW... I have *No* idea on your original question
    ---
    This response was masterly crafted via the fingers of Patrick Allmond who believe you should StopDoingNothing starting today.
    ---
    Focus Consulting is where I roll | Follow @patrickallmond on Twitter
    Search Engine Marketing | Search Engine Optimization | Social Media | Online Video

  5. #5
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Quote Originally Posted by patrick24601
Still churning here.... You could even set up something like a local server (not your primary PC) that receives the data, parses through it like you're talking about, and then sends it all back. You could automate that. But then you have the issue of network traffic. Hmmm. I'll think about this and pass it around to some guys here.

    BTW... I have *No* idea on your original question
I'm leaning towards what you're saying by adding another dedicated server to the mix. That way they can be cross-connected for snappy data connections. I'd never host a business machine at home because I have no control over power or internet outages. Plus, who wants a bunch of servers humming away all day and night at home? If I were to buy my own machine, I'd co-locate it in a server farm.

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  6. #6
    ABW Ambassador PatrickAllmond's Avatar
    Join Date
    September 20th, 2005
    Location
    OKC
    Posts
    1,219
But even if you move MySQL off to another server, it sounds like that will still be your data-parsing server. Won't you still have the issue of that server being bogged down with all of the parsing? If the Apache server comes over to the MySQL server and requests something, that MySQL server is going to be pretty busy with the parsing.

    It seems to me you'd want your parsing on a completely different server from any of your production servers. I understand you not wanting to host it at home, but you may want to separate those two functions altogether - Production and Parsing.

    So maybe add another server and make that one just for parsing and testing. Then you could leave the live site and its MySQL on your existing server.

    Is that what your original proposition was?
    ---
    This response was masterly crafted via the fingers of Patrick Allmond who believe you should StopDoingNothing starting today.
    ---
    Focus Consulting is where I roll | Follow @patrickallmond on Twitter
    Search Engine Marketing | Search Engine Optimization | Social Media | Online Video

  7. #7
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
I see your point, and my solution is to separate the file splitting from the MySQL server. To do this I'll break up the files on the Apache server, where the load should be lower and the splitting won't affect the MySQL processes on the other server. With a cross connection I can import the smaller files into the MySQL server quickly.

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  8. #8
    Full Member jollygoodpirate's Avatar
    Join Date
    January 18th, 2005
    Location
    NC
    Posts
    227
    Quote Originally Posted by Snib
I'm trying to parse some huge datafeeds, we're talking upwards of 1GB. It's become a bigger snag than I thought because PHP simply can't open files that large. My solution was to `split` the files into smaller chunks and then parse each of those. While it works flawlessly, it hits the server hard. Hard enough to take the entire site down for upwards of an hour, because all of our pages are dynamic and we don't do any sort of front-end caching. Adding that is a technical feat in itself, but only at the cost of time. I think the best solution is to move MySQL off the Apache server and throw some additional RAM into the new MySQL server (maybe 6GB?), then have the Apache server do the splitting while the MySQL server handles the dynamic content. What's better: Woodcrest or Dual Xeon?

    - Scott
    Hey Scott,

I am running a dual Xeon and it does well. I have found that the extra RAM is great, but the disk is a HUGE bottleneck. You may want to consider a RAID config set up specifically for throughput. I am a fan of SCSI, so I did a SCSI RAID 0, which spreads the throughput across the disks. It works well for me.

On the question of the 1GB+ datafeeds, here is how I handled it: I uncompress the file and pipe it to the PHP script, which reads from STDIN on the fly. That way the PHP script never opens the file directly, and I can parse files much larger than 1GB. I read that I could compile PHP to open files over 1GB, but this works fine. Again, I think your disk will be your big bottleneck.

    Here is the pipe scheme I use:
    gzip -dc datafeedfile.gz | ./processfeed.php

    Maybe you had not thought about that.....

    Good Luck

    --
    Fernando

  9. #9
    ABW Ambassador PatrickAllmond's Avatar
    Join Date
    September 20th, 2005
    Location
    OKC
    Posts
    1,219
    OK. Snib - we are on the same page.

Something else to think about: is PHP the best tool for the parsing? I know Linux has some powerful text-parsing tools, but I am not sure PHP is one of them. Just a thought.
    ---
    This response was masterly crafted via the fingers of Patrick Allmond who believe you should StopDoingNothing starting today.
    ---
    Focus Consulting is where I roll | Follow @patrickallmond on Twitter
    Search Engine Marketing | Search Engine Optimization | Social Media | Online Video

  10. #10
    Moderator MichaelColey's Avatar
    Join Date
    January 18th, 2005
    Location
    Mansfield, TX
    Posts
    16,232
    My question is why you're doing the processing over the web with PHP instead of just processing it directly on the server?

    I do all of my processing directly on my database server (which is separate from my web servers). All of the processing is designed to not "lock" the tables and to be "CPU friendly". After processing every X records, I check the load average and pause if it's hammering the server too hard.
    Michael Coley
    Amazing-Bargains.com
     Affiliate Tips | Merchant Best Practices | Affiliate Friendly? | Couponing | CPA Networks? | ABW Tips | Activating Affiliates
    "Education is the most powerful weapon which you can use to change the world." Nelson Mandela

  11. #11
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Quote Originally Posted by jollygoodpirate
On the question of the 1GB+ datafeeds, here is how I handled it: I uncompress the file and pipe it to the PHP script, which reads from STDIN on the fly. That way the PHP script never opens the file directly, and I can parse files much larger than 1GB. I read that I could compile PHP to open files over 1GB, but this works fine. Again, I think your disk will be your big bottleneck.

    Here is the pipe scheme I use:
    gzip -dc datafeedfile.gz | ./processfeed.php
    Thank you Jolly,

I did come across that solution in my Google search, but my problem is that my XML parser can't handle it. I'm using XMLReader, which handles large files better than a DOM-based solution. It works for reasonably large XML files, but not over 1GB. Are you parsing XML or flat text? I think your solution would work well if there were a PHP XML library that could handle it.

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  12. #12
    ABW Ambassador PatrickAllmond's Avatar
    Join Date
    September 20th, 2005
    Location
    OKC
    Posts
    1,219
    Michael,

I believe he is processing it on the database server. His database server is a hosted server running MySQL and Apache, and all of the processing is bogging down Apache and MySQL.

    Interesting thought on checking the CPU load, though. Can you give some more detail, and maybe a PHP call for checking the CPU load?

    Patrick
    ---
    This response was masterly crafted via the fingers of Patrick Allmond who believe you should StopDoingNothing starting today.
    ---
    Focus Consulting is where I roll | Follow @patrickallmond on Twitter
    Search Engine Marketing | Search Engine Optimization | Social Media | Online Video

  13. #13
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Quote Originally Posted by MichaelColey
    My question is why you're doing the processing over the web with PHP instead of just processing it directly on the server?

    I do all of my processing directly on my database server (which is separate from my web servers). All of the processing is designed to not "lock" the tables and to be "CPU friendly". After processing every X records, I check the load average and pause if it's hammering the server too hard.
I'm processing it directly on my server. It's just that splitting the file and creating hundreds of smaller XML files hammers the server and prevents MySQL from operating smoothly. Most of the CPU is being eaten up by the command-line `split` command, which takes at least an hour to break up a 3GB file.

I just had an idea based on Jolly's suggestion. I can read the file in from STDIN on the command line and send chunks of XML to XMLReader as I go along. That avoids overloading XMLReader with the entire contents of the file, and it also avoids loading the entire file into memory.
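    Something like this, maybe (rough and untested; "product" is just a stand-in for whatever the feed's record element is actually called):

    Code:
    <?php
    // Feed the decompressed file in over STDIN, e.g.:
    //   gzip -dc datafeedfile.gz | ./processfeed.php
    $reader = new XMLReader();
    $reader->open('php://stdin');

    while ($reader->read()) {
        // Only react to the start of each record element ("product" is a placeholder name)
        if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'product') {
            // Pull one record at a time into memory, so memory use stays flat
            $item = new SimpleXMLElement($reader->readOuterXML());
            // ... insert $item's fields into MySQL here ...
        }
    }
    $reader->close();
    No idea yet how fast it will be, but at least the whole file never has to sit in memory.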

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  14. #14
    ABW Ambassador PatrickAllmond's Avatar
    Join Date
    September 20th, 2005
    Location
    OKC
    Posts
    1,219
    Alright Jolly!
    ---
    This response was masterly crafted via the fingers of Patrick Allmond who believe you should StopDoingNothing starting today.
    ---
    Focus Consulting is where I roll | Follow @patrickallmond on Twitter
    Search Engine Marketing | Search Engine Optimization | Social Media | Online Video

  15. #15
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Quote Originally Posted by patrick24601

Interesting thought on checking the CPU load, though. Can you give some more detail, and maybe a PHP call for checking the CPU load?

    Patrick
I watch 'top' from the command line and I use SHOW PROCESSLIST in MySQL to make sure it's not backlogging. MySQL is usually the first indicator, because I can see what's not getting done. When I've got 30-40 incomplete SQL queries, things lock up.

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  16. #16
    ABW Ambassador PatrickAllmond's Avatar
    Join Date
    September 20th, 2005
    Location
    OKC
    Posts
    1,219
But Snib, it would be interesting if you could build the check into the PHP and have it do some auto-throttling.
    ---
    This response was masterly crafted via the fingers of Patrick Allmond who believe you should StopDoingNothing starting today.
    ---
    Focus Consulting is where I roll | Follow @patrickallmond on Twitter
    Search Engine Marketing | Search Engine Optimization | Social Media | Online Video

  17. #17
    Moderator MichaelColey's Avatar
    Join Date
    January 18th, 2005
    Location
    Mansfield, TX
    Posts
    16,232
    I don't use PHP. I use Perl.
    Michael Coley
    Amazing-Bargains.com
     Affiliate Tips | Merchant Best Practices | Affiliate Friendly? | Couponing | CPA Networks? | ABW Tips | Activating Affiliates
    "Education is the most powerful weapon which you can use to change the world." Nelson Mandela

  18. #18
    Full Member jollygoodpirate's Avatar
    Join Date
    January 18th, 2005
    Location
    NC
    Posts
    227
    Here is how I do it...
    Quote Originally Posted by Snib
    Thank you Jolly,

    I did come across that solution in my Google search, but my problem is my XML parser can't handle it. I'm using XMLReader which handles large files better than a DOM XML solution. It works for reasonably large XML files, but not over 1gb. Are you doing XML parsing or flat text? I think your solution would work well if there was a PHP XML library that could handle it.

    - Scott
My true loader is a C program that parses CSV, so technically what I posted was an XML-to-CSV translator. The CSV file, after filtering and all the XML stuff, is well under 1GB. Then I can use my regular loader without a problem.

    My xml2csv parser works like this.

    Code:
    if (!($fp = fopen('php://stdin', 'r'))) {
        die("No xml file on stdin\n");
    }
    $parser = xml_parser_create();  // plus xml_set_element_handler() etc. for the actual CSV output
    while ($data = fread($fp, 50000000)) {
        // feof() tells the parser when the final chunk has been fed in
        xml_parse($parser, $data, feof($fp)) or
            die("XML error: " . xml_error_string(xml_get_error_code($parser)));
    }
    fclose($fp);
    For further details I would have to go back to what I did, I don't recall all of the details..

  19. #19
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Quote Originally Posted by patrick24601
But Snib, it would be interesting if you could build the check into the PHP and have it do some auto-throttling.
    Not sure what you mean here.

    Quote Originally Posted by MichaelColey
    I don't use PHP. I use Perl.
I considered Perl, but I read that even Perl can't open a file this large. It might work with the STDIN solution, but I'm more comfortable writing in PHP. We'll see how this goes.

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  20. #20
    Moderator MichaelColey's Avatar
    Join Date
    January 18th, 2005
    Location
    Mansfield, TX
    Posts
    16,232
No, Perl can handle large files just fine. You wouldn't want to read the whole thing in at once, though. Just process it line by line.
    Michael Coley
    Amazing-Bargains.com
     Affiliate Tips | Merchant Best Practices | Affiliate Friendly? | Couponing | CPA Networks? | ABW Tips | Activating Affiliates
    "Education is the most powerful weapon which you can use to change the world." Nelson Mandela

  21. #21
    Affiliate Manager adambha's Avatar
    Join Date
    October 20th, 2006
    Posts
    301
    Quote Originally Posted by MichaelColey
No, Perl can handle large files just fine.
    I also understand this to be very true.

I knew a Perl programmer who processed massive text files and swore that his Perl scripts would index, sort, and search through the data faster than an Oracle database. Whether that is true or not is debatable, of course.

  22. #22
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
I read through this and he said he had problems with Perl as well:

    http://boonedocks.net/mike/archives/...Linux-2.4.html

Seems like the best solution for Perl and PHP with 1-2GB+ files is to use STDIN.
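    For a flat-text feed the same idea is just a line-by-line loop over STDIN. Something like this, off the top of my head (untested, and the tab delimiter is only an assumption):

    Code:
    <?php
    // Read a delimited feed from STDIN one line at a time,
    // so the file never has to be opened or held in memory whole.
    $fp = fopen('php://stdin', 'r');
    while (($line = fgets($fp)) !== false) {
        $fields = explode("\t", rtrim($line, "\r\n"));  // assumes a tab-delimited feed
        // ... validate $fields and insert the record into MySQL here ...
    }
    fclose($fp);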

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  23. #23
    Fear and Arrogance jrrl's Avatar
    Join Date
    January 18th, 2005
    Location
    Pittsburgh
    Posts
    485
    From the PHP documentation:

No external libraries are needed to build this extension, but if you want PHP to support LFS (large files) on Linux, then you need to have a recent glibc and you need to compile PHP with the following compiler flags: -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64.
    So, I guess it is doable, if you roll your own PHP (which is not that hard).

    -John.
There's a reason armies wear uniforms even though it makes them easier to spot. Sometimes that's what you want. Uniforms suggest organization, power, and numbers. These, in turn, inspire fear. And, as any good operative knows, there is no more effective weapon than fear.

    Hosting Comparison - HostScope - jrrl.com

  24. #24
    ABW Ambassador Snib's Avatar
    Join Date
    January 18th, 2005
    Location
    Virginia
    Posts
    5,303
    Quote Originally Posted by jrrl

    So, I guess it is doable, if you roll your own PHP (which is not that hard).

    -John.
I came across this solution, but if you read the link in my previous post, he mentions that there is some missing functionality when you do this.

    In any event, I cat'ed the file into the PHP script over STDIN and it worked like a charm. It took about an hour and a half to get through a 3GB file, but the server wasn't bogged down to the point that it locked up.

    - Scott
    Hatred stirs up strife, But love covers all transgressions.

  25. #25
    Full Member jollygoodpirate's Avatar
    Join Date
    January 18th, 2005
    Location
    NC
    Posts
    227
I knew it'd work
    I am glad it is working for you!
Are you still going to move to a bigger server? If you did this on a Xeon server with more memory and a RAID setup, you could see an improvement in the amount of time it takes.


Similar Threads

  1. Featured: Way to Get Large Datafeeds Compressed Like Sears etc?
    By glittered in forum Programming / Datafeeds / Tools
    Replies: 15
    Last Post: June 28th, 2014, 01:11 PM
  2. Will the datafeeds in the merchants datafeeds thread track my commissions?
    By john9245 in forum Programming / Datafeeds / Tools
    Replies: 5
    Last Post: March 29th, 2005, 09:42 AM
  3. Split up the directories when working with large datafeeds?
    By Nintendo in forum WebMerge (Fourthworld.com)
    Replies: 1
    Last Post: August 15th, 2004, 05:32 AM
