
Help me speed up my bot - rewards available

Discussion in 'Scripting' started by AVeryWetFish, May 16, 2017.

  1. AVeryWetFish

    AVeryWetFish Newbie

    Joined:
    Mar 8, 2017
    Messages:
    2
    Likes Received:
    0
    Gender:
    Male
    I've got a custom-written script that pulls data from a website fairly frequently. The website is massive and most of you will have heard of it, though I strongly doubt anyone here has done anything like this. I've hit a point where I'm at a loss as to how to speed things up (latency is very important for me). From what I can see, by mutating some URL parameters I can avoid being served cached data from Akamai, but I've noticed strange behaviour I can't explain.

    In my experiments, if I run the script with multiple threads sharing the same pool of proxies, each thread querying about once per second, there can be a delay of 3-20 seconds between one thread seeing a change and the others seeing it (while all of them are still querying once a second). Also, even after a thread gets the newest data, the next query from the same thread (through a different proxy) can return the old data. This is a pretty massive delay for me, as I'm being beaten by competitors. The cache headers returned with the response suggest it's probably not being cached (Expires minus Date = 90 seconds), which matches the Cache-Control they're using.
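The Expires minus Date figure mentioned above can be read off any response with a few lines of Python. This is only a rough sketch: the function name `freshness_seconds` and the sample header values are made up for illustration, and nothing here is taken from the poster's private script.

```python
from email.utils import parsedate_to_datetime


def freshness_seconds(headers):
    """Return Expires minus Date in seconds, or None if it cannot be computed."""
    date = headers.get("Date")
    expires = headers.get("Expires")
    if date is None or expires is None:
        return None
    try:
        delta = parsedate_to_datetime(expires) - parsedate_to_datetime(date)
    except (TypeError, ValueError):
        # Servers sometimes send values such as "0" or "-1" that are not dates.
        return None
    return delta.total_seconds()


# Hypothetical header values, chosen to reproduce the 90 second gap.
sample = {
    "Date": "Tue, 16 May 2017 10:00:00 GMT",
    "Expires": "Tue, 16 May 2017 10:01:30 GMT",
}
print(freshness_seconds(sample))  # 90.0
```

Logging this value next to the proxy used for each request would at least show whether stale responses come with different header values than fresh ones.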

    List of things I've tried (with limited success):
    1. The obvious no-cache headers
    2. Appending a random parameter/value to the URL (has no effect - I have to mutate and randomly generate the URL itself to avoid Akamai's cache)
    3. Trying different locations for the proxies (EU/USA/Russia/China - no single location seems to be the best)
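For readers unfamiliar with the first two items, they usually look something like the sketch below. The header set, the parameter name `_cb` and the example URL are assumptions for illustration, not the poster's actual code, and the poster reports that the random parameter alone made no difference against this site.

```python
import secrets
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Request headers that ask intermediaries to revalidate instead of serving a stored copy.
NO_CACHE_HEADERS = {"Cache-Control": "no-cache", "Pragma": "no-cache"}


def cache_busted_url(url, param="_cb"):
    """Append a random query parameter so each request has a unique URL."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, secrets.token_hex(8)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URL; each call yields a different random suffix.
print(cache_busted_url("https://example.com/items?type=a"))
```

A CDN can be configured to ignore unknown query parameters when building its cache key, which would be one possible explanation for why this approach has no effect here.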

    (I've tried countless other things, but won't list them here. I'm hoping for some niche trick that someone with a lot of experience might know.)

    My understanding of distributed systems is limited and I might be missing something obvious - I originally assumed the data might be distributed, but from small things I’ve been noticing I’m not sure if this is the case.

    My code is private and I'm not going to share any of it here, sorry. The code itself is solid and there's little I can do to speed it up. The program does the following:
    • Spawns multiple threads for each item type I'm searching for.
    • Runs each thread independently, drawing on a shared pool of proxies that rotates after a single use.
    • Fetches data from a URL (as JSON).
    • Decodes it and processes anything new in it.
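Since the real code is private, the steps above can only be sketched under assumptions. In the outline below everything is hypothetical: the `ProxyPool` class, the `{"items": [{"id": ...}]}` JSON shape, the one-thread-per-item-type simplification, and `fake_fetch`, which stands in for the real HTTP request so the example runs without a network.

```python
import itertools
import json
import threading


class ProxyPool:
    """Thread-safe round-robin pool: every call hands out the next proxy."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def take(self):
        with self._lock:
            return next(self._cycle)


def poll_once(item_type, pool, fetch, seen):
    """Fetch once through a fresh proxy and return only items not seen before."""
    proxy = pool.take()
    payload = json.loads(fetch(item_type, proxy))
    new_items = [item for item in payload["items"] if item["id"] not in seen]
    seen.update(item["id"] for item in new_items)
    return new_items


def worker(item_type, pool, fetch, rounds, results):
    seen = set()  # per-thread record of item ids already processed
    for _ in range(rounds):
        results[item_type].extend(poll_once(item_type, pool, fetch, seen))


def run(item_types, pool, fetch, rounds=3):
    """Start one thread per item type and collect the new items each one found."""
    results = {t: [] for t in item_types}
    threads = [
        threading.Thread(target=worker, args=(t, pool, fetch, rounds, results))
        for t in item_types
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_fetch(item_type, proxy):
    # Stand-in for the HTTP request; always returns the same single item.
    return json.dumps({"items": [{"id": f"{item_type}-1"}]})


print(run(["a", "b"], ProxyPool(["proxy1", "proxy2"]), fake_fetch))
```

Because the proxy changes on every request, consecutive queries from one thread may reach different edge servers, which is worth keeping in mind when one query returns new data and the next returns old data.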

    For any reasonable suggestion that makes a noticeable speed/performance improvement I'd happily send a minimum of $50 in BTC, probably more depending on the effect. Post any suggestions/tips below and I'll let you know how they go.
     
  2. Repulsor

    Repulsor Power Member

    Joined:
    Jun 11, 2013
    Messages:
    766
    Likes Received:
    275
    Location:
    PHP Scripting ;)
    When you say script, are we talking about a PHP script? I guess yes. Have you done any unit tests or benchmarking to see what takes the most time?
    What level of optimization are we talking about here? Shaving seconds off a process that takes a minute? Or shaving seconds off a process that takes just a few seconds? The latter would be hard, obviously.

    We should start with benchmarking to see where the bottleneck is and go from there. Also, is the process data-write intensive? Is the DB the actual bottleneck, or is pulling the data? Are you using anything like Gearman for handling threads?
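One rough way to do that kind of benchmarking is to time each phase separately and see which one dominates. The sketch below is in Python rather than PHP, since the thread never confirms the language; the phase names and the `time.sleep` call standing in for the network request are assumptions.

```python
import json
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def timed(phase, totals):
    """Add the wall-clock time spent inside the block to totals[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[phase] += time.perf_counter() - start


def slowest_phase(totals):
    """Return the name of the phase with the largest accumulated time."""
    return max(totals, key=totals.get)


totals = defaultdict(float)
with timed("fetch", totals):
    time.sleep(0.01)  # stand-in for the network request
with timed("decode", totals):
    json.loads('{"items": []}')

for phase, seconds in totals.items():
    print(f"{phase}: {seconds * 1000:.2f} ms")
print("slowest:", slowest_phase(totals))
```

If the fetch phase dominates, the time is going to the network and the remote cache rather than to the script itself, and tuning the code would not help much.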

    Well, a lot of things can matter.