
Is there a way to bypass the WordPress cache?

Discussion in 'General Programming Chat' started by IMShane, Nov 4, 2011.

  1. IMShane

    IMShane Junior Member

    Joined:
    Sep 20, 2011
    Messages:
    131
    Likes Received:
    23
    I'm trying to scrape the latest news from a big WordPress blog, but the site is heavily cached, which prevents me from doing so. Is there a way to bypass this kind of cache? Maybe some kind of secret GET parameter or similar? I've tried the feeds, but that didn't work.

    Could any experienced programmers share some ideas, please?

    Thanks
     
  2. Xyz01

    Xyz01 Regular Member Premium Member

    Joined:
    Aug 8, 2011
    Messages:
    300
    Likes Received:
    126
    Is it a server-side cache or a client-side cache?
     
  3. tiagorossi

    tiagorossi Junior Member

    Joined:
    Aug 2, 2011
    Messages:
    150
    Likes Received:
    43
    Try clearing the browser cache or installing another browser.
    I know it's a very annoying issue.
     
  4. IMShane

    IMShane Junior Member

    Joined:
    Sep 20, 2011
    Messages:
    131
    Likes Received:
    23
    I'm afraid it's a server-side cache: I always get the same old content, even when I use a newly created browser instance in Python, although a new post must have been published by then. What I'm actually trying to do is scrape the latest post as soon as it's published, but there's always a lag.
     
  5. redlaunch

    redlaunch Registered Member

    Joined:
    May 5, 2011
    Messages:
    51
    Likes Received:
    16
    Quick answer:
    If you're using a bot/script, try a POST request instead of a GET request.

    Long answer:
    (This will get rid of the server-side cache.)

    Busting a cache depends on a lot of things. You need to look at the response headers and understand what cache instructions are being sent to the browser.

    In particular, look at the Vary: header. It lists the request headers the cache uses to tell stored copies apart, so changing one of them can force a fresh copy.

    Popular values and what to do about them:
    Vary: Accept-Encoding :: change the Accept-Encoding header your browser sends
    Vary: Accept-Encoding,User-Agent :: try a different browser (or User-Agent)
    etc.

    If there is some sort of timer on a client-side cache, changing your system time to a future time may help.


    And of course you can always turn off your browser cache completely!
    In Firefox, go to about:config
    search for cache
    set browser.cache.memory.enable to false
    and also browser.cache.disk.enable to false
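A minimal Python sketch of the "look at the headers" step described above. The function name `summarize_cache_headers` and the sample header values are made up for illustration; only the header names themselves (Cache-Control, Vary, Age, Expires, Via, X-Varnish) are real ones you are likely to see on a cached site.

```python
def summarize_cache_headers(headers):
    """Pick out the response headers that usually describe caching."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    interesting = ("cache-control", "vary", "age", "expires", "via", "x-varnish")
    summary = {name: lowered[name] for name in interesting if name in lowered}

    # Extract max-age (in seconds) from Cache-Control when it is present.
    max_age = None
    for directive in lowered.get("cache-control", "").split(","):
        key, _, value = directive.strip().partition("=")
        if key.lower() == "max-age" and value.isdigit():
            max_age = int(value)
    summary["max_age_seconds"] = max_age
    return summary


# Hypothetical headers, shaped like what a cached site might send back.
example = {
    "Cache-Control": "public, max-age=600",
    "Vary": "Accept-Encoding,User-Agent",
    "Age": "127",
}
print(summarize_cache_headers(example))
```

With a live response from `urllib.request.urlopen(url)`, passing `response.headers` should work as well, since the function only relies on `.items()`.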
     
  6. IMShane

    IMShane Junior Member

    Joined:
    Sep 20, 2011
    Messages:
    131
    Likes Received:
    23
    I'm using my own Python script for this, but what POST request can you use on a WordPress blog? I don't know of any...

    Thanks for your suggestion, I'll take a close look at the headers right now!

     
  7. Hostwinds

    Hostwinds Power Member UnGagged Attendee Enterprise Member

    Joined:
    May 17, 2010
    Messages:
    776
    Likes Received:
    550
    Occupation:
    C.E.O.
    Location:
    Seattle
    Try requesting the files directly from the server's IP address rather than going through the domain name. Many server-side caching programs key on the domain in the request, so if you request the page directly from the server, it will give you the freshest content.
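A rough Python sketch of the direct-IP idea, assuming the origin server is reachable over plain HTTP. The helper name `build_direct_request`, the address and the hostname are placeholders, not values from the site in question. Sending a Host header is an extra assumption: a server that hosts several sites on one IP generally needs it to know which site is meant.

```python
import urllib.request


def build_direct_request(ip_address, hostname, path="/"):
    """Build a GET request aimed at the server's IP, naming the site via Host."""
    url = "http://{}{}".format(ip_address, path)
    # The Host header tells a name-based virtual host which site we want.
    return urllib.request.Request(url, headers={"Host": hostname})


# Placeholder values: 203.0.113.10 is a documentation-only address.
request = build_direct_request("203.0.113.10", "blog.example.com", "/")
print(request.full_url, request.get_header("Host"))
# To actually send it: urllib.request.urlopen(request, timeout=10).read()
```

This can only help when the cache runs on a different machine from the web server. If the cache sits on the same box or keys on the Host header, the cached copy will likely come back anyway.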
     
  8. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,607
    Likes Received:
    11,185
    Occupation:
    Pusillanimous Knitter
    Location:
    Buenos Aires
    This is a very interesting question.

    Caching can mean a lot of things. There are many, many implementations and approaches.

    For example, if the site is using MySQL query caching, the cache is limited by size, so a possible way to bypass it is to request pages that have not been viewed recently. Given that there are a lot of pages, this overflows the cache and pushes the cached results for the target page out of it.

    If the site is using a page cache based on time expiration, you probably can't do anything. This can be due either to an internal implementation (see some WP caching plugins) or to an external proxy (like Squid). The trick Hostwinds mentioned would fall into this category, and I was not aware of its existence. If the domain is not on a dedicated IP, though, it will probably fail, as the web server will probably deliver a test page or some other domain hosted on that IP.

    If the site is using an object cache, depending on the implementation, it probably has either size or time limits. So the overflowing method might work again.
     
  9. IMShane

    IMShane Junior Member

    Joined:
    Sep 20, 2011
    Messages:
    131
    Likes Received:
    23
    Thanks a lot for all the ideas!

    I finally came up with a solution, and it works! I just want to share it with you guys. First I took a close look at the headers, as redlaunch suggested, and found out they are using Varnish for caching with a 600 s interval, so you won't get anything new if you keep requesting the same URL within that period. But I noticed the site actually redirects bad requests (a non-existent page or GET parameter) to the homepage. So basically you can just append any non-existent parameter to the end of the URL and Varnish will treat it as a new request, so you always get fresh content!
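The trick above can be sketched in Python roughly like this. The function name `add_cache_buster`, the parameter name `nocache` and the example URLs are arbitrary choices for illustration. It only works when the cache uses the full URL, query string included, as its key; a cache configured to strip unknown parameters would not be fooled.

```python
import time
import uuid
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def add_cache_buster(url, param="nocache"):
    """Return url with a unique, meaningless query parameter appended."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Drop any earlier buster so repeated calls do not pile up parameters.
    query = [(k, v) for k, v in query if k != param]
    # Timestamp plus a random suffix, so two calls never produce the same URL.
    token = "{}-{}".format(int(time.time()), uuid.uuid4().hex[:8])
    query.append((param, token))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_cache_buster("http://blog.example.com/"))
print(add_cache_buster("http://blog.example.com/?s=news"))
```

Each call yields a URL the cache has not seen before, so a polling script can fetch `add_cache_buster(url)` on every pass instead of waiting out the 600 s interval.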

    Hah, cache busting is kinda interesting!
     