
Best way to convert a URL to a unique hash to make it shorter (if possible)

Discussion in 'General Programming Chat' started by tb303, Oct 14, 2014.

  1. tb303

    tb303 Power Member

    Joined:
    Dec 18, 2011
    Messages:
    601
    Likes Received:
    280
    I'm working on a small side project that scrapes URLs into a list. It works fine until the number of URLs gets very large and RAM usage gets too high.

    I thought it might be more efficient to somehow hash each URL to make it shorter and use less RAM.
    But is it actually possible to make a *unique*, reversible hash from a URL that is a lot shorter than its source?
     
  2. m4dm4n

    m4dm4n Jr. VIP Premium Member

    Joined:
    Sep 15, 2010
    Messages:
    221
    Likes Received:
    92
    Occupation:
    /dev/full
    Location:
    /dev/urandom
    The way I usually do this is to SHA-256 or SHA-512 the URL and store the full URL in a file named something/.../.../.../thehash

    The ... components correspond to the first, second and third characters of the hash (I do this to make sure I don't end up with a folder containing millions of files).
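
A rough sketch of that layout in Python, assuming SHA-256 and a root directory of your choosing. The function names `shard_path` and `store_url` are made up for illustration. Note that the hash itself is not reversible; the original URL is recovered by reading the file the hash points to.

```python
import hashlib
from pathlib import Path


def shard_path(url: str, root: Path) -> Path:
    """Return root/a/b/c/<hash>, where a, b, c are the first three hex characters."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return root / digest[0] / digest[1] / digest[2] / digest


def store_url(url: str, root: Path) -> bool:
    """Write the URL to its shard file; return False if it was already stored."""
    path = shard_path(url, root)
    if path.exists():
        return False
    # Create the three nested shard directories on demand.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(url, encoding="utf-8")
    return True
```

With hex digests, three levels give 16 * 16 * 16 = 4096 leaf folders, so a few million URLs come to roughly a thousand files per folder.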
     
  3. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    Buffer X URLs in memory and then dump them to a database.
     
  4. tb303

    tb303 Power Member

    Joined:
    Dec 18, 2011
    Messages:
    601
    Likes Received:
    280
    Thanks for the quick replies :)

    Interesting solution. There shouldn't be too much overhead when checking whether a URL already exists, either.


    This is what I had started to do, but I was worried it might still run into problems when checking whether a URL already exists.
     
  5. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    Create a unique index on the URL column in the database.
    Not the ideal solution, but it'll work.
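
A minimal sketch of the buffer-then-dump idea combined with a unique index, assuming SQLite through Python's standard `sqlite3` module. The table name `urls`, the index name, the helper names and `BATCH_SIZE` are all placeholders, not anything prescribed in this thread.

```python
import sqlite3

BATCH_SIZE = 1000  # assumed value for "X"; tune it to your memory budget


def open_db(path: str) -> sqlite3.Connection:
    """Open the database and make sure the table and unique index exist."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT NOT NULL)")
    conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_urls_url ON urls (url)")
    return conn


def flush(conn: sqlite3.Connection, buffer: list) -> None:
    """Dump the buffered URLs to the database and empty the buffer."""
    with conn:  # commits on success, rolls back on error
        # INSERT OR IGNORE skips rows that would violate the unique index.
        conn.executemany(
            "INSERT OR IGNORE INTO urls (url) VALUES (?)",
            [(u,) for u in buffer],
        )
    buffer.clear()


def add_url(conn: sqlite3.Connection, buffer: list, url: str) -> None:
    """Buffer a URL in memory and flush once the buffer is full."""
    buffer.append(url)
    if len(buffer) >= BATCH_SIZE:
        flush(conn, buffer)
```

Call `flush` once more when the scrape finishes so the last partial batch is written. The database does the duplicate check here, so no full URL list has to stay in RAM.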
     
  6. m4dm4n

    m4dm4n Jr. VIP Premium Member

    Joined:
    Sep 15, 2010
    Messages:
    221
    Likes Received:
    92
    Occupation:
    /dev/full
    Location:
    /dev/urandom
    It really depends on your number of URLs and your QPS (queries per second).
    If you're in the millions ballpark for URLs and at 100-500 QPS, then a DB should be fine; any more than that and you need to think outside the box.
     
  7. sanishan

    sanishan Newbie

    Joined:
    Mar 2, 2009
    Messages:
    24
    Likes Received:
    7
    Occupation:
    FreeLancer
    Location:
    Out of Space
    Well, as a developer, there are a few workarounds you can apply.

    For example, most of your URLs probably share the same domain name, so a name like blackhatworld can become bhw.

    This makes each URL a little bit shorter.

    When you need to revert, you know that BHW becomes BlackHatWorld.

    Thanks,
     
  8. webwizkidz

    webwizkidz Junior Member

    Joined:
    Jul 4, 2014
    Messages:
    132
    Likes Received:
    6

    When you want to decode it, just replace BHW with BlackHatWorld :)
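
The substitution idea above could be sketched like this in Python. The prefix table is hypothetical, and using control characters as tokens is an assumption of this sketch: they cannot appear in a valid URL, so a shortened string is never confused with an ordinary one.

```python
# Hypothetical table: one short token per prefix that repeats often.
PREFIXES = {
    "\x01": "http://www.blackhatworld.com/",
    "\x02": "https://www.blackhatworld.com/",
}


def shorten(url: str) -> str:
    """Replace a known prefix with its one-character token, if any matches."""
    for token, prefix in PREFIXES.items():
        if url.startswith(prefix):
            return token + url[len(prefix):]
    return url  # no known prefix: store the URL unchanged


def restore(short: str) -> str:
    """Reverse shorten(): expand a leading token back into its prefix."""
    prefix = PREFIXES.get(short[:1])
    if prefix is not None:
        return prefix + short[1:]
    return short
```

Unlike a hash, this is fully reversible, but it only saves the length of the shared prefix per URL.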