How to scrape Twitter URLs from a list of URLs?

Discussion in 'Social Networking Sites' started by hackedd, Aug 21, 2013.

  1. hackedd

    hackedd Junior Member

    Joined:
    Aug 11, 2010
    Messages:
    138
    Likes Received:
    172
    Gender:
    Male
    I have a list of websites from which I want to extract the Twitter URLs. Is it possible to automate the process?

    Example:
    Code:
    [TABLE="width: 291"]
    [TR]
    [TD]careerbuilder.com[/TD]
    [/TR]
    [TR]
    [TD]carefair.com[/TD]
    [/TR]
    [TR]
    [TD]carepages.com[/TD]
    [/TR]
    [TR]
    [TD]caring.com[/TD]
    [/TR]
    [TR]
    [TD]carmitimes.com[/TD]
    [/TR]
    [TR]
    [TD]carolinalive.com[/TD]
    [/TR]
    [TR]
    [TD]carolinascw.com[/TD]
    [/TR]
    [TR]
    [TD]carreview.com[/TD]
    [/TR]
    [/TABLE]
    
    Thanks in advance.
     
  2. divok

    divok Senior Member

    Joined:
    Jul 21, 2010
    Messages:
    1,015
    Likes Received:
    634
    Location:
    http://twitter.com/divok
    Could you elaborate, please?
    Anyway, whatever your exact problem is, regular expressions + Python will scrape the pages and automate the task.
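    Something along these lines would be a rough, untested starting point (the domain list, timeout, and output handling are just placeholders):
    Code:
    import re
    import urllib.request
    
    # Placeholder list -- swap in your own domains
    sites = ["http://careerbuilder.com", "http://carefair.com"]
    
    # Matches the first link to a Twitter profile in the raw HTML
    pattern = re.compile(r'https?://(?:www\.)?twitter\.com/[A-Za-z0-9_]+')
    
    for site in sites:
        try:
            html = urllib.request.urlopen(site, timeout=10).read().decode('utf-8', 'ignore')
            match = pattern.search(html)
            if match:
                print(site, match.group(0))
        except Exception as exc:
            print(site, 'error:', exc)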
     
  3. hackedd

    hackedd Junior Member

    Joined:
    Aug 11, 2010
    Messages:
    138
    Likes Received:
    172
    Gender:
    Male
    I just need their Twitter page URLs, e.g.:

    Code:
    [TABLE="width: 311"]
    [TR]
    [TD]https://twitter.com/careerbuilder[/TD]
    [/TR]
    [TR]
    [TD]https://twitter.com/carefair[/TD]
    [/TR]
    [TR]
    [TD]https://twitter.com/carepages[/TD]
    [/TR]
    [TR]
    [TD]https://twitter.com/caring[/TD]
    [/TR]
    [TR]
    [TD]https://twitter.com/carmitimes[/TD]
    [/TR]
    [TR]
    [TD]https://twitter.com/carolinalive[/TD]
    [/TR]
    [TR]
    [TD]https://twitter.com/carolinascw[/TD]
    [/TR]
    [TR]
    [TD]https://twitter.com/CarReview[/TD]
    [/TR]
    [TR]
    [TD]https://twitter.com/carsdotcom[/TD]
    [/TR]
    [/TABLE]
    
     
  4. divok

    divok Senior Member

    Joined:
    Jul 21, 2010
    Messages:
    1,015
    Likes Received:
    634
    Location:
    http://twitter.com/divok
    You will need to learn the Scrapy library for Python. You can automate this too: just feed it the URL list and it will save all the data you need to a file.
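    A bare-bones spider with a recent version of Scrapy might look something like this (untested sketch; the spider name and output format are just examples):
    Code:
    import scrapy
    
    class TwitterLinkSpider(scrapy.Spider):
        name = "twitter_links"  # example spider name
        start_urls = [
            "http://careerbuilder.com",
            "http://carefair.com",
        ]
    
        def parse(self, response):
            # Yield the first link on each page that points at twitter.com
            for href in response.css("a::attr(href)").extract():
                if "twitter.com" in href:
                    yield {"site": response.url, "twitter": href}
                    break
    
    Save it as twitter_links.py and run it with "scrapy runspider twitter_links.py -o twitter_urls.json"; Scrapy will write the results to the JSON file for you.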
     
  5. Raffy

    Raffy Regular Member

    Joined:
    Nov 30, 2012
    Messages:
    212
    Likes Received:
    613
    Here's a Ruby script that should do what you need. It's untested, but it should work.

    Save the file as twitter_scraper.rb
    Run it with
    Code:
    ruby twitter_scraper.rb
    Code:
    # Requires the mechanize gem (gem install mechanize)
    require 'mechanize'
    
    # Sites to check -- add your own list of URLs here
    array = [
            "http://careerbuilder.com",
            "http://carefair.com",
            "http://carepages.com",
            "http://caring.com",
            "http://carmitimes.com",
            "http://carolinalive.com",
            "http://carolinascw.com",
            "http://carreview.com"
            ]
    
    array.each do |f|
        begin
            agent = Mechanize.new
            agent.user_agent_alias = 'Windows Mozilla'  # pretend to be a normal browser
            agent.follow_meta_refresh = false
            agent.max_history = 1                       # keep memory use down on long lists
            page = agent.get("#{f}")
            # First link on the page whose href points at twitter.com
            twitter = page.link_with(:href => /twitter\.com/).href
            File.open('twitter_urls.txt', 'a+') { |file| file << "#{twitter}\n" }
            puts twitter
        rescue
            # Anything that fails (no Twitter link, timeout, bad host) goes to the error file
            File.open('twitter_errors.txt', 'a+') { |file| file << "#{f}\n" }
            puts $!, $@
        end
    end
     
  6. hackedd

    hackedd Junior Member

    Joined:
    Aug 11, 2010
    Messages:
    138
    Likes Received:
    172
    Gender:
    Male
    Thanks, but it doesn't work: "Cannot load such file -- mechanize (LoadError)"
     
  7. Raffy

    Raffy Regular Member

    Joined:
    Nov 30, 2012
    Messages:
    212
    Likes Received:
    613
    I should've explained it better. You need to install the Mechanize gem/library (https://rubygems.org/gems/mechanize) and then add your list of URLs to the array at the top of that code. When you run it, it should output a text file with the Twitter URLs and a second text file with any errors.
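    If the gem isn't installed yet, installing it from the command line should be all you need:
    Code:
    gem install mechanize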

    PM me if you still need help.
     
    • Thanks x 1
  8.  ﴾͡๏̯͡๏﴿.tk

     ﴾͡๏̯͡๏﴿.tk Power Member

    Joined:
    Oct 6, 2010
    Messages:
    681
    Likes Received:
    114
    Occupation:
    ﴾͡๏̯͡๏﴿ ───█ ﴾͡●
    Location:
    twitter.com/SeX#.﴾͡๏̯͡๏﴿.
    Home Page:
    @Raffy please explain a little more in the thread. Are there 2 or 3 installations?


    I only get the twitter_errors file, with the URLs in it.
     
  9. hackedd

    hackedd Junior Member

    Joined:
    Aug 11, 2010
    Messages:
    138
    Likes Received:
    172
    Gender:
    Male
    Any thoughts on why it doesn't save the data to the TXT files?