Scrape YouTube subtitles for fresh content. One last issue

Discussion in 'YouTube' started by yagami-iori, Jan 11, 2019.

  1. yagami-iori

    yagami-iori Junior Member

    Joined:
    Nov 13, 2009
    Messages:
    104
    Likes Received:
    272
    Hi,

    I've created a bot that extracts the closed captions (CC) of YouTube videos. The only problem is that the extracted text doesn't contain punctuation, capitalization, or line breaks.

    Do you know whether I can format raw text like this with NLTK (or any other NLP library) and Python?

    I've gone through the documentation but I can't find anything that would help me with this task.
    I may share the bot with everyone, as I only need it for a short time to get a site approved by AdSense.

    Example:

    Input:

    python is an interpreted high-level general-purpose programming language created by guido van rossum and first released in 1991 python has a design philosophy that emphasizes code readability notably using significant whitespace it provides constructs that enable clear programming on both small and large scales in july 2018 van rossum stepped down as the leader in the language community

    Output:

    Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales. In July 2018, Van Rossum stepped down as the leader in the language community.

    Thank you,
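
    The capitalization half of the example above can be sketched in plain Python. This is only a minimal sketch under stated assumptions: `restore_case` and `PROPER_NOUNS` are hypothetical names, not part of NLTK or spaCy, and the word list is hand-written for this example. It also assumes the sentence-ending punctuation is already present; inserting punctuation into unpunctuated captions is a separate step that would likely need a trained punctuation-restoration model, which this sketch does not attempt.

```python
import re

# Hypothetical lookup of proper nouns to restore. A real pipeline would
# more likely take these from a named-entity recognizer or a truecasing model.
PROPER_NOUNS = {
    "python": "Python",
    "guido": "Guido",
    "rossum": "Rossum",
    "july": "July",
}


def restore_case(text, proper_nouns=PROPER_NOUNS):
    """Capitalize known proper nouns and the first letter of each sentence."""

    def fix_word(match):
        # Swap in the cased form when the word is in the lookup table.
        word = match.group(0)
        return proper_nouns.get(word.lower(), word)

    text = re.sub(r"[A-Za-z]+", fix_word, text)

    # Uppercase a lowercase letter at the very start of the text,
    # or directly after ., ! or ? followed by whitespace.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


if __name__ == "__main__":
    raw = "python is a language. it was created by guido van rossum."
    print(restore_case(raw))
```

    Keeping the proper nouns in a dictionary makes it easy to swap in a larger list later without touching the sentence-start logic.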
     
  2. Larry Igna

    Larry Igna Regular Member

    Joined:
    Nov 25, 2016
    Messages:
    244
    Likes Received:
    224
    Gender:
    Male
  3. Cyberars

    Cyberars Jr. VIP

    Joined:
    Nov 5, 2011
    Messages:
    699
    Likes Received:
    263
    Occupation:
    YouTube Views Main Provider
    Location:
    Since 2006
    Just out of curiosity... what do you gain from extracting CCs from videos?
    Is this only for unique text content?
     
  4. yagami-iori

    yagami-iori Junior Member

    Thank you @Larry Igna. I've looked into spaCy but didn't find what I wanted. Anyway, I've opted to only scrape videos with captions. This solves my problem, and there are plenty of them :)
    @Cyberars Yes, for fresh, quality content in a matter of seconds.
     
  5. Cyberars

    Cyberars Jr. VIP

    Nice idea, but how can a bot extract CC from videos that don't have captions?
     
  6. yagami-iori

    yagami-iori Junior Member

    It can't. It only extracts text from videos that have captions or auto-generated captions.