Web Crawling by Selenium

머신 러닝 training을 위한 이미지 크롤링 라이브러리 google_images_downloads 가 잘 실행되지 않아서 확인해 보았습니다.
2020년 2월부터 Google images DOM이 image element class=”rg_meta notranslate”에서 “rg_i Q4LuWd” 형식으로 변경되면서 더 이상 실행이 되지 않는 문제 대안으로 Selenium을 이용한 크롤링 간편화한 gids 패키지를 이용하는 방법을 사용하였습니다..

def get_soup(url,header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url,headers=header)),'html.parser')    

def main(args):
    query = "typical face"
    query = query.split()
    query = '+'.join(query)
    url = "https://www.google.co.in/search?q="+query+"&source=lnms&tbm=isch"
    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    soup = get_soup(url, headers)
    for a in soup.find_all("img", {"class": "rg_i"}):
        wget.download(a.attrs["data-iurl"], a.attrs["data-iid"])


if __name__ == '__main__':
    from sys import argv
    try:
        main(argv)
    except KeyboardInterrupt:
        pass
    sys.exit()

Install Package

pip install gids

Usage

Example code

from gids import builder

config = {
    'driver_path': '/usr/local/bin/chromedriver',
    'headless': True,
    'window-size': '720x480',
    'disable_gpu': False
}

first_item = {
    'keyword': 'White Shark',
    'limit': 10, # The number of images
    'download_context': './data',
    'path': 'animal' # save in ./data/animal/img_01...10
}

second_item = {
    'keyword': 'Whale Shark',
    'limit': 10, # The number of images
    'download_context': './data',
    'path': 'Shark' # save in ./data/plant/img_01...10
}

items = [first_item, second_item]

downloader = builder.build(config)

downloader.download(items)

Result


    Loading Pages. This may take a few moments...
    Page Scroll done...
    Start to downloading
    Downloading ...White Shark - [./data/animal/White Shark/img_0]
    Downloading ...White Shark - [./data/animal/White Shark/img_1]
    Downloading ...White Shark - [./data/animal/White Shark/img_2]
    Downloading ...White Shark - [./data/animal/White Shark/img_3]
    Downloading ...White Shark - [./data/animal/White Shark/img_4]
    Downloading ...White Shark - [./data/animal/White Shark/img_5]
    Downloading ...White Shark - [./data/animal/White Shark/img_6]
    Downloading ...White Shark - [./data/animal/White Shark/img_7]
    Downloading ...White Shark - [./data/animal/White Shark/img_8]
    Downloading ...White Shark - [./data/animal/White Shark/img_9]
    White Shark download completed. [Successful count = 10].
    Loading Pages. This may take a few moments...
    Page Scroll done...
    Start to downloading
    Downloading ...Whale Shark - [./data/Shark/Whale Shark/img_0]
    Downloading ...Whale Shark - [./data/Shark/Whale Shark/img_1]
    Downloading ...Whale Shark - [./data/Shark/Whale Shark/img_2]
    Downloading ...Whale Shark - [./data/Shark/Whale Shark/img_3]
    Downloading ...Whale Shark - [./data/Shark/Whale Shark/img_4]
    Downloading ...Whale Shark - [./data/Shark/Whale Shark/img_5]
    Downloading ...Whale Shark - [./data/Shark/Whale Shark/img_6]
    Downloading ...Whale Shark - [./data/Shark/Whale Shark/img_7]
    Downloading ...Whale Shark - [./data/Shark/Whale Shark/img_8]
    Downloading ...Whale Shark - [./data/Shark/Whale Shark/img_9]
    Whale Shark download completed. [Successful count = 10].
    Total time is 36.451616048812866 seconds.

Troubleshooting

아래와 같이 에러가 나올 때 chromedriver파일이 있는 디렉토리로 PATH를 설정해줌.

No found chromedriver in this environment. Install on your machine. exception: Message: ‘chromedriver’ executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

export PATH=$PATH:/usr/local/bin/chromedriver

Leave a comment