Python Data Collection and Crawling - alpaco | Data Collection and Crawling

Data Collection

Data collection methods

Data platforms
Crawler programs

Data platforms

Data portals: AIHub, Public Data Portal
AI competitions: Kaggle, Dacon

Client and server

Client: requests a service

Request header

Server: provides a service

Response header
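The request/response exchange can be observed locally without touching a real website. The following is a minimal sketch using only the standard library: a toy server on 127.0.0.1 records the request headers it receives and answers with its own response headers. The handler name `DemoHandler`, the `/hello` path, and the `demo-crawler/0.1` User-Agent are illustrative assumptions, not part of the original notes.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

received_request_headers = {}  # filled in by the server when a request arrives


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: records the request headers and returns a tiny HTML page."""

    def do_GET(self):
        received_request_headers.update(self.headers.items())
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # The client side: send a GET request with a custom request header.
        conn = HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/hello", headers={"User-Agent": "demo-crawler/0.1"})
        resp = conn.getresponse()
        status = resp.status
        response_headers = dict(resp.getheaders())
        body = resp.read()
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return status, response_headers, body


status, response_headers, body = run_demo()
print(status)                                  # status code sent by the server
print(received_request_headers["User-Agent"])  # request header seen by the server
print(response_headers["Content-Type"])        # response header seen by the client
```

A crawler plays the client role here: it decides which request headers to send, and it reads the status code and response headers before parsing the body.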

Crawling

Types of clients

Web browser
Crawler

URL
Data parsing

HTML essentials

Tags form a hierarchy
Tag: <start tag> + child tags (or text) + </end tag>
Start tag: name + attributes
Attribute: extra detail about the tag, e.g. class
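To make the hierarchy concrete, here is a small sketch that walks an HTML snippet with the standard library's `HTMLParser` and prints each start tag with its attributes, indented by nesting depth. The snippet and the `OutlinePrinter` class are made up for illustration, and the depth counter assumes every tag is explicitly closed (void tags such as `<br>` would throw it off).

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Collects one line per start tag or text node, indented by depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the start tag
        attr_text = " ".join(f"{name}={value!r}" for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}> {attr_text}".rstrip())
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + f"text: {text}")


sample_html = (
    "<div class='news'>"
    "<strong class='tit-g clamp-g'><a href='/article/1'>Headline</a></strong>"
    "</div>"
)
parser = OutlinePrinter()
parser.feed(sample_html)
parser.close()
print("\n".join(parser.lines))
```

The output shows `div` containing `strong`, which contains `a`, which contains the text: the same nesting a selector later relies on.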

Python libraries

requests

Accessing a web page
resp = requests.get(url)

BeautifulSoup

Parsing the data
soup = BeautifulSoup(resp.content, 'lxml')
tags = soup.select('a')

Understanding URLs

url = protocol :// address / resource path ? query string
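The four parts can be pulled apart with the standard library's `urllib.parse`. This is a small sketch; the example URL reuses the Aladin address that appears later in these notes with two of its query parameters.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://www.aladin.co.kr/shop/common/wbest.aspx?BestType=Bestseller&page=2"
parts = urlsplit(url)

print(parts.scheme)  # protocol
print(parts.netloc)  # address (host)
print(parts.path)    # resource path
print(parts.query)   # query string

# parse_qs turns the query string into a dict of lists (a key may repeat)
query = parse_qs(parts.query)
print(query)
```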

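Target tags

Tag name only
Attributes

class

a.b
An a tag whose class attribute is b
If the class value contains spaces, it lists several classes; replace each space with '.' (e.g. class='b c' becomes a.b.c).

a#b
An a tag whose id attribute is b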
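The selector forms above can be tried on an inline HTML string, with no network access needed. This sketch assumes `bs4` is installed and uses the built-in `html.parser` so that `lxml` is not required; the HTML and its class and id names are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <a class="keyword" href="/k1">first</a>
  <a class="keyword hot" href="/k2">second</a>
  <a id="home" href="/">home</a>
  <span class="keyword">not a link</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("a")                      # every <a> tag
by_class = soup.select("a.keyword")            # <a> tags that have class keyword
by_two_classes = soup.select("a.keyword.hot")  # class="keyword hot": space becomes '.'
by_id = soup.select("a#home")                  # the <a> tag whose id is home

print(len(by_tag), len(by_class), len(by_two_classes), len(by_id))
print([tag.text for tag in by_class])
```

Note that `a.keyword` skips the `<span class="keyword">`, because the tag name and the class must both match.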

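Using the crawling libraries

1. get: fetching the data

import requests
from bs4 import BeautifulSoup

1 resp = requests.get(url)

# type(resp) -> requests.models.Response
# printing resp shows: <Response [200]>

> 2xx: success (typically 200)
> 4xx, 5xx: failure (4xx are client errors such as 404, 403, 402; 5xx are server errors such as 503)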
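The status-code ranges can be turned into a small helper. This is a sketch using the standard library's `HTTPStatus` for the official reason phrases; the function name `describe_status` is an assumption made for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found: client error'."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase}: {outcome}"


for code in (200, 402, 403, 404, 503):
    print(describe_status(code))
```

With requests, the number to pass in is `resp.status_code`; checking it before parsing avoids feeding an error page to the parser.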

2. select: parsing the data

2-0 Decoding

# resp.text -> str (already decoded)
# resp.content -> bytes (raw response body)
# resp.content.decode('utf-8') -> str, decoded explicitly as UTF-8
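The difference between bytes and str, and why the chosen encoding matters, can be shown without a network request. In this sketch the `raw` variable stands in for `resp.content`; the sample text is made up.

```python
text = "크롤링 crawling"

# What a server would send over the wire: bytes
raw = text.encode("utf-8")
print(type(raw).__name__, len(raw))  # more bytes than characters for Korean text

# Decoding with the right encoding restores the original string
decoded = raw.decode("utf-8")
print(decoded == text)

# Decoding with the wrong encoding succeeds here but yields garbled text
garbled = raw.decode("latin-1")
print(garbled == text)
```

This is one reason `resp.text` can look garbled on Korean pages when the encoding is guessed wrong, and why decoding `resp.content` explicitly (or passing the bytes straight to BeautifulSoup) is a common choice.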

2-1 soup = BeautifulSoup( resp.content, 'lxml')

# type(soup) -> bs4.BeautifulSoup, built here with the lxml parser
# Available parsers: lxml, html.parser, html5lib

2-2 tags = soup.select(' ')

# type(tags) -> a list-like collection of Tag objects

Find by tag name.

soup.select('a')

Use the class when one is available.

soup.select('a.keyword')
a tags with the class keyword

Spell out the hierarchy precisely.

tag (space) tag
soup.select('strong a')
<strong class='tit-g clamp-g'>
  <a></a>
</strong>

2-3 Reading values from a Tag object

object.text

Returns the text inside the tag

object['attribute name']

Returns the value of that attribute
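Both accessors can be tried on an inline snippet. This sketch assumes `bs4` is installed and uses `html.parser`; the HTML and the example.com URL are placeholders. It also shows `Tag.get`, which returns None for a missing attribute where the square-bracket form would raise KeyError.

```python
from bs4 import BeautifulSoup

html = """
<strong class="tit-g clamp-g">
  <a href="https://example.com/news/1">First headline</a>
</strong>
<strong class="tit-g clamp-g">
  <a>Headline without a link</a>
</strong>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for a_tag in soup.select("strong.tit-g.clamp-g a"):
    title = a_tag.text.strip()  # text inside the tag
    link = a_tag.get("href")    # attribute value, or None if it is missing
    rows.append((title, link))

print(rows)
print(soup.select("a")[0]["href"])  # square brackets work when the attribute exists
```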

Crawling practice

Storing results in a dictionary

import requests
from bs4 import BeautifulSoup

url = ' '  # target page URL
rspe = requests.get(url)
soup = BeautifulSoup(rspe.content, 'lxml')

# <a> tags inside <strong class='tit-g clamp-g'>
a_tags = soup.select('strong.tit-g.clamp-g a')

news_info = {'title': [], 'url': []}
for a_tag in a_tags:
    news_info['title'].append(a_tag.text)
    news_info['url'].append(a_tag['href'])

news_info

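Passing the URL query string as a dictionary

import requests
from tqdm.notebook import tqdm

url = 'https://www.aladin.co.kr/shop/common/wbest.aspx'

for i in tqdm(range(20)):
    page = i + 1  # pages are numbered from 1
    params = {
        'BestType': 'Bestseller',
        'BranchType': 1,
        'CID': 0,
        'page': page,
        'cnt': 1000,
        'SortOrder': 1,
    }
    # requests appends params to the URL as ?key=value&...
    resp = requests.get(url, params=params)
    print(resp.url)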
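To preview the URLs such a loop produces without sending any requests, the query string can be built with the standard library's `urlencode`. This is a sketch that approximates what the `params=` argument does for simple values like these; the helper name `build_page_url` is an assumption for this example.

```python
from urllib.parse import urlencode

base_url = "https://www.aladin.co.kr/shop/common/wbest.aspx"


def build_page_url(page):
    """Return the bestseller-list URL for one page number."""
    params = {
        "BestType": "Bestseller",
        "BranchType": 1,
        "CID": 0,
        "page": page,
        "cnt": 1000,
        "SortOrder": 1,
    }
    # urlencode joins the pairs as key=value&key=value, in dict order
    return f"{base_url}?{urlencode(params)}"


urls = [build_page_url(page) for page in range(1, 4)]
for u in urls:
    print(u)
```

Keeping the parameters in a dictionary means only the `page` value changes between iterations, which is easier to read and less error-prone than string concatenation.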

Downloading an image (using rspe.content)

import requests

img_url = ' '  # image URL
rspe = requests.get(img_url)

# Open in binary write mode ('wb') because rspe.content is bytes
with open('img001.jpg', 'wb') as f:
    f.write(rspe.content)

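Other libraries

Loop progress bar

from tqdm.notebook import tqdm

for i in tqdm(range(10), desc='Crawling progress'):
    pass  # per-iteration crawling work goes here

Delay between requests (to avoid being blocked)

import time

time.sleep(2)  # wait 2 seconds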
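A fixed pause works; a common variation adds a small random jitter so that requests do not arrive at a perfectly regular interval. The sketch below is one possible shape for that, not something the original notes prescribe: the helper name `polite_sleep` and its parameters are assumptions, and the demo loop uses very short delays so it finishes quickly.

```python
import random
import time


def polite_sleep(base_seconds=2.0, jitter_seconds=1.0):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds."""
    delay = base_seconds + random.uniform(0, jitter_seconds)
    time.sleep(delay)
    return delay


recorded = []
for page in range(1, 4):
    # the request for this page would go here
    recorded.append(polite_sleep(base_seconds=0.01, jitter_seconds=0.01))

print(recorded)
```

In a real crawl the defaults (about 2 to 3 seconds) are closer to the `time.sleep(2)` used above.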

Replacing and removing unwanted tags and phrases

        # Fragment from inside a loop over article links; assumes `import re` and that
        # link, headers, and list_hankyung are defined earlier.
        resp = requests.get(link, headers=headers)
        soup = BeautifulSoup(resp.content, 'lxml')

        tags_articletxt = soup.select('#articletxt')[0]

        # Clean up tags (remove them together with their contents)
        for tag in tags_articletxt.find_all(['figcaption', 'figure', 'strong']):
            tag.decompose()
        for tag in tags_articletxt.find_all('a', href='https://marketinsight.hankyung.com/'):
            tag.decompose()

        # Handle h2 headings (insert a marker, then drop the tag)
        for h2_tag in tags_articletxt.find_all('h2'):
            h2_tag.insert_before('@MARKER_h2@')  # mark where the heading starts
            h2_tag.unwrap()  # drop the <h2> wrapper but keep its text

        # Convert to text, then clean up the rest
        text_articletxt = tags_articletxt.get_text(separator='\n', strip=False)
        text_articletxt = re.sub(r'.+기자.+@.+', '', text_articletxt)  # remove the reporter byline and email line ('기자' means reporter)
        text_articletxt = re.sub(r'[ \xa0]+', ' ', text_articletxt)  # collapse runs of spaces and &nbsp; into one space
        text_articletxt = re.sub(r'\n[\s\n]*\n', '\n', text_articletxt)  # collapse consecutive line breaks into one

        # Remove the marker; the line break after it remains, leaving a blank line before each former h2 heading
        text_articletxt = text_articletxt.replace('@MARKER_h2@', '')

        print(text_articletxt.strip())
        list_hankyung['contents'].append(text_articletxt.strip())
