[python] Scraping Naver news articles with BeautifulSoup

웅덩이 2022. 4. 15. 00:18

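I'll search Naver News for '토트넘' (Tottenham) and scrape 5 pages of results with BeautifulSoup.

The items to scrape are the article title, the article preview text, and the article image.

📌 Modules to install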

pip install beautifulsoup4
pip install requests

To crawl smoothly with BeautifulSoup, it helps to be able to tell tags and classes apart.

If you need a refresher, see this earlier post!!

[python] Web page crawling: tags, classes, and IDs

puddle-of-devstory.tistory.com

📌 Since I planned to loop over 5 pages, I needed to find the pattern in the URL.

So I clicked through pages 1, 2, and 3 to look for it.

- Page 2: "https://search.naver.com/search.naver?where=news&sm=tab_pge&query=%ED%86%A0%ED%8A%B8%EB%84%98&sort=0&photo=0&field=0&pd=0&ds=&de=&cluster_rank=28&mynews=0&office_type=0&office_section_code=0&news_office_checked=&nso=so:r,p:all,a:all&start=11"

- Page 3: "https://search.naver.com/search.naver?where=news&sm=tab_pge&query=%ED%86%A0%ED%8A%B8%EB%84%98&sort=0&photo=0&field=0&pd=0&ds=&de=&cluster_rank=40&mynews=0&office_type=0&office_section_code=0&news_office_checked=&nso=so:r,p:all,a:all&start=21"

- Back to page 1: "https://search.naver.com/search.naver?where=news&sm=tab_pge&query=%ED%86%A0%ED%8A%B8%EB%84%98&sort=0&photo=0&field=0&pd=0&ds=&de=&cluster_rank=50&mynews=0&office_type=0&office_section_code=0&news_office_checked=&nso=so:r,p:all,a:all&start=1"

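Did you spot what changed and what the pattern is?

The parts that change are cluster_rank and start. However, cluster_rank changed on every click. As the URLs above show, it even went up (from 40 to 50) when I moved from page 3 back to page 1.

After several tries, I found that changing the cluster_rank value does not change which page is shown.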
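Instead of comparing the URLs by eye, the query strings can be parsed and diffed. The following is a minimal sketch using only the standard library; the helper name `changed_params` is made up for this illustration, and the two URLs are the page 2 and page 3 addresses listed above.

```python
from urllib.parse import parse_qsl, urlsplit

# Page 2 and page 3 URLs from the post (split across lines only for readability)
PAGE2 = (
    "https://search.naver.com/search.naver?"
    "where=news&sm=tab_pge&query=%ED%86%A0%ED%8A%B8%EB%84%98&sort=0&photo=0&field=0&pd=0&ds=&"
    "de=&cluster_rank=28&mynews=0&office_type=0&office_section_code=0&news_office_checked=&"
    "nso=so:r,p:all,a:all&start=11"
)
PAGE3 = (
    "https://search.naver.com/search.naver?"
    "where=news&sm=tab_pge&query=%ED%86%A0%ED%8A%B8%EB%84%98&sort=0&photo=0&field=0&pd=0&ds=&"
    "de=&cluster_rank=40&mynews=0&office_type=0&office_section_code=0&news_office_checked=&"
    "nso=so:r,p:all,a:all&start=21"
)


def changed_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for query parameters that differ.

    Hypothetical helper for this example, not part of any library.
    """
    # keep_blank_values=True keeps empty parameters such as "ds=" in the comparison
    params_a = dict(parse_qsl(urlsplit(url_a).query, keep_blank_values=True))
    params_b = dict(parse_qsl(urlsplit(url_b).query, keep_blank_values=True))
    names = sorted(set(params_a) | set(params_b))
    return {
        name: (params_a.get(name), params_b.get(name))
        for name in names
        if params_a.get(name) != params_b.get(name)
    }


diff = changed_params(PAGE2, PAGE3)
print(diff)  # {'cluster_rank': ('28', '40'), 'start': ('11', '21')}
```

Only cluster_rank and start show up in the diff, which matches what clicking through the pages suggested.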

The second part that changes is start, and it does follow a pattern.

Page 1 has start=1, page 2 has start=11, and page 3 has start=21.

📌 For x = 0, 1, 2, 3, 4, computing 10x+1 gives the start value!!
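The 10x+1 rule can be checked quickly in code. This is a rough sketch: the function names `start_for_page` and `build_url` are made up for the example, and it is only an assumption that where, query, and start are enough for the search to work (the post always sends the full parameter list).

```python
from urllib.parse import urlencode

BASE = "https://search.naver.com/search.naver"


def start_for_page(page_index):
    """Map a 0-based page index to the start value: 0 -> 1, 1 -> 11, 2 -> 21."""
    return 10 * page_index + 1


def build_url(query, page_index):
    """Build a search URL with a reduced parameter set (an assumption, see above)."""
    params = {"where": "news", "query": query, "start": start_for_page(page_index)}
    # urlencode percent-encodes the Korean query as UTF-8
    return BASE + "?" + urlencode(params)


starts = [start_for_page(x) for x in range(5)]
print(starts)  # [1, 11, 21, 31, 41]
print(build_url("토트넘", 1))
```

Letting `urlencode` handle the query also means the search word can be written in plain Korean instead of the pre-encoded `%ED%86%A0...` form.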

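When I crawl, I first check a single element in Jupyter Notebook.

# `soup` is the BeautifulSoup object for one page of search results

# Title of the first article
soup.select("a.news_tit")[0].text

# Preview text of the first article
soup.select("a.api_txt_lines.dsc_txt_wrap")[0].text

# Image of the first article
soup.select("a.dsc_thumb > img")[0]["src"]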
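To see what each selector matches without hitting the network, the same three lookups can be run on a small hand-written fragment. The HTML below is a stand-in written for this example (it is not copied from Naver); only the class names are taken from the selectors above.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one search result; NOT real Naver markup
SAMPLE_HTML = """
<div>
  <a class="news_tit" href="#">Sample headline</a>
  <a class="api_txt_lines dsc_txt_wrap" href="#">Sample preview text</a>
  <a class="dsc_thumb" href="#"><img src="https://example.com/thumb.jpg"></a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "a.news_tit": an <a> tag whose class list contains news_tit
first_title = soup.select("a.news_tit")[0].text

# "a.api_txt_lines.dsc_txt_wrap": an <a> tag that has BOTH classes
first_preview = soup.select("a.api_txt_lines.dsc_txt_wrap")[0].text

# "a.dsc_thumb > img": an <img> that is a direct child of <a class="dsc_thumb">
first_img = soup.select("a.dsc_thumb > img")[0]["src"]

print(first_title)
print(first_preview)
print(first_img)
```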

Next, these lookups go inside for loops. The loops I need are:

① Pages: repeat the collection over 5 pages

② Each news item within a single page

import pandas as pd
import requests
from bs4 import BeautifulSoup

title = []
article = []
img = []

# Search URL without the start value (the string is split only for readability)
base_url = (
    'https://search.naver.com/search.naver?'
    'where=news&sm=tab_pge&query=%ED%86%A0%ED%8A%B8%EB%84%98&sort=0&photo=0&field=0&pd=0&ds=&de=&cluster_rank=28'
    '&mynews=0&office_type=0&office_section_code=0&news_office_checked=&nso=so:r,p:all,a:all&start='
)

# Repeat the collection over 5 pages
for i in range(5):
    url = base_url + str(10 * i + 1)
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')

    # Select each kind of element once per page
    titles = soup.select("a.news_tit")
    previews = soup.select("a.api_txt_lines.dsc_txt_wrap")
    thumbs = soup.select("a.dsc_thumb > img")

    # Number of news items on the page (counting titles or previews both work).
    # Indexing by position assumes every item has a preview and a thumbnail.
    for j in range(len(titles)):
        title.append(titles[j].text)
        article.append(previews[j].text)
        img.append(thumbs[j]["src"])

df = pd.DataFrame()
df['title'] = title
df['article'] = article
df['img'] = img

df

This is part of the resulting DataFrame. Scraping 5 pages produced 50 rows of data!!
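The loop above matches titles, previews, and images purely by position, so a result without a thumbnail would shift the lists or raise an IndexError. One possible alternative is to walk each result container and look up the three pieces inside it. The sketch below runs on hand-written HTML; the container selector `li.result` and the helper `parse_results` are hypothetical, so the real container class would need to be checked in the browser's developer tools.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a results page; "li.result" is a hypothetical container
SAMPLE_HTML = """
<ul>
  <li class="result">
    <a class="news_tit">First headline</a>
    <a class="api_txt_lines dsc_txt_wrap">First preview</a>
    <a class="dsc_thumb"><img src="https://example.com/1.jpg"></a>
  </li>
  <li class="result">
    <a class="news_tit">Second headline</a>
    <a class="api_txt_lines dsc_txt_wrap">Second preview</a>
  </li>
</ul>
"""


def parse_results(html):
    """Return one dict per result; missing pieces become None instead of shifting rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("li.result"):
        # select_one returns None when nothing matches inside this container
        title_tag = item.select_one("a.news_tit")
        preview_tag = item.select_one("a.api_txt_lines.dsc_txt_wrap")
        img_tag = item.select_one("a.dsc_thumb > img")
        rows.append({
            "title": title_tag.get_text(strip=True) if title_tag else None,
            "article": preview_tag.get_text(strip=True) if preview_tag else None,
            "img": img_tag.get("src") if img_tag else None,
        })
    return rows


rows = parse_results(SAMPLE_HTML)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` would give the same title, article, and img columns, with an empty img cell for the second result.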