Crawling href links with Python

2019. 8. 13. 15:26

https://www.2025ad.com/news

https://observer.com/2019/08/self-driving-fed-policy-trump-kills-obama-era-advisory-group/

From the first site, I'll scrape each article's URL and title.

From the second site, I'll scrape the article content.
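
The link-collection step can be tried offline before running a browser. The following is a minimal sketch, assuming the listing page marks each article link with the class `hs-rss-title` (the class the script in this post searches for). The HTML snippet, the example.com URLs, and the helper name `extract_links` are made up for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the news listing page.
SAMPLE_HTML = """
<div class="hs-rss-item">
  <a class="hs-rss-title" href="https://example.com/post-1">First post</a>
</div>
<div class="hs-rss-item">
  <a class="hs-rss-title" href="https://example.com/post-2"> Second post </a>
</div>
<a class="other" href="https://example.com/ignored">Ignored</a>
"""


def extract_links(html, css_class="hs-rss-title"):
    """Return (title, url) pairs for every <a> tag with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for a in soup.find_all("a", class_=css_class):
        href = a.get("href")  # .get() returns None instead of raising KeyError
        if href:
            pairs.append((a.get_text(strip=True), href))
    return pairs


links = extract_links(SAMPLE_HTML)
for title, url in links:
    print(title, url)
```

Keeping each title and URL together as a pair avoids the two parallel lists drifting out of sync if one link is skipped.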

# -*- coding: utf-8 -*-
"""
Created on Tue Aug 13 14:09:57 2019

@author: User
"""

import json
import requests
import datetime
import random
import time
from bs4 import BeautifulSoup
from selenium import webdriver

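# Basic Selenium setup.
# Note: this uses the Selenium 3 API. Later Selenium 4 releases removed the
# positional driver path and the find_elements_by_* helpers; there, use
# Service(...) and find_elements(By.TAG_NAME, ...) instead.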
driver = webdriver.Chrome(r'C:\Users\User\Desktop\chromedriver_win32\chromedriver')
driver.get('https://www.2025ad.com/news')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
# time.sleep(5)

# Basic requests version (no browser).
# url = 'https://www.2025ad.com/news'
# html = requests.get(url).text
# soup = BeautifulSoup(html, 'html.parser')
# time.sleep(5)

# Collect the article URLs and titles.
list_urls = []
list_titles = []
for i in soup.find_all("a", {'class':'hs-rss-title'}) :
    list_titles.append(i.getText())
    list_urls.append(i['href'])

for i in list_urls:
    print(i)

for i in list_titles:
    print(i)

# Crawl the content of a single article
driver.get('https://observer.com/2019/08/self-driving-fed-policy-trump-kills-obama-era-advisory-group/')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

list_contents = []
contents = driver.find_elements_by_tag_name('p')
for i in contents:
    list_contents.append(i.text)
print(list_contents)

# Crawl the content of each collected URL
total_content = []
for url in list_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # Collect the text of every <p> element on this page.
    list_contents = []
    contents = driver.find_elements_by_tag_name('p')
    for p in contents:
        list_contents.append(p.text)

    total_content.append(list_contents)

print(total_content)

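# Alternative ways to do the same thing
##################################################
# Return to the listing page first; the driver is still on the last article.
driver.get('https://www.2025ad.com/news')
url_contents = driver.find_elements_by_class_name('hs-rss-title')

list_urls = []
for i in url_contents:
    list_urls.append(i.get_attribute('href'))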

#post-1194880 > div > p:nth-child(2)
list_contents = []
title = ''
for i in soup.find_all("div", {'class':'content-area'}) :
    list_contents.append(i.getText())


for i in list_contents:
    print(i)

print(soup.select('#post-1194880 > div > p:nth-of-type(2)'))

################################################
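
The content-extraction step can be checked the same way without a browser. This is a rough sketch, assuming an article page keeps its paragraphs inside a `div` with the class `content-area`, as the code above does. The markup and the `post-1` id are invented for the example, so the real pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup; the class mirrors the one searched for above.
ARTICLE_HTML = """
<article id="post-1">
  <div class="content-area">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
    <p>   </p>
  </div>
</article>
"""

soup = BeautifulSoup(ARTICLE_HTML, "html.parser")

# All non-empty paragraph texts inside the content area.
paragraphs = [
    p.get_text(strip=True)
    for p in soup.select("div.content-area p")
    if p.get_text(strip=True)
]

# select_one returns the first match or None.
# nth-of-type(2) picks the second <p> among its sibling <p> elements.
second = soup.select_one("#post-1 > div > p:nth-of-type(2)")
second_text = second.get_text(strip=True) if second else None

print(paragraphs)
print(second_text)
```

Note that `p:nth-of-type(2)` matches the second `<p>` among its siblings, whereas `p:nth-child(2)` only matches a `<p>` that is the second child of any type, so the two selectors can return different elements on the same page.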


Deeppp
