파이썬 웹 스크래핑 (BeautifulSoup)

import pandas as pd
import numpy as np
import seaborn as sns
import ssl
from bs4 import BeautifulSoup
import requests
import re

ssl._create_default_https_context = ssl._create_unverified_context

# 미국 ETF 리스트 가져오기
url = "https://en.wikipedia.org/wiki/List_of_American_exchange-traded_funds"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
rows = soup.select('div > ul > li')

etfs = {}
for row in rows:
    # print(row)
    try:
        etf_name = re.findall('^(.*) \(NYSE', row.text)
        etf_market = re.findall('\((.*)\|', row.text)
        etf_ticker = re.findall('NYSE Arca\|(.*)\)', row.text)

        # print(etf_name, etf_market, etf_ticker)

        if (len(etf_ticker) > 0) & (len(etf_market) > 0) & (len(etf_name) > 0):
            etfs[etf_ticker[0]] = [etf_market[0], etf_name[0]]

    except AttributeError as err:
        pass

# etfs 딕셔너리 출력
print(etfs)
print('\n')

# etfs 딕셔너리를 데이터프레임으로 변환
df = pd.DataFrame(etfs)
print(df)

https://en.wikipedia.org/wiki/List_of_American_exchange-traded_funds

List of American exchange-traded funds - Wikipedia

From Wikipedia, the free encyclopedia Jump to navigation Jump to search This is a table of notable American exchange-traded funds, or ETFs. As of 2015, the number of exchange-traded funds worldwide is over 4000,[1] representing about 2.88 trillion U.S. dol

en.wikipedia.org

이 페이지에서 "Stock ETFs" 영역을 읽어오는 코드이다. "div > ul > li" 태그를 조건으로 해당 영역 html 태그를 라인단위로 읽어오고 "re.findall('^(.*) \(NYSE', row.text)" 함수를 이용해서 정규식으로 원하는 단위로 잘라낼 수 있다. 이 샘플코드를 그대로 활용하면 다양한 목적으로 써먹을수 있을 것 같다.

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

먼훗날

파이썬 웹 스크래핑 (BeautifulSoup)

티스토리툴바

개인정보

단축키

내 블로그

블로그 게시글

모든 영역

파이썬 웹 스크래핑 (BeautifulSoup)

'Pandas' Related Articles

티스토리툴바

개인정보

단축키

내 블로그

블로그 게시글

모든 영역